agnes-the-ai-analyst/CLAUDE.md
minasarustamyan 4fb2818a19
Add /marketplace browse page + Model B opt-in stack composition (#230)
* Add /marketplace browse page + Model B opt-in stack composition

New /marketplace browse surface unifies the curated marketplaces
(admin-managed git mirrors) and the community Flea Market behind
three tabs — Curated / Flea / My Stack — with per-tab category
filter, search across both sources with scope checkboxes, and
numeric pagination, all driven by URL query state. Plugin detail
at /marketplace/curated/<slug>/<plugin> and /marketplace/flea/<id>;
nested skill / agent detail at /marketplace/curated/<slug>/<plugin>/
{skill,agent}/<name> and the flea-side single-page detail.

Model B opt-in: an RBAC grant on a curated plugin is now only
*eligibility*. The user must click "Add to my stack" for it to
enter their served Claude Code marketplace. Composition flips
from (rbac ∖ opt_outs) ∪ store_installs to
(rbac ∩ subscriptions) ∪ store_installs. The legacy
user_plugin_optouts table is renamed user_curated_subscriptions
(schema v27) — same table shape, inverted semantic, repository
methods become subscribe / unsubscribe / is_subscribed.

UX vocabulary: Install → Add to my stack, Installed → In your
stack, card "Installed" badge → "In stack" (amber pill), tab
"My Subscriptions" → "My Stack". Bridges the two-step model
(server-side bookmark vs. on-laptop install) the previous label
hid. Click triggers an inline post-add hint panel under the
description with the agnes refresh-marketplace recipe + Copy
chip, dismissible per-browser via localStorage.

Per-tab info blocks above the filter row:
- Curated: trust signal — "Each plugin here has a named curator
  accountable for it." (blue accent + See-all-curators link)
- Flea: open-shelf signal — "Anyone in the company can upload
  here." (purple accent + Tips-for-sharing link)
- My Stack: personal-shelf orientation — "Your AI stack —
  everything you've added." (slate accent, no link)

Tabs carry per-tab Heroicons (shield-check / building-storefront
/ rectangle-stack) tinted to match each tab's accent; flips white
when the tab is active for contrast.

Hero illustration anchored to the right of the blue hero panel
(absolute, 47% wide, behind the search row content). Hidden
under 900px viewport.

Action-row CTAs realigned to publication intent: curated
"How to add new content" → "Submit a plugin" (links to the
guide page); flea button removed since +Upload sits next to it.
Empty-state CTAs match. /marketplace/guide/{curated,flea}
routes now host publication-flow guide pages with placeholder
ledes — full copy to be authored separately.

Categories: Heroicons-based icons mapped per category in
src/category_icons.py (zero new dependencies; SVG path strings
inlined). Marketplace cards, filter pills, and detail pages
read from the same source.

API endpoints under /api/marketplace:
- GET /items per-tab listing (curated / flea / my)
- GET /categories per-tab non-zero counts
- GET /curated/{slug}/{plugin} plugin detail
- POST/DELETE /curated/{slug}/{plugin}/install subscribe toggle
- GET /curated/{slug}/{plugin}/{skill,agent}/{name} inner item
The tab=my branch reads directly from
user_curated_subscriptions ∪ user_store_installs (not
resolve_user_marketplace, which bundles flea skills/agents into
a single store-bundle synthetic entry useful for serving the
Claude Code marketplace ZIP/git but wrong for browsing where
each item should appear as its own card).

Detail pages: plugin detail surfaces inner skills/agents as
clickable nested cards; commands/hooks/MCPs render as plain
name lists. Skill/agent detail mirrors the plugin layout with
kind-tinted accents (skill = green, agent = purple), Description
+ Details sidebar, Files + Docs sections, and the "How to call
it" copy-able invocation chip showing /<plugin>:<inner-name>
exactly as Claude Code namespaces it post-install. Curated
nested has no install button — links back to the parent plugin.

Navbar: standalone "My AI Stack" relabelled "My Stack" and
points at /marketplace?tab=my; "Store" link removed (Store
flow is reachable via the Flea Market tab's +Upload button).
The standalone /my-ai-stack and /store routes still work for
old bookmarks.

Tests cover the new browse / categories / install / RBAC paths
under tests/test_marketplace_api.py; existing marketplace and
store tests updated for Model B (explicit subscribe in fixtures).
Schema bumped v26 → v27 with idempotent migration that wipes
existing user_plugin_optouts rows on flip and adds
marketplace_plugins.created_at with registered_at backfill.

* Fix v28 migration + post-rebase test fallout

v28 ALTER TABLE marketplace_plugins ADD COLUMN created_at conflicted with
_SYSTEM_SCHEMA's earlier CREATE that already includes the column on fresh
installs (test fixtures starting at any pre-v28 version trip on it).
Switch to ADD COLUMN IF NOT EXISTS — same idiom as the upstream v27
Keboola sync-strategy migration on the same ladder.

Two test patches needed after the rebase bumped SCHEMA_VERSION 27 → 28:
- test_keboola_v27_migration.py: test_schema_version_constant_is_27 was
  pinning ==27. Loosened to >=27 (the test's purpose is to verify the
  v27 Keboola migration, not to pin the current SCHEMA_VERSION).
- test_setup_page_unified.py: was monkeypatching resolve_allowed_plugins
  but compute_default_agent_prompt now reads from resolve_user_marketplace
  (Model B-aware). Stub the right function so the test exercises the
  v28 served-set path.

* Harden curated skill/agent inner endpoints against path traversal

`_read_inner`, the `skill_dir` walk in `curated_skill_detail`, and the
`agent_path.stat` in `curated_agent_detail` joined URL path-params onto
`plugin_root` without verifying the resolved candidate stayed inside it.
Starlette's `[^/]+` on `{skill_name}` / `{agent_name}` blocks the direct
URL exploit (encoded `/` 404s before the handler), but a curator-planted
symlink inside a curated marketplace's git mirror could still dereference
outside the plugin tree on read.

Adds `_safe_join(plugin_root, *parts)` doing
`Path.resolve(strict=True)` + `relative_to(plugin_root.resolve())`, used
by all three call sites so the boundary is enforced once and consistently.
Tests cover the helper directly (normal path resolves, escaping `..`
returns None, escaping symlink returns None, missing file returns None)
plus an end-to-end check that the symlink case actually 404s on the
HTTP endpoint. Symlink tests skip on Windows where symlink creation
needs elevated permissions; they run on Linux CI.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
2026-05-08 14:22:19 +02:00

531 lines
32 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# AI Data Analyst
Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.
## First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
### Step 1: Gather Information
Ask the user for:
1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / bigquery / csv
3. Instance name (e.g., "Acme Data Analyst")
### Step 2: Generate Configuration
1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`
### Step 3: Register Tables
1. Use the FastAPI admin API (`POST /api/admin/register-table`, then `PUT /api/admin/registry/{id}` for updates) or webapp UI to register tables
2. Tables are stored in DuckDB `table_registry` with source_type, bucket, source_table, query_mode
3. For migration from old format: `python scripts/migrate_registry_to_duckdb.py`
### Step 4: Docker Deployment
```bash
docker compose up # Start app + scheduler
docker compose --profile full up # Include telegram bot
# HTTPS mode — Caddy + corporate-CA certs at /data/state/certs
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
--profile tls up -d
```
See `docs/DEPLOYMENT.md`**TLS** for cert provisioning + `scripts/ops/agnes-tls-rotate.sh` (daily refetch from `TLS_FULLCHAIN_URL`, `SIGUSR1` reload on diff, no-op when unchanged). The infra repo's `startup.sh` installs this as a systemd timer automatically.
## Project Structure
```
├── src/ # Core engine
│ ├── db.py # DuckDB schema (system.duckdb, analytics.duckdb)
│ ├── orchestrator.py # SyncOrchestrator — ATTACHes extract.duckdb files
│ ├── repositories/ # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│ ├── profiler.py # Data profiling
│ └── catalog_export.py # OpenMetadata catalog export
├── app/ # FastAPI application
│ ├── main.py # App setup, router registration
│ ├── api/ # REST API (sync, data, catalog, admin, auth)
│ └── web/ # HTML dashboard routes
├── connectors/ # Data source connectors (extract.duckdb contract)
│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
├── cli/ # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── app/auth/ # Authentication (FastAPI-based providers)
├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/ # Legacy deployment infrastructure
├── scripts/ # Utility + migration scripts
├── config/ # Configuration templates (instance.yaml.example)
├── docs/ # Documentation + metric YAML definitions
└── tests/ # Test suite (633 tests)
```
## Architecture: extract.duckdb Contract
Every data source produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
```
### Remote table support (`_remote_attach`)
Extractors with remote/passthrough tables (query_mode='remote') include a `_remote_attach` table
in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:
```sql
CREATE TABLE _remote_attach (
alias VARCHAR, -- DuckDB alias used in views, e.g. 'kbc'
extension VARCHAR, -- Extension name, e.g. 'keboola'
url VARCHAR, -- Connection URL
token_env VARCHAR -- Env-var name holding the auth token, OR empty for
-- extensions with built-in auth (e.g. BigQuery uses the
-- GCE metadata server — see `connectors/bigquery/auth.py`).
);
```
The orchestrator reads this table, installs/loads the extension, fetches the token
(via `token_env` lookup, or via the extension-specific auth path when `token_env=''`),
creates a session-scoped DuckDB SECRET when the extension requires one (BigQuery), and
ATTACHes the external source. Views referencing `kbc."bucket"."table"` then resolve correctly.
This mechanism is generic — any connector can plug in.
The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into master `analytics.duckdb`, and creates views.
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Keboola │ │ BigQuery │ │ Jira │
│ extractor │ │ extractor │ │ webhooks │
│ (DuckDB ext) │ │ (remote BQ) │ │ (incremental)│
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
extract.duckdb extract.duckdb extract.duckdb
+ data/*.parquet (views → BQ) + data/*.parquet
│ │ │
└─────────────────┼─────────────────┘
SyncOrchestrator.rebuild()
ATTACH → master views in analytics.duckdb
┌──────────┼──────────┐
▼ ▼ ▼
FastAPI CLI
(serve) (agnes pull)
```
Source modes:
- **Batch pull** (Keboola, `query_mode='local'`): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery, `query_mode='remote'`): DuckDB BQ extension, no download, queries go to BQ
- **Materialized SQL** (BigQuery, `query_mode='materialized'`): scheduler runs admin-registered SQL through DuckDB BQ extension (via `BqAccess` from `connectors/bigquery/access.py`) and writes the result to `/data/extracts/bigquery/data/<id>.parquet`. Distributed via the same manifest + `agnes pull` flow as Keboola tables. Cost guardrail via `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable — YAML `null` falls through to the default).
- **Real-time push** (Jira): Webhooks update parquets incrementally
## Configuration
Instance-specific config: `config/instance.yaml` (see example).
Environment variables: `.env` (never committed).
Table definitions: DuckDB `table_registry` table in `system.duckdb`.
## Development
```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"
# Run FastAPI locally
uvicorn app.main:app --reload
# Run tests
pytest tests/ -v
# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger
# Docker
docker compose up
```
### Local sync & Claude Code hooks
`agnes pull` is the canonical analyst-side distribution path: pulls the RBAC-filtered manifest from the server, downloads parquets whose MD5 changed (skipping `query_mode='remote'` rows), rebuilds local DuckDB views over them. `agnes push` mirrors it for the upload direction (sessions, CLAUDE.local.md).
`agnes init` writes two hooks into `<workspace>/.claude/settings.json`:
- `SessionStart``agnes pull --quiet` — pulls fresh parquets at the start of every Claude Code session
- `SessionEnd``agnes push --quiet` — uploads session jsonl + `CLAUDE.local.md` to the server
Both pass `--quiet` so they don't pollute Claude Code stdout, and trail with `|| true` so a server outage never blocks a session. Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace, not in unrelated sessions on the same machine.
Admin RBAC for auto-sync: `query_mode IN ('local', 'materialized')` plus a `resource_grants` row for one of the analyst's groups → table appears in their manifest → `agnes pull` downloads it. No per-user sync config; the admin layer is the single source of truth.
## Business Metrics
Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:
```bash
agnes admin metrics import docs/metrics/
```
### For AI agents analyzing data:
Before computing any business metric, look up the canonical definition:
1. `agnes catalog --metrics` — find the relevant metric
2. `agnes catalog --metrics --show revenue/mrr` — read the SQL and business rules
3. Use the SQL from the metric definition, adapt to the specific question
Never invent metric calculations — always use the canonical definitions.
## Querying Agnes data — agent rails
When asked about ANY data in Agnes, follow this protocol.
### Discovery first
Before writing ANY query against a table, run:
agnes catalog --json | jq <filter> # know what's available
agnes schema <table> # learn columns + types
agnes describe <table> -n 5 # see real values for shape
NEVER write `SELECT * FROM <table>` blindly. For local-mode tables it's
wasteful; for remote-mode tables it can blow up at 225M rows.
### Choose the right tool
Tables in `agnes catalog` have a `query_mode`:
- **`local`**: data is on the laptop as parquet (synced via `agnes pull`).
Query directly with `agnes query "SELECT … FROM <table>"`.
- **`remote`** (typically BigQuery): the parquet does NOT exist on the laptop.
You MUST either:
1. **`agnes snapshot create`** a filtered subset → query the local snapshot, OR
2. **`agnes query --remote`** for one-shot server-side execution. Works on
all `query_mode='remote'` rows regardless of upstream BQ entity type
(BASE TABLE → Storage Read API with predicate pushdown; VIEW /
MATERIALIZED_VIEW → BQ jobs API, no pushdown). Cost-guarded by a
5 GiB scan cap (configurable in /admin/server-config). Direct
`bq."<dataset>"."<table>"` paths are registry-gated — unregistered
paths return 403 `bq_path_not_registered`.
3. **`agnes query --register-bq`** for hybrid joins (rarely needed).
### `agnes snapshot create` workflow (preferred for remote tables)
# 1. estimate first
agnes snapshot create web_sessions_example \
--select event_date,country_code,session_id \
--where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND country_code = 'CZ'" \
--estimate
# → "estimated_scan_bytes: 4.2 GB, result: ~250k rows, 12 MB locally"
# 2. if reasonable, fetch
agnes snapshot create web_sessions_example ... --as cz_recent
# 3. query the local snapshot
agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
### Heuristics for `agnes snapshot create`
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when:
- You're not sure of the data shape
- The table has `partition_by` or `clustered_by` set (per `agnes schema`)
- The fetch could plausibly exceed 1 GB local bytes
- Reuse `agnes snapshot list` before fetching — if a snapshot covers your
query already, skip the fetch.
### BigQuery SQL flavor for `--where`
For `source_type=bigquery` (per `agnes catalog`):
- Date literal: `DATE '2026-01-01'` (NOT `'2026-01-01'::date`)
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- NULL: `col IS NOT NULL` (standard)
- Cast: `CAST(x AS INT64)` (NOT `INT`)
For `source_type=keboola` / `source_type=jira` (local), use DuckDB SQL flavor
in your `agnes query` calls — there's no `--where` on local since fetch is implicit.
### Snapshot hygiene
- Reuse snapshots across questions in the same conversation.
- Use descriptive names: `cz_recent`, `orders_q1_us`, `sessions_today`.
- Drop with `agnes snapshot drop <name>` when done with a topic.
- `agnes disk-info` to see total cache size.
### When NOT to use `agnes snapshot create`
- Single aggregate on remote BASE TABLE (`SELECT COUNT(*) FROM remote`):
use `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"`.
Storage Read API pushes the COUNT into BQ — cheap, no materialization.
- Single aggregate on remote VIEW/MATERIALIZED_VIEW: same syntax works
(#160), but the BQ jobs API can't push WHERE/COUNT into the view body.
Cost guardrail (default 5 GiB) catches expensive scans → 400
`remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to
`agnes snapshot create <id> --where '<predicate>'` if the cap is hit.
- Throwaway exploration: `agnes query --remote "SELECT … FROM <registered_id>"`.
Direct `bq."<dataset>"."<table>"` paths are now registry-gated — register
first or use the catalog id.
- Cross-table JOIN with both tables remote: combine `agnes snapshot create` for one
side + `agnes query --remote` for the other; full cross-remote JOIN
requires more thought (see #101 for design space).
## Marketplace Repositories
Admin-managed git repos cloned nightly to `${DATA_DIR}/marketplaces/<slug>/`
so FastAPI can read their contents from disk.
- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
- Scheduler calls `POST /api/marketplaces/sync-all` (admin-only, authed via `SCHEDULER_API_TOKEN`) at `daily 03:00` UTC. Routing through HTTP keeps the app the sole writer to `system.duckdb` — the previous in-process call from the scheduler container raced the app's long-lived DB handle and 500-ed on `Could not set lock on file`.
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
- Registry lives in DuckDB table `marketplace_registry` (schema v9).
- After each successful sync, `src/marketplace.py` parses `.claude-plugin/marketplace.json`
from the cloned repo and caches the plugin list in `marketplace_plugins`
(keyed by `(marketplace_id, plugin_name)`).
- `src/marketplace.py` handles clone/fetch/reset with token redaction in any surfaced error message.
## Access control (v13)
Two layers, no role hierarchy. Full reference: [`docs/RBAC.md`](docs/RBAC.md).
- `user_groups` — named groups. Two seeded as `is_system=TRUE` at startup:
`Admin` (god-mode short-circuit on every authorization check) and
`Everyone` (auto-membership for every user).
- `user_group_members``(user_id, group_id, source)`. `source ∈
{admin, google_sync, system_seed}` so each writer only manipulates its own
rows; Google's nightly DELETE+INSERT does not clobber admin-added members.
- `resource_grants` — generic `(group, resource_type, resource_id)` triple.
Replaces `plugin_access` from v12; the same shape now covers any future
entity-scoped grant (datasets, knowledge categories, …).
Resource types are an `app.resource_types.ResourceType` `StrEnum` paired with
a `ResourceTypeSpec` registered in `RESOURCE_TYPES` — adding a new one is one
enum member, one `list_blocks(conn)` delegate (projects domain tables into the
`(block → items)` shape the /admin/access tree renders), and one spec entry.
No DB migration, no second wiring step. Endpoints gate with either
`require_admin` (app-level) or `require_resource_access(ResourceType.X,
"{path}")` (entity-level), both from `app.auth.access`.
Admin UI: `/admin/access`. CLI: `agnes admin group {list,create,delete,members,
add-member,remove-member}` and `agnes admin grant {list,create,delete}`.
## Claude Code marketplace endpoint
Agnes serves a single aggregated Claude Code marketplace over two channels,
both gated by PAT auth and filtered per caller:
- `GET /marketplace.zip` — deterministic ZIP download with `ETag` /
`If-None-Match` (304 when content unchanged). Consumed by a client-side
SessionStart hook.
- `GET /marketplace.git/*` — git smart-HTTP (dulwich via a2wsgi). Registered
in Claude Code once, then Claude Code owns the clone/fetch cycle.
Auth: ZIP uses `Authorization: Bearer <PAT>`. Git uses HTTP Basic where the
password field carries the PAT (`https://x:<PAT>@host/marketplace.git/`) —
git CLI does not speak Bearer.
Content: filtered via `src.marketplace_filter.resolve_allowed_plugins` which
joins `resource_grants ↔ marketplace_plugins` (matching
`mp.marketplace_id || '/' || mp.name = rg.resource_id`) scoped to the
caller's `user_group_members`. Admin is treated as a regular group here —
no god-mode shortcut for the marketplace feed, so admins curate their own
view by granting plugins to the Admin group (or any group they belong to).
On-disk layout in the served ZIP / git tree uses a slug-prefixed directory
(`plugins/<slug>-<plugin>/`) so two marketplaces shipping a same-named
plugin don't overwrite each other's files. The synth marketplace.json's
`name` field, however, is the plugin's authoritative name from its own
`.claude-plugin/plugin.json` (with a fallback to the upstream
marketplace.json `name`) — Claude Code's `/plugin` UI resolves a loaded
plugin back to its catalog entry by `plugin.json` name, so the catalog
entry's `name` must match. Same-named plugins from two upstream
marketplaces therefore collide in the catalog by design; admin RBAC
(which grants survive the filter) decides which one wins, identical to
how Claude Code behaves when a user adds two upstream marketplaces with
overlapping plugin names directly. `/marketplace/info` exposes both
`name` and `prefixed_name` so operators can disambiguate.
Cache: content-addressed bare repos at `${DATA_DIR}/marketplaces/git-cache/`
keyed by sha256(filtered content). Two users with the same RBAC view share
one repo; content change → new repo next to the old one. No TTL / prune yet.
User registration inside Claude Code:
```
# ZIP channel (typically via a SessionStart hook that unpacks into ./marketplace/)
curl -H "Authorization: Bearer $AGNES_PAT" https://agnes.example.com/marketplace.zip
# Git channel — one-time registration. Two paths; pick the first that works.
# (a) Direct registration — preferred when it works.
/plugin marketplace add https://x:$AGNES_PAT@agnes.example.com/marketplace.git/
# (b) Two-step fallback — required when (a) fails. Bun-compiled `claude` on
# macOS / Windows ignores the OS trust store and CA env vars on the
# marketplace HTTPS path, so direct add can fail with TLS errors against
# a private-CA Agnes instance even when system tools work fine. System
# `git` honors GIT_SSL_CAINFO + the OS trust store, so cloning manually
# and pointing Claude Code at the local clone sidesteps the Bun TLS path
# entirely.
git clone https://x:$AGNES_PAT@agnes.example.com/marketplace.git/ ~/agnes-marketplace
claude plugin marketplace add ~/agnes-marketplace
# Optional hardening: strip the PAT from the cloned repo's origin so it
# doesn't sit in plaintext at ~/agnes-marketplace/.git/config — re-clone via
# the dashboard's setup flow when the PAT rotates.
git -C ~/agnes-marketplace remote set-url origin https://agnes.example.com/marketplace.git/
```
The dashboard-served setup payload (see `app/web/setup_instructions.py`) already
branches between (a) and (b) automatically based on platform when a private CA
is in play. The block above is the manual equivalent for users registering
outside that flow (e.g. operators bringing up a new instance, or
analysts whose first attempt failed and need to retry by hand).
## Hybrid Queries (BigQuery + Local)
For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:
```bash
agnes query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
--register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
```
The `--register-bq` flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple `--register-bq` flags can be used for multiple BQ sources.
For complex SQL, use stdin mode:
```bash
echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | agnes query --stdin
```
## Extensibility
### Data Sources (extract.duckdb contract)
New connector = `connectors/<name>/extractor.py` producing `extract.duckdb + data/`.
Must create `_meta` table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode.
Orchestrator ATTACHes it automatically.
### Authentication
Auth providers in `app/auth/` (FastAPI-based):
- **Google**: OAuth via Google (Workspace group memberships pulled at sign-in — see `docs/auth-groups.md` for the GCP setup checklist + the `security` label gotcha)
- **Email**: Email magic link (itsdangerous token)
- **Desktop**: JWT for API
### RBAC
See **[Access control (v13)](#access-control-v13)** above and [`docs/RBAC.md`](docs/RBAC.md) for the full reference. TL;DR for module authors: gate endpoints with `Depends(require_admin)` for app-level mutations or `Depends(require_resource_access(ResourceType.X, "{path}"))` for entity-scoped grants. Add a new resource type by extending the `ResourceType` `StrEnum` and registering a `ResourceTypeSpec` (with a `list_blocks` projection delegate) in `app/resource_types.py`.
## Release & deploy workflows
Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.
### `release.yml` — auto-build on every push
Runs on **every** push to **every** branch.
- Push to `main``:stable`, `:stable-YYYY.MM.N` (CalVer).
- Push to non-main `<prefix>/<branch>``:dev`, `:dev-YYYY.MM.N`, `:dev-<branch-slug>`, and (when prefix isn't a Git Flow convention) `:dev-<prefix>-latest` alias.
VMs that pin to a floating tag (`:dev`, `:dev-<prefix>-latest`) auto-upgrade within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins, regardless of who).
### `keboola-deploy.yml` — tag-triggered, explicit deploy only
Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
- `:keboola-deploy-<git-tag-suffix>` — immutable, tied to the exact commit
- `:keboola-deploy-latest` — floating alias the consumer pins to
**Operator workflow:**
```bash
git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM
```
Use this when the consumer (e.g. a customer dev VM) needs **deploy-when-I-decide** semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins `image_tag = "keboola-deploy-latest"` on the relevant VM.
### Module versioning
The customer-instance Terraform module under `infra/modules/customer-instance/` is published as `infra-vMAJOR.MINOR.PATCH` git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their `source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y"`.
After merging a module change to `main`:
```bash
git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z
```
### Replacing a VM after a startup-script change
Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `google_compute_instance.vm` so normal `terraform apply` doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address — typical workflow_dispatch input is `recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'`.
## Key Implementation Details
### DuckDB Schema (src/db.py)
- Schema v28 with auto-migration v1→…→v28 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)**, **v21 adds `welcome_template` singleton table backing the Agent Setup Prompt admin override (`/admin/agent-prompt`)**, **v22 reserves the `setup_banner` table — feature dropped mid-development; table retained for forward compatibility with already-migrated instances**, **v23 adds `claude_md_template` singleton table backing the Agent Workspace Prompt admin override (`/admin/workspace-prompt`)**, **v24 rewrites materialized BQ `source_query` from DuckDB-flavor `bq."ds"."t"` to BQ-native `` `<project>.ds.t` `` so the new wrapping path accepts them; idempotent + warns when project unconfigured**, **v25 adds `store_entities` + `user_store_installs` + `user_plugin_optouts` backing the `/store` and `/my-ai-stack` pages — the served marketplace is now `(admin_granted opt_outs) store_installs`**, **v26 unifies Keboola `query_mode='local'` rows into `'materialized'` — the old local mode (DuckDB Keboola extension's COPY through QueryService) is replaced by the new Storage API export-async path which works regardless of project flags; existing `local` Keboola rows are flipped, NULL `source_query` means full-table export**, **v27 adds 7 columns to `table_registry` for Keboola per-table sync-strategy support: `incremental_window_days`, `max_history_days`, `incremental_column`, `where_filters`, `partition_by`, `partition_granularity`, `initial_load_chunk_days`. Layered on top of v26: admins can opt specific tables back to `query_mode='local'` (via the Direct extract Edit-modal radio) to enable the new dispatcher. The pre-existing `sync_strategy` column (default `'full_refresh'`) is reused — pre-v27 it was inert catalog metadata; post-v27 the Keboola extractor dispatches off it (`full_refresh` | `incremental` | `partitioned`). All new columns NULL on existing rows; meaningful only when paired with the matching strategy.**, **v28 introduces explicit-install (Model B) for curated marketplace plugins — served set flips from `(rbac user_plugin_optouts)` to `(rbac ∩ subscriptions)`. The `user_plugin_optouts` table+columns are reused (no DDL rename) so existing operator instances skip migration churn; row PRESENCE flips meaning from "excluded" to "subscribed", and the migration wipes existing rows so the inverted reading starts from a clean baseline. Also adds `marketplace_plugins.created_at` (per-plugin newest-first sort on /marketplace), backfilled from parent `marketplace_registry.registered_at` so existing plugins get a sensible date until the next sync overwrites with `CURRENT_TIMESTAMP`.** — see CHANGELOG and docs/RBAC.md)
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- `sync_state`, `sync_history`: track extraction progress
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
- System DB at `{DATA_DIR}/state/system.duckdb`
- Analytics DB at `{DATA_DIR}/analytics/server.duckdb`
### SyncOrchestrator (src/orchestrator.py)
- `rebuild()`: scans extracts dir, ATTACHes all, creates master views, updates sync_state
- `rebuild_source(name)`: single source (used after Jira webhooks)
- Thread-safe via `_rebuild_lock`
### Connector Pattern
- **Keboola**: `connectors/keboola/extractor.py` uses DuckDB Keboola extension, fallback to `client.py`
- **BigQuery**: `connectors/bigquery/extractor.py` uses DuckDB BQ extension (remote-only, no download)
- **Jira**: `connectors/jira/webhook.py``incremental_transform.py``extract_init.py` updates `_meta`
- `connectors/keboola/client.py`: legacy Keboola Storage API wrapper (kept as fallback)
### Config Loading
1. `config/loader.py` loads `instance.yaml`
2. `app/instance_config.py` exposes `get_data_source_type()`, `get_value()`
3. Table config lives in DuckDB `table_registry` (not markdown files)
### Files NOT to modify (stable infrastructure)
- `connectors/jira/file_lock.py` - Advisory file locking
- `connectors/jira/transform.py` - Core Jira transform logic
- `services/ws_gateway/` - WebSocket notification gateway
## Vendor-agnostic OSS — no customer-specific content
This repo is the public OSS distribution. **Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies.** That includes:
- Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
- Cloud project IDs, internal hostnames, runbook paths from a particular install (`/opt/<deployment>`, `<host>.<internal-domain>`, `prj-<org>-…`, internal SA emails).
- Cross-references to private repos (`<private-org>/<private-repo>#NN`). Describe the integration in generic terms or link to public examples instead.
When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (`example.com`, `<your-host>`, `<install-dir>`). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples — not hard-coded assumptions.
Customer-specific automation, hostnames, and identities live in private infra repos that *consume* this OSS. The OSS describes capabilities, defaults, and configuration knobs — not how a specific operator wired them up.
## Changelog discipline — non-negotiable
**Every PR that adds, removes, or changes user-visible behavior MUST update `CHANGELOG.md` in the same PR.** No exceptions, no follow-ups, no "I'll do it after merge". User-visible = anything an operator, end-user, or downstream integrator can observe: CLI flags / output / exit codes, REST endpoints / payloads / status codes, web UI, `instance.yaml` schema, env vars, `extract.duckdb` contract, Docker / compose / Caddyfile knobs, default behaviors, breaking changes, security fixes.
**How:**
- Add a bullet under the topmost `## [Unreleased]` heading (create one if missing — it sits above the latest released version).
- Group by `### Added` / `### Changed` / `### Fixed` / `### Removed` / `### Internal` (Keep-a-Changelog sections).
- Mark breaking changes with `**BREAKING**` at the start of the bullet — operators grep for that string before bumping the pin.
- Reference the relevant doc/runbook if one exists (e.g. `see docs/auth-groups.md`), don't restate it.
- Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under `### Internal` — still log them, just keep them terse.
**When you cut a release:**
- Rename `## [Unreleased]``## [X.Y.Z] — YYYY-MM-DD`.
- Append a new empty `## [Unreleased]` section at the top so the next PR has somewhere to land.
- Bump `version` in `pyproject.toml` to match `X.Y.Z`.
- Tag the merge commit as `vX.Y.Z` and push the tag.
**If you find yourself opening a PR without a CHANGELOG entry, stop and add one before requesting review.** Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.
## Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs
- Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (`grep -niE '<token1>|<token2>|...'`). If anything matches, generalize or remove it.