feat(workspace-prompt): decision tree + size-hint so analyst Claude gets it right first try

Three concrete changes addressing the "analyst Claude misuses the CLI"
class of bugs (image.png table — issues #3, #5, plus the recurrent
"how big is this table" guesswork):

1. config/claude_md_template.txt — the template agnes init writes to
   <workspace>/CLAUDE.md. Surfaces every catalog-row field with a why,
   adds a query_mode-based decision tree, explicit --estimate scoping
   (snapshot create ONLY — was the #1 first-try error), an agnes fetch
   → agnes snapshot create rename note, and a 6-row failure-mode table
   that maps each common error wording to its right next step.

2. app/api/v2_catalog.py — populate rough_size_hint for local +
   materialized rows from the on-disk parquet size, bucketed
   small/medium/large/very_large. Was hardcoded null with a TODO; AI
   couldn't tell "is this 6.8 GB" without a failed --remote round-trip.

3. cli/update_check.py — the [update] banner survived the da→agnes
   rename and printed "[update] da X is out of date" on every command,
   training analysts to associate the binary with the old name.

Verified by rendering the template against representative contexts
(33/33 tests pass) and running every use case from the original
screenshot through the real CLI against a dev VM.
This commit is contained in:
ZdenekSrotyr 2026-05-05 16:38:32 +02:00
parent 2ae486bc5d
commit 30e81a15b9
4 changed files with 114 additions and 13 deletions

View file

@ -15,6 +15,11 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
- **Per-user parallel parquet downloads in `agnes pull`** — the download loop in `cli/lib/pull.py` now uses a `ThreadPoolExecutor` with concurrency capped by the new `AGNES_PULL_PARALLELISM` env var (default 4, set 1 to restore pre-PR serial behavior). On a registry of N tables the wall-clock time drops from `Σ stream_download_seconds(table_i)` to roughly `max × ceil(N/4)`. Works hand-in-hand with the Caddy `file_server` change below: without it parallel client-side downloads would still queue on the single uvicorn worker; with it each request is its own caddy goroutine + sendfile, so 4-way parallelism actually delivers throughput. Per-table error semantics preserved — a failure on one table no longer aborts the rest of the batch. - **Per-user parallel parquet downloads in `agnes pull`** — the download loop in `cli/lib/pull.py` now uses a `ThreadPoolExecutor` with concurrency capped by the new `AGNES_PULL_PARALLELISM` env var (default 4, set 1 to restore pre-PR serial behavior). On a registry of N tables the wall-clock time drops from `Σ stream_download_seconds(table_i)` to roughly `max × ceil(N/4)`. Works hand-in-hand with the Caddy `file_server` change below: without it parallel client-side downloads would still queue on the single uvicorn worker; with it each request is its own caddy goroutine + sendfile, so 4-way parallelism actually delivers throughput. Per-table error semantics preserved — a failure on one table no longer aborts the rest of the batch.
- **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile. The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it. - **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile. The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it.
- **Caddy `file_server` for parquet downloads**`GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/<id>.parquet`, `keboola/data/<id>.parquet`, `jira/data/<id>.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged. - **Caddy `file_server` for parquet downloads**`GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/<id>.parquet`, `keboola/data/<id>.parquet`, `jira/data/<id>.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged.
- **Workspace prompt: decision tree, common-mistakes callout, failure-mode dictionary** in `config/claude_md_template.txt` (the template `agnes init` writes to `<workspace>/CLAUDE.md`). Surfaces every catalog-row field analyst Claude should read before deciding which command to use (`query_mode`, `sql_flavor`, `where_examples`, `fetch_via`, `rough_size_hint`); explicitly binds `--estimate` to `agnes snapshot create` ONLY (was the most-failed first-try misuse — fails with `No such option: --estimate` on `agnes query`); calls out the `agnes fetch``agnes snapshot create` rename so stale-doc analysts don't run a non-command; documents the BQ permission model (server SA, not personal Google identity) and a 6-row failure-mode table mapping each common error wording to its cause + the right next step.
- **`rough_size_hint` populated for `local` + `materialized` catalog rows** in `GET /api/v2/catalog` (was hardcoded `null` with a "Task 8" TODO). Reads the parquet file size at `${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet` and buckets into `small` (≤100 MiB), `medium` (≤1 GiB), `large` (≤10 GiB), `very_large` (>10 GiB). `remote` rows stay `null` for now (size requires a BQ INFORMATION_SCHEMA call; tracked separately). Lets analyst Claude pick `agnes snapshot create` over `agnes query --remote` by inspecting `agnes catalog --json` rather than discovering size empirically via a failed `--remote` round-trip.
### Changed
- **CLI update-banner now says `agnes` instead of `da`** (`cli/update_check.py:format_outdated_notice`). The string `[update] da X is out of date` had survived the `da``agnes` CLI rename and was the most-visible stale identifier in the analyst-facing surface — every CLI command printed it on stderr when a newer wheel was available.
### Fixed ### Fixed

View file

@ -2,10 +2,12 @@
from __future__ import annotations from __future__ import annotations
from datetime import datetime, timezone from datetime import datetime, timezone
from pathlib import Path
from fastapi import APIRouter, Depends from fastapi import APIRouter, Depends
import duckdb import duckdb
from app.auth.dependencies import get_current_user, _get_db from app.auth.dependencies import get_current_user, _get_db
from app.utils import get_data_dir as _get_data_dir
from src.rbac import can_access_table from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache from app.api.v2_cache import TTLCache
@ -43,6 +45,52 @@ def _fetch_hint(table_id: str, source_type: str) -> str:
return "already local — query directly via `agnes query`" return "already local — query directly via `agnes query`"
# Coarse size buckets for `rough_size_hint`. Boundaries chosen so an analyst
# Claude can decide tool by inspection: anything `large` or worse implies
# `agnes snapshot create` over `agnes query --remote`. Numbers reflect the
# default `bq_max_scan_bytes` 5 GiB ceiling — at "large" you're already at
# half the per-query gate and a naive `--remote` is likely to refuse.
_SIZE_BUCKETS = (
(10 * 2**20, "small"), # ≤10 MiB
(100 * 2**20, "small"), # ≤100 MiB still small (analyst-laptop scale)
(1 * 2**30, "medium"), # ≤1 GiB
(10 * 2**30, "large"), # ≤10 GiB
)
def _bucket_size(byte_count: int) -> str:
for cap, label in _SIZE_BUCKETS:
if byte_count <= cap:
return label
return "very_large"
def _materialized_size_hint(table_id: str, source_type: str, query_mode: str) -> str | None:
"""Return a rough size bucket for a row whose data is on the server's
local filesystem (any `query_mode` that produces a parquet `local` and
`materialized`). Returns ``None`` for `remote` (size requires a BQ
INFORMATION_SCHEMA round-trip; tracked separately) and for tables whose
parquet hasn't been materialised yet so the AI gets ``null`` not a
misleading "small".
Layout matches the v2 extract.duckdb contract:
${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet
"""
if query_mode == "remote":
return None
if not source_type:
return None
try:
path = Path(_get_data_dir()) / "extracts" / source_type / "data" / f"{table_id}.parquet"
if not path.exists():
return None
return _bucket_size(path.stat().st_size)
except Exception:
# Filesystem stat() race / permissions / weird DATA_DIR — fall back
# to null rather than crash the whole catalog response.
return None
def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict: def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
rows = _table_rows_cache.get(_TABLE_ROWS_KEY) rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
if rows is None: if rows is None:
@ -66,7 +114,10 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
"sql_flavor": _flavor_for(r.get("source_type") or ""), "sql_flavor": _flavor_for(r.get("source_type") or ""),
"where_examples": _examples_for(r.get("source_type") or ""), "where_examples": _examples_for(r.get("source_type") or ""),
"fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""), "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
"rough_size_hint": None, # populated by Task 8 schema endpoint when called "rough_size_hint": _materialized_size_hint(
r["id"], r.get("source_type") or "",
r.get("query_mode") or "local",
),
}) })
return { return {

View file

@ -184,7 +184,7 @@ def format_outdated_notice(info: UpdateInfo) -> str:
literal string "None" into a copy-pasteable command drop the upgrade literal string "None" into a copy-pasteable command drop the upgrade
snippet in that case. snippet in that case.
""" """
msg = f"[update] da {info.installed} is out of date — latest on this server is {info.latest}." msg = f"[update] agnes {info.installed} is out of date — latest on this server is {info.latest}."
if info.download_url: if info.download_url:
msg += f" Upgrade: uv tool install --force {info.download_url}" msg += f" Upgrade: uv tool install --force {info.download_url}"
return msg return msg

View file

@ -58,15 +58,43 @@ Not every table is synced. Tables registered with `query_mode: "remote"` live in
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk. BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
Tables you don't see in `server/parquet/` may still be queryable. Tables you don't see in `server/parquet/` may still be queryable.
### Discovery first ### Discovery first — read `agnes catalog --json` BEFORE every cross-table decision
`agnes catalog --json` returns one row per table with these fields. Use them; don't guess:
| Field | What it tells you | How to use it |
|---|---|---|
| `query_mode` | `local` (parquet on laptop) / `remote` (BQ on demand) / `materialized` (synced parquet of a BQ result) | Picks the tool — see decision tree below |
| `source_type` | `keboola` / `bigquery` / `jira` | Determines SQL dialect |
| `sql_flavor` | `duckdb` for local sources, `bigquery` for `--remote` queries on BQ rows | What syntax `--where` expects |
| `where_examples` | 13 example WHERE predicates that are valid for this table's dialect | Copy as starting point for `--where` |
| `fetch_via` | Pre-formatted `agnes snapshot create …` template for this table | The canonical "how do I get a slice of this table" command |
| `rough_size_hint` | Coarse size hint (`small` / `medium` / `large` or null when unknown) | Bigger than `medium` → never `agnes query --remote` without a tight `--where`; use `agnes snapshot create` |
``` ```
agnes catalog --json | jq '.[] | {name, source_type, query_mode}' # see all tables + their modes agnes catalog --json # full structured view (use this in scripts)
agnes schema <table> # columns + types agnes catalog # human-readable summary
agnes describe <table> -n 5 # sample rows agnes schema <table> # columns + types (BIGQUERY/DUCKDB dialect printed in header)
agnes describe <table> -n 5 # sample rows (works on local & materialized only)
``` ```
For local-mode tables, query directly with `agnes query "SELECT … FROM <table>"`. ### Decision tree — pick the right tool BEFORE writing SQL
```
┌─ local → agnes query "SELECT ..."
agnes catalog → ─────┤
query_mode of <table> ├─ materialized → agnes query (parquet was synced by agnes pull)
│ (if missing locally, run `agnes pull` first)
└─ remote → choose by table size + query shape:
- one cheap probe (COUNT, schema-confirm, single agg ≤200s)
→ agnes query --remote "..."
- repeated questions on same slice / large scan
→ agnes snapshot create <table> --select ... --where ... --as <name>
then agnes query "SELECT ... FROM <name>"
- join with a local table
→ agnes query --register-bq "alias=BQ_SQL" --sql "..."
```
### Three patterns for `query_mode: "remote"` tables ### Three patterns for `query_mode: "remote"` tables
@ -76,13 +104,30 @@ For local-mode tables, query directly with `agnes query "SELECT … FROM <table>
| **`agnes query --remote`** | one-shot, server-side execution against BigQuery (works for BASE TABLE rows directly + VIEW/MATERIALIZED_VIEW rows via the BQ jobs API; cost-guarded by a 5 GiB scan cap configurable in /admin/server-config) | single aggregate / cheap probe | | **`agnes query --remote`** | one-shot, server-side execution against BigQuery (works for BASE TABLE rows directly + VIEW/MATERIALIZED_VIEW rows via the BQ jobs API; cost-guarded by a 5 GiB scan cap configurable in /admin/server-config) | single aggregate / cheap probe |
| **`agnes query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote | | **`agnes query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
### Permission model + cost — important ### Common mistakes — avoid on first try
- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself. - **`--estimate` is on `agnes snapshot create` ONLY.** Do NOT pass it to `agnes query` — fails with `No such option: --estimate`. The estimate flow is a snapshot-creation cost gate, not a query primitive.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS: - **Old `agnes fetch` / `da fetch` / `da query` references in stale docs** — the CLI is `agnes`; `agnes fetch` was renamed to `agnes snapshot create`. If you see those names, translate before running.
- filter via `--where` on the partition column (typically a date) - **Don't attempt personal GCP auth** if a BQ query fails with permission errors. BQ access uses the **server's service account**, not your Google identity — escalate to admin instead.
- list specific columns in `--select` — column-store BQ skips the rest, cheaper - **Don't `agnes query --remote "SELECT * FROM <large_table>"`** without a `--where`. Even if the scan-byte gate refuses, you've wasted the round-trip; gate yourself first by reading `rough_size_hint` and `where_examples` from `agnes catalog --json`.
- run `--estimate` first when unsure of the table size or partitioning
### Failure-mode dictionary — what each error means + the right response
| Error wording (substring) | Cause | Response |
|---|---|---|
| `Binder Error: Query execution exceeded the timeout. Job ID: ...` | BQ-side query took >~200 s wall-clock; the DuckDB BQ extension's `bq_query_timeout_ms` (default 90 s, server may bump to 600 s) elapsed | Narrow `--where` (especially partition column), drop unused columns from `--select`, or switch to `agnes snapshot create` to materialise once + query locally |
| `HTTP 400: remote_scan_too_large` | Server's `bq_max_scan_bytes` cost gate refused the query (default 5 GiB) | Tighten `--where`; consider `agnes snapshot create` so the cost is paid once, then local queries are free |
| `HTTP 401: ... unauthorized` | PAT expired or wrong | `agnes init --server-url ... --token <new-PAT>`; re-mint via the dashboard's "Personal Access Tokens" page |
| `HTTP 403: cross_project_forbidden` (with `serviceusage` mention) | Server SA lacks `serviceusage.services.use` on the BQ data project | Escalate to admin to set `data_source.bigquery.billing_project`; do NOT try personal auth |
| `ReadTimeout` (client-side) on `agnes query --remote` | CLI is older than 0.35.1 (had 30 s default) | `agnes --version`; if <0.35.1, upgrade with `uv tool install --force <wheel-from-server>` (the URL is in the `[update]` banner that prints on every command). Then retry. |
| `unknown columns: [...]` from `agnes snapshot create` | `--select` lists columns that don't exist | Run `agnes schema <table>` and copy column names verbatim |
### Cost discipline — every BQ query bills bytes scanned
A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
- filter via `--where` on the partition column (typically a date) — read `where_examples` in `agnes catalog --json`
- list specific columns in `--select` — column-store BQ skips the rest
- run `--estimate` first (only valid on `agnes snapshot create`) when the table is partitioned/clustered or when `rough_size_hint` is unknown
### `agnes snapshot create` discipline ### `agnes snapshot create` discipline