From 30e81a15b99ac918599ddd90ae810e103b902e74 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr
Date: Tue, 5 May 2026 16:38:32 +0200
Subject: [PATCH] feat(workspace-prompt): decision tree + size-hint so analyst Claude gets it right first try
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three concrete changes addressing the "analyst Claude misuses the CLI"
class of bugs (image.png table: issues #3, #5, plus the recurrent
"how big is this table" guesswork):

1. config/claude_md_template.txt: the template agnes init writes to
   /CLAUDE.md. Surfaces every catalog-row field with a why, adds a
   query_mode-based decision tree, explicit --estimate scoping
   (snapshot create ONLY; was the #1 first-try error), an
   agnes fetch → agnes snapshot create rename note, and a 6-row
   failure-mode table that maps each common error wording to its
   right next step.

2. app/api/v2_catalog.py: populate rough_size_hint for local +
   materialized rows from the on-disk parquet size, bucketed
   small/medium/large/very_large. Was hardcoded null with a TODO; AI
   couldn't tell "is this 6.8 GB" without a failed --remote round-trip.

3. cli/update_check.py: the [update] banner survived the da→agnes
   rename and printed "[update] da X is out of date" on every command,
   training analysts to associate the binary with the old name.

Verified by rendering the template against representative contexts
(33/33 tests pass) and running every use case from the original
screenshot through the real CLI against a dev VM.
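For reviewers, a minimal sketch of how a script might apply the
query_mode decision tree to one catalog row. It assumes a row is a dict
carrying the `query_mode` and `rough_size_hint` fields described in this
patch. The `pick_tool` helper and its return strings are hypothetical,
for illustration only, and are not part of the agnes CLI.

```python
def pick_tool(row: dict) -> str:
    """Map one catalog row to the command family to try first.

    Hypothetical helper: the field names follow the catalog rows described
    in this patch, but the function itself is illustrative only.
    """
    mode = row.get("query_mode") or "local"
    if mode in ("local", "materialized"):
        # Parquet is on disk (materialized rows may need `agnes pull` first).
        return "agnes query"

    hint = row.get("rough_size_hint")
    if hint is None:
        # Size unknown: check the cost before materialising anything.
        return "agnes snapshot create --estimate"
    if hint in ("large", "very_large"):
        # Too big for a one-shot remote scan; pay the cost once, query locally.
        return "agnes snapshot create"
    # small / medium remote table: a cheap one-shot probe is reasonable.
    return "agnes query --remote"


if __name__ == "__main__":
    example_rows = [
        {"id": "orders", "query_mode": "local", "rough_size_hint": "small"},
        {"id": "events", "query_mode": "remote", "rough_size_hint": None},
    ]
    for row in example_rows:
        print(row["id"], "->", pick_tool(row))
```

Remote rows still report a null hint after this patch, so in practice
they take the estimate branch until remote sizing lands.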
---
 CHANGELOG.md                  |  5 +++
 app/api/v2_catalog.py         | 53 ++++++++++++++++++++++++++-
 cli/update_check.py           |  2 +-
 config/claude_md_template.txt | 67 +++++++++++++++++++++++++++++------
 4 files changed, 114 insertions(+), 13 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index b34d7ae..89f00e7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,11 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 - **Per-user parallel parquet downloads in `agnes pull`** — the download loop in `cli/lib/pull.py` now uses a `ThreadPoolExecutor` with concurrency capped by the new `AGNES_PULL_PARALLELISM` env var (default 4, set 1 to restore pre-PR serial behavior). On a registry of N tables the wall-clock time drops from `Σ stream_download_seconds(table_i)` to roughly `max × ceil(N/4)`. Works hand-in-hand with the Caddy `file_server` change below: without it parallel client-side downloads would still queue on the single uvicorn worker; with it each request is its own caddy goroutine + sendfile, so 4-way parallelism actually delivers throughput. Per-table error semantics preserved — a failure on one table no longer aborts the rest of the batch.
 - **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile. The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it.
 - **Caddy `file_server` for parquet downloads** — `GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/.parquet`, `keboola/data/.parquet`, `jira/data/.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged.
+- **Workspace prompt: decision tree, common-mistakes callout, failure-mode dictionary** in `config/claude_md_template.txt` (the template `agnes init` writes to `/CLAUDE.md`). Surfaces every catalog-row field analyst Claude should read before deciding which command to use (`query_mode`, `sql_flavor`, `where_examples`, `fetch_via`, `rough_size_hint`); explicitly binds `--estimate` to `agnes snapshot create` ONLY (was the most-failed first-try misuse — fails with `No such option: --estimate` on `agnes query`); calls out the `agnes fetch` → `agnes snapshot create` rename so stale-doc analysts don't run a non-command; documents the BQ permission model (server SA, not personal Google identity) and a 6-row failure-mode table mapping each common error wording to its cause + the right next step.
+- **`rough_size_hint` populated for `local` + `materialized` catalog rows** in `GET /api/v2/catalog` (was hardcoded `null` with a "Task 8" TODO). Reads the parquet file size at `${DATA_DIR}/extracts//data/.parquet` and buckets into `small` (≤100 MiB), `medium` (≤1 GiB), `large` (≤10 GiB), `very_large` (>10 GiB). `remote` rows stay `null` for now (size requires a BQ INFORMATION_SCHEMA call; tracked separately). Lets analyst Claude pick `agnes snapshot create` over `agnes query --remote` by inspecting `agnes catalog --json` rather than discovering size empirically via a failed `--remote` round-trip.
+
+### Changed
+- **CLI update-banner now says `agnes` instead of `da`** (`cli/update_check.py:format_outdated_notice`). The string `[update] da X is out of date` had survived the `da` → `agnes` CLI rename and was the most-visible stale identifier in the analyst-facing surface — every CLI command printed it on stderr when a newer wheel was available.
 ### Fixed
diff --git a/app/api/v2_catalog.py b/app/api/v2_catalog.py
index 426b320..a5b660e 100644
--- a/app/api/v2_catalog.py
+++ b/app/api/v2_catalog.py
@@ -2,10 +2,12 @@ from __future__ import annotations
 from datetime import datetime, timezone
+from pathlib import Path
 from fastapi import APIRouter, Depends
 import duckdb
 from app.auth.dependencies import get_current_user, _get_db
+from app.utils import get_data_dir as _get_data_dir
 from src.rbac import can_access_table
 from src.repositories.table_registry import TableRegistryRepository
 from app.api.v2_cache import TTLCache
@@ -43,6 +45,52 @@ def _fetch_hint(table_id: str, source_type: str) -> str:
     return "already local — query directly via `agnes query`"
+# Coarse size buckets for `rough_size_hint`. Boundaries chosen so an analyst
+# Claude can decide tool by inspection: anything `large` or worse implies
+# `agnes snapshot create` over `agnes query --remote`. Numbers reflect the
+# default `bq_max_scan_bytes` 5 GiB ceiling — at "large" you're already at
+# half the per-query gate and a naive `--remote` is likely to refuse.
+_SIZE_BUCKETS = (
+    (10 * 2**20, "small"),     # ≤10 MiB
+    (100 * 2**20, "small"),    # ≤100 MiB still small (analyst-laptop scale)
+    (1 * 2**30, "medium"),     # ≤1 GiB
+    (10 * 2**30, "large"),     # ≤10 GiB
+)
+
+
+def _bucket_size(byte_count: int) -> str:
+    for cap, label in _SIZE_BUCKETS:
+        if byte_count <= cap:
+            return label
+    return "very_large"
+
+
+def _materialized_size_hint(table_id: str, source_type: str, query_mode: str) -> str | None:
+    """Return a rough size bucket for a row whose data is on the server's
+    local filesystem (any `query_mode` that produces a parquet — `local` and
+    `materialized`). Returns ``None`` for `remote` (size requires a BQ
+    INFORMATION_SCHEMA round-trip; tracked separately) and for tables whose
+    parquet hasn't been materialised yet so the AI gets ``null`` not a
+    misleading "small".
+
+    Layout matches the v2 extract.duckdb contract:
+        ${DATA_DIR}/extracts//data/.parquet
+    """
+    if query_mode == "remote":
+        return None
+    if not source_type:
+        return None
+    try:
+        path = Path(_get_data_dir()) / "extracts" / source_type / "data" / f"{table_id}.parquet"
+        if not path.exists():
+            return None
+        return _bucket_size(path.stat().st_size)
+    except Exception:
+        # Filesystem stat() race / permissions / weird DATA_DIR — fall back
+        # to null rather than crash the whole catalog response.
+        return None
+
+
 def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
     rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
     if rows is None:
@@ -66,7 +114,10 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
             "sql_flavor": _flavor_for(r.get("source_type") or ""),
             "where_examples": _examples_for(r.get("source_type") or ""),
             "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
-            "rough_size_hint": None,  # populated by Task 8 schema endpoint when called
+            "rough_size_hint": _materialized_size_hint(
+                r["id"], r.get("source_type") or "",
+                r.get("query_mode") or "local",
+            ),
         })
     return {
diff --git a/cli/update_check.py b/cli/update_check.py
index d278218..90cb2f2 100644
--- a/cli/update_check.py
+++ b/cli/update_check.py
@@ -184,7 +184,7 @@ def format_outdated_notice(info: UpdateInfo) -> str:
     literal string "None" into a copy-pasteable command — drop the upgrade
     snippet in that case.
     """
-    msg = f"[update] da {info.installed} is out of date — latest on this server is {info.latest}."
+    msg = f"[update] agnes {info.installed} is out of date — latest on this server is {info.latest}."
     if info.download_url:
         msg += f" Upgrade: uv tool install --force {info.download_url}"
     return msg
diff --git a/config/claude_md_template.txt b/config/claude_md_template.txt
index b3320df..31c78d3 100644
--- a/config/claude_md_template.txt
+++ b/config/claude_md_template.txt
@@ -58,15 +58,43 @@ Not every table is synced.
 Tables registered with `query_mode: "remote"` live in BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk. Tables you don't see in `server/parquet/` may still be queryable.
-### Discovery first
+### Discovery first — read `agnes catalog --json` BEFORE every cross-table decision
+
+`agnes catalog --json` returns one row per table with these fields. Use them; don't guess:
+
+| Field | What it tells you | How to use it |
+|---|---|---|
+| `query_mode` | `local` (parquet on laptop) / `remote` (BQ on demand) / `materialized` (synced parquet of a BQ result) | Picks the tool — see decision tree below |
+| `source_type` | `keboola` / `bigquery` / `jira` | Determines SQL dialect |
+| `sql_flavor` | `duckdb` for local sources, `bigquery` for `--remote` queries on BQ rows | What syntax `--where` expects |
+| `where_examples` | 1–3 example WHERE predicates that are valid for this table's dialect | Copy as starting point for `--where` |
+| `fetch_via` | Pre-formatted `agnes snapshot create …` template for this table | The canonical "how do I get a slice of this table" command |
+| `rough_size_hint` | Coarse size hint (`small` / `medium` / `large` / `very_large`, or null when unknown) | Bigger than `medium` → never `agnes query --remote` without a tight `--where`; use `agnes snapshot create` |
 ```
-agnes catalog --json | jq '.[] | {name, source_type, query_mode}'   # see all tables + their modes
-agnes schema                    # columns + types
-agnes describe  -n 5            # sample rows
+agnes catalog --json            # full structured view (use this in scripts)
+agnes catalog                   # human-readable summary
+agnes schema                    # columns + types (BIGQUERY/DUCKDB dialect printed in header)
+agnes describe  -n 5            # sample rows (works on local & materialized only)
 ```
-For local-mode tables, query directly with `agnes query "SELECT … FROM "`.
+### Decision tree — pick the right tool BEFORE writing SQL
+
+```
+                     ┌─ local        → agnes query "SELECT ..."
+agnes catalog → ─────┤
+query_mode of        ├─ materialized → agnes query (parquet was synced by agnes pull)
+                     │                 (if missing locally, run `agnes pull` first)
+                     │
+                     └─ remote       → choose by table size + query shape:
+                                       - one cheap probe (COUNT, schema-confirm, single agg ≤200s)
+                                           → agnes query --remote "..."
+                                       - repeated questions on same slice / large scan
+                                           → agnes snapshot create  --select ... --where ... --as
+                                             then agnes query "SELECT ... FROM "
+                                       - join with a local table
+                                           → agnes query --register-bq "alias=BQ_SQL" --sql "..."
+```
 ### Three patterns for `query_mode: "remote"` tables
@@ -76,13 +104,30 @@ For local-mode tables, query directly with `agnes query "SELECT … FROM
 | **`agnes query --remote`** | one-shot, server-side execution against BigQuery (works for BASE TABLE rows directly + VIEW/MATERIALIZED_VIEW rows via the BQ jobs API; cost-guarded by a 5 GiB scan cap configurable in /admin/server-config) | single aggregate / cheap probe |
 | **`agnes query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
-### Permission model + cost — important
+### Common mistakes — avoid on first try
-- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
-- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM ` can cost real money. ALWAYS:
-  - filter via `--where` on the partition column (typically a date)
-  - list specific columns in `--select` — column-store BQ skips the rest, cheaper
-  - run `--estimate` first when unsure of the table size or partitioning
+- **`--estimate` is on `agnes snapshot create` ONLY.** Do NOT pass it to `agnes query` — fails with `No such option: --estimate`. The estimate flow is a snapshot-creation cost gate, not a query primitive.
+- **Old `agnes fetch` / `da fetch` / `da query` references in stale docs** — the CLI is `agnes`; `agnes fetch` was renamed to `agnes snapshot create`. If you see those names, translate before running.
+- **Don't attempt personal GCP auth** if a BQ query fails with permission errors. BQ access uses the **server's service account**, not your Google identity — escalate to admin instead.
+- **Don't `agnes query --remote "SELECT * FROM "`** without a `--where`. Even if the scan-byte gate refuses, you've wasted the round-trip; gate yourself first by reading `rough_size_hint` and `where_examples` from `agnes catalog --json`.
+
+### Failure-mode dictionary — what each error means + the right response
+
+| Error wording (substring) | Cause | Response |
+|---|---|---|
+| `Binder Error: Query execution exceeded the timeout. Job ID: ...` | BQ-side query took >~200 s wall-clock; the DuckDB BQ extension's `bq_query_timeout_ms` (default 90 s, server may bump to 600 s) elapsed | Narrow `--where` (especially partition column), drop unused columns from `--select`, or switch to `agnes snapshot create` to materialise once + query locally |
+| `HTTP 400: remote_scan_too_large` | Server's `bq_max_scan_bytes` cost gate refused the query (default 5 GiB) | Tighten `--where`; consider `agnes snapshot create` so the cost is paid once, then local queries are free |
+| `HTTP 401: ... unauthorized` | PAT expired or wrong | `agnes init --server-url ... --token `; re-mint via the dashboard's "Personal Access Tokens" page |
+| `HTTP 403: cross_project_forbidden` (with `serviceusage` mention) | Server SA lacks `serviceusage.services.use` on the BQ data project | Escalate to admin to set `data_source.bigquery.billing_project`; do NOT try personal auth |
+| `ReadTimeout` (client-side) on `agnes query --remote` | CLI is older than 0.35.1 (had 30 s default) | `agnes --version`; if <0.35.1, upgrade with `uv tool install --force ` (the URL is in the `[update]` banner that prints on every command). Then retry. |
+| `unknown columns: [...]` from `agnes snapshot create` | `--select` lists columns that don't exist | Run `agnes schema ` and copy column names verbatim |
+
+### Cost discipline — every BQ query bills bytes scanned
+
+A naive `SELECT * FROM ` can cost real money. ALWAYS:
+- filter via `--where` on the partition column (typically a date) — read `where_examples` in `agnes catalog --json`
+- list specific columns in `--select` — column-store BQ skips the rest
+- run `--estimate` first (only valid on `agnes snapshot create`) when the table is partitioned/clustered or when `rough_size_hint` is unknown
 ### `agnes snapshot create` discipline