From 30e81a15b99ac918599ddd90ae810e103b902e74 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr <zdenek.srotyr@keboola.com>
Date: Tue, 5 May 2026 16:38:32 +0200
Subject: [PATCH] feat(workspace-prompt): decision tree + size-hint so analyst
 Claude gets it right first try
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three concrete changes addressing the "analyst Claude misuses the CLI"
class of bugs (image.png table — issues #3, #5, plus the recurrent
"how big is this table" guesswork):

1. config/claude_md_template.txt — the template agnes init writes to
   <workspace>/CLAUDE.md. Surfaces every catalog-row field with a why,
   adds a query_mode-based decision tree, explicit --estimate scoping
   (snapshot create ONLY — was the #1 first-try error), an agnes fetch
   → agnes snapshot create rename note, and a 6-row failure-mode table
   that maps each common error wording to its right next step.

2. app/api/v2_catalog.py — populate rough_size_hint for local +
   materialized rows from the on-disk parquet size, bucketed
   small/medium/large/very_large. Was hardcoded null with a TODO; AI
   couldn't tell "is this 6.8 GB" without a failed --remote round-trip.

3. cli/update_check.py — the [update] banner survived the da→agnes
   rename and printed "[update] da X is out of date" on every command,
   training analysts to associate the binary with the old name.

Verified by rendering the template against representative contexts
(33/33 tests pass) and running every use case from the original
screenshot through the real CLI against a dev VM.
---
 CHANGELOG.md                  |  5 +++
 app/api/v2_catalog.py         | 53 ++++++++++++++++++++++++++-
 cli/update_check.py           |  2 +-
 config/claude_md_template.txt | 67 +++++++++++++++++++++++++++++------
 4 files changed, 114 insertions(+), 13 deletions(-)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index b34d7ae..89f00e7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -15,6 +15,11 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 - **Per-user parallel parquet downloads in `agnes pull`** — the download loop in `cli/lib/pull.py` now uses a `ThreadPoolExecutor` with concurrency capped by the new `AGNES_PULL_PARALLELISM` env var (default 4, set 1 to restore pre-PR serial behavior). On a registry of N tables the wall-clock time drops from `Σ stream_download_seconds(table_i)` to roughly `max × ceil(N/4)`. Works hand-in-hand with the Caddy `file_server` change below: without it parallel client-side downloads would still queue on the single uvicorn worker; with it each request is its own caddy goroutine + sendfile, so 4-way parallelism actually delivers throughput. Per-table error semantics preserved — a failure on one table no longer aborts the rest of the batch.
 - **`scripts/ops/agnes-auto-upgrade.sh` now re-fetches Caddyfile + every compose overlay** from `keboola/agnes-the-ai-analyst@main` on every tick, hashes them, and triggers a `docker compose up -d` recreation when the hash changes — same path as an image-digest change. Pre-fix the script only watched `docker images` digests, so a Caddyfile or compose change in main never reached running VMs (only fresh boots ran `startup.sh`'s file fetch). Without this, the new file_server downloads-path below would land in the image but stay inert against an old Caddyfile. The script also self-updates from the same path so the very fix that watches config files isn't itself stuck on running VMs. Fail-soft on curl errors — keeps the existing file rather than blanking it.
 - **Caddy `file_server` for parquet downloads** — `GET /api/data/{table_id}/download` is now intercepted at the Caddy layer (TLS profile only) and served directly via sendfile/zero-copy from the data volume mounted read-only at `/srv` inside the caddy container. Caddy authorises every request via a new lightweight RBAC probe `GET /api/data/{table_id}/check-access` (returns 204 when the caller has read access on the table, 403 otherwise) using the `forward_auth` directive — the bulk byte transfer never touches uvicorn workers. Resolves a real production failure mode where a single multi-GB analyst pull held the app's only uvicorn worker for the duration of the stream and starved the UI / `/api/health` / every other API endpoint, eventually flipping the container to `unhealthy`. Path discovery uses Caddy's `try_files` over the known `extract.duckdb` v2 source subdirs (`bigquery/data/<id>.parquet`, `keboola/data/<id>.parquet`, `jira/data/<id>.parquet`); a parquet not at any of those paths transparently falls through to the existing app handler so legacy `src_data/parquet` layouts and future connectors keep working with no Caddyfile change. Non-Caddy deployments (dev `docker compose up` without `--profile tls`) continue to use the app handler unchanged.
+- **Workspace prompt: decision tree, common-mistakes callout, failure-mode dictionary** in `config/claude_md_template.txt` (the template `agnes init` writes to `<workspace>/CLAUDE.md`). Surfaces every catalog-row field analyst Claude should read before deciding which command to use (`query_mode`, `sql_flavor`, `where_examples`, `fetch_via`, `rough_size_hint`); explicitly binds `--estimate` to `agnes snapshot create` ONLY (was the most-failed first-try misuse — fails with `No such option: --estimate` on `agnes query`); calls out the `agnes fetch` → `agnes snapshot create` rename so stale-doc analysts don't run a non-command; documents the BQ permission model (server SA, not personal Google identity) and a 6-row failure-mode table mapping each common error wording to its cause + the right next step.
+- **`rough_size_hint` populated for `local` + `materialized` catalog rows** in `GET /api/v2/catalog` (was hardcoded `null` with a "Task 8" TODO). Reads the parquet file size at `${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet` and buckets into `small` (≤100 MiB), `medium` (≤1 GiB), `large` (≤10 GiB), `very_large` (>10 GiB). `remote` rows stay `null` for now (size requires a BQ INFORMATION_SCHEMA call; tracked separately). Lets analyst Claude pick `agnes snapshot create` over `agnes query --remote` by inspecting `agnes catalog --json` rather than discovering size empirically via a failed `--remote` round-trip.
+
+### Changed
+- **CLI update-banner now says `agnes` instead of `da`** (`cli/update_check.py:format_outdated_notice`). The string `[update] da X is out of date` had survived the `da` → `agnes` CLI rename and was the most-visible stale identifier in the analyst-facing surface — every CLI command printed it on stderr when a newer wheel was available.
 
 ### Fixed
 
diff --git a/app/api/v2_catalog.py b/app/api/v2_catalog.py
index 426b320..a5b660e 100644
--- a/app/api/v2_catalog.py
+++ b/app/api/v2_catalog.py
@@ -2,10 +2,12 @@
 
 from __future__ import annotations
 from datetime import datetime, timezone
+from pathlib import Path
 from fastapi import APIRouter, Depends
 import duckdb
 
 from app.auth.dependencies import get_current_user, _get_db
+from app.utils import get_data_dir as _get_data_dir
 from src.rbac import can_access_table
 from src.repositories.table_registry import TableRegistryRepository
 from app.api.v2_cache import TTLCache
@@ -43,6 +45,52 @@ def _fetch_hint(table_id: str, source_type: str) -> str:
     return "already local — query directly via `agnes query`"
 
 
+# Coarse size buckets for `rough_size_hint`. Boundaries chosen so an analyst
+# Claude can decide tool by inspection: anything `large` or worse implies
+# `agnes snapshot create` over `agnes query --remote`. Numbers reflect the
+# default `bq_max_scan_bytes` 5 GiB ceiling — at "large" you're already at
+# half the per-query gate and a naive `--remote` is likely to refuse.
+_SIZE_BUCKETS = (
+    (10 * 2**20, "small"),     # ≤10 MiB
+    (100 * 2**20, "small"),    # ≤100 MiB still small (analyst-laptop scale)
+    (1 * 2**30, "medium"),     # ≤1 GiB
+    (10 * 2**30, "large"),     # ≤10 GiB
+)
+
+
+def _bucket_size(byte_count: int) -> str:
+    for cap, label in _SIZE_BUCKETS:
+        if byte_count <= cap:
+            return label
+    return "very_large"
+
+
+def _materialized_size_hint(table_id: str, source_type: str, query_mode: str) -> str | None:
+    """Return a rough size bucket for a row whose data is on the server's
+    local filesystem (any `query_mode` that produces a parquet — `local` and
+    `materialized`). Returns ``None`` for `remote` (size requires a BQ
+    INFORMATION_SCHEMA round-trip; tracked separately) and for tables whose
+    parquet hasn't been materialised yet so the AI gets ``null`` not a
+    misleading "small".
+
+    Layout matches the v2 extract.duckdb contract:
+      ${DATA_DIR}/extracts/<source_type>/data/<table_id>.parquet
+    """
+    if query_mode == "remote":
+        return None
+    if not source_type:
+        return None
+    try:
+        path = Path(_get_data_dir()) / "extracts" / source_type / "data" / f"{table_id}.parquet"
+        if not path.exists():
+            return None
+        return _bucket_size(path.stat().st_size)
+    except Exception:
+        # Filesystem stat() race / permissions / weird DATA_DIR — fall back
+        # to null rather than crash the whole catalog response.
+        return None
+
+
 def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
     rows = _table_rows_cache.get(_TABLE_ROWS_KEY)
     if rows is None:
@@ -66,7 +114,10 @@ def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
             "sql_flavor": _flavor_for(r.get("source_type") or ""),
             "where_examples": _examples_for(r.get("source_type") or ""),
             "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
-            "rough_size_hint": None,  # populated by Task 8 schema endpoint when called
+            "rough_size_hint": _materialized_size_hint(
+                r["id"], r.get("source_type") or "",
+                r.get("query_mode") or "local",
+            ),
         })
 
     return {
diff --git a/cli/update_check.py b/cli/update_check.py
index d278218..90cb2f2 100644
--- a/cli/update_check.py
+++ b/cli/update_check.py
@@ -184,7 +184,7 @@ def format_outdated_notice(info: UpdateInfo) -> str:
     literal string "None" into a copy-pasteable command — drop the upgrade
     snippet in that case.
     """
-    msg = f"[update] da {info.installed} is out of date — latest on this server is {info.latest}."
+    msg = f"[update] agnes {info.installed} is out of date — latest on this server is {info.latest}."
     if info.download_url:
         msg += f" Upgrade: uv tool install --force {info.download_url}"
     return msg
diff --git a/config/claude_md_template.txt b/config/claude_md_template.txt
index b3320df..31c78d3 100644
--- a/config/claude_md_template.txt
+++ b/config/claude_md_template.txt
@@ -58,15 +58,43 @@ Not every table is synced. Tables registered with `query_mode: "remote"` live in
 BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
 Tables you don't see in `server/parquet/` may still be queryable.
 
-### Discovery first
+### Discovery first — read `agnes catalog --json` BEFORE every cross-table decision
+
+`agnes catalog --json` returns one row per table with these fields. Use them; don't guess:
+
+| Field | What it tells you | How to use it |
+|---|---|---|
+| `query_mode` | `local` (parquet on laptop) / `remote` (BQ on demand) / `materialized` (synced parquet of a BQ result) | Picks the tool — see decision tree below |
+| `source_type` | `keboola` / `bigquery` / `jira` | Determines SQL dialect |
+| `sql_flavor` | `duckdb` for local sources, `bigquery` for `--remote` queries on BQ rows | What syntax `--where` expects |
+| `where_examples` | 1–3 example WHERE predicates that are valid for this table's dialect | Copy as starting point for `--where` |
+| `fetch_via` | Pre-formatted `agnes snapshot create …` template for this table | The canonical "how do I get a slice of this table" command |
+| `rough_size_hint` | Coarse size hint (`small` / `medium` / `large` or null when unknown) | Bigger than `medium` → never `agnes query --remote` without a tight `--where`; use `agnes snapshot create` |
 
 ```
-agnes catalog --json | jq '.[] | {name, source_type, query_mode}'   # see all tables + their modes
-agnes schema <table>                                                # columns + types
-agnes describe <table> -n 5                                         # sample rows
+agnes catalog --json                              # full structured view (use this in scripts)
+agnes catalog                                     # human-readable summary
+agnes schema <table>                              # columns + types (BIGQUERY/DUCKDB dialect printed in header)
+agnes describe <table> -n 5                       # sample rows (works on local & materialized only)
 ```
 
-For local-mode tables, query directly with `agnes query "SELECT … FROM <table>"`.
+### Decision tree — pick the right tool BEFORE writing SQL
+
+```
+                       ┌─ local        → agnes query "SELECT ..."
+agnes catalog →   ─────┤
+query_mode of <table>  ├─ materialized → agnes query (parquet was synced by agnes pull)
+                       │                 (if missing locally, run `agnes pull` first)
+                       │
+                       └─ remote       → choose by table size + query shape:
+                                          - one cheap probe (COUNT, schema-confirm, single agg ≤200s)
+                                              → agnes query --remote "..."
+                                          - repeated questions on same slice / large scan
+                                              → agnes snapshot create <table> --select ... --where ... --as <name>
+                                                then agnes query "SELECT ... FROM <name>"
+                                          - join with a local table
+                                              → agnes query --register-bq "alias=BQ_SQL" --sql "..."
+```
 
 ### Three patterns for `query_mode: "remote"` tables
 
@@ -76,13 +104,30 @@ For local-mode tables, query directly with `agnes query "SELECT … FROM <table>
 | **`agnes query --remote`** | one-shot, server-side execution against BigQuery (works for BASE TABLE rows directly + VIEW/MATERIALIZED_VIEW rows via the BQ jobs API; cost-guarded by a 5 GiB scan cap configurable in /admin/server-config) | single aggregate / cheap probe |
 | **`agnes query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
 
-### Permission model + cost — important
+### Common mistakes — avoid on first try
 
-- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
-- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
-  - filter via `--where` on the partition column (typically a date)
-  - list specific columns in `--select` — column-store BQ skips the rest, cheaper
-  - run `--estimate` first when unsure of the table size or partitioning
+- **`--estimate` is on `agnes snapshot create` ONLY.** Do NOT pass it to `agnes query` — fails with `No such option: --estimate`. The estimate flow is a snapshot-creation cost gate, not a query primitive.
+- **Old `agnes fetch` / `da fetch` / `da query` references in stale docs** — the CLI is `agnes`; `agnes fetch` was renamed to `agnes snapshot create`. If you see those names, translate before running.
+- **Don't attempt personal GCP auth** if a BQ query fails with permission errors. BQ access uses the **server's service account**, not your Google identity — escalate to admin instead.
+- **Don't `agnes query --remote "SELECT * FROM <large_table>"`** without a `--where`. Even if the scan-byte gate refuses, you've wasted the round-trip; gate yourself first by reading `rough_size_hint` and `where_examples` from `agnes catalog --json`.
+
+### Failure-mode dictionary — what each error means + the right response
+
+| Error wording (substring) | Cause | Response |
+|---|---|---|
+| `Binder Error: Query execution exceeded the timeout. Job ID: ...` | BQ-side query took >~200 s wall-clock; the DuckDB BQ extension's `bq_query_timeout_ms` (default 90 s, server may bump to 600 s) elapsed | Narrow `--where` (especially partition column), drop unused columns from `--select`, or switch to `agnes snapshot create` to materialise once + query locally |
+| `HTTP 400: remote_scan_too_large` | Server's `bq_max_scan_bytes` cost gate refused the query (default 5 GiB) | Tighten `--where`; consider `agnes snapshot create` so the cost is paid once, then local queries are free |
+| `HTTP 401: ... unauthorized` | PAT expired or wrong | `agnes init --server-url ... --token <new-PAT>`; re-mint via the dashboard's "Personal Access Tokens" page |
+| `HTTP 403: cross_project_forbidden` (with `serviceusage` mention) | Server SA lacks `serviceusage.services.use` on the BQ data project | Escalate to admin to set `data_source.bigquery.billing_project`; do NOT try personal auth |
+| `ReadTimeout` (client-side) on `agnes query --remote` | CLI is older than 0.35.1 (had 30 s default) | `agnes --version`; if <0.35.1, upgrade with `uv tool install --force <wheel-from-server>` (the URL is in the `[update]` banner that prints on every command). Then retry. |
+| `unknown columns: [...]` from `agnes snapshot create` | `--select` lists columns that don't exist | Run `agnes schema <table>` and copy column names verbatim |
+
+### Cost discipline — every BQ query bills bytes scanned
+
+A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
+- filter via `--where` on the partition column (typically a date) — read `where_examples` in `agnes catalog --json`
+- list specific columns in `--select` — column-store BQ skips the rest
+- run `--estimate` first (only valid on `agnes snapshot create`) when the table is partitioned/clustered or when `rough_size_hint` is unknown
 
 ### `agnes snapshot create` discipline