agnes-the-ai-analyst/config/claude_md_template.txt
Vojtech bd7b8c3233
fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154)
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template

Closes #153.

The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.

Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.

## Changes

### config/claude_md_template.txt (+83)

Added a `## Remote Queries (BigQuery)` section covering:

- Discovery first — `da catalog --json | jq '...'` to see all tables
  with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
  - `da fetch` (preferred) — materialize a filtered subset locally,
    query the snapshot, drop when done.
  - `da query --remote` — one-shot server-side execution (cheap probes).
  - `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
  --select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
  DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
  use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
  ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.

### docs/setup/claude_md_template.txt (deleted)

Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
   shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.

### CHANGELOG.md

`### Fixed` and `### Removed` entries under [Unreleased].

## Tested

- Manually walked the diff against `da skills show agnes-data-querying`
  output on a live VM (foundryai-development) — patterns + flags
  match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
  is identical to existing template substitution path so render is
  not at risk.

## Out of scope

- The companion gap that data_description.md often only enumerates
  query_mode: "local" tables (no signal that other modes exist) —
  separate concern, fix likely belongs in the metadata generator
  on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
  `query_mode: "remote"` in the registry — workflow improvement, not
  a code bug.

* chore(release): cut 0.28.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-01 12:06:41 +02:00

163 lines
7.9 KiB
Text
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# {instance_name} — AI Data Analyst
This workspace is connected to {server_url}.
## Rules
- Before computing any business metric: run `da metrics show <category>/<name>`
- **For canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as source of truth.
- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
- Save work output to `user/artifacts/`
- Sync data regularly with `da sync`
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.
## Metrics Workflow
1. `da metrics list` — find the relevant metric
2. `da metrics show revenue/mrr` — read SQL and business rules
3. Use the canonical SQL from the metric definition, adapt to the question
4. Never invent metric calculations — always check existing definitions first
## Data Sync
- `da sync` — download current data from server
- `da sync --docs-only` — just metadata and metrics (fast refresh)
- `da sync --upload-only` — upload sessions and local notes to server
- Data on the server refreshes every {sync_interval}
## Remote Queries (BigQuery) — when data isn't on the laptop
Not every table is synced. Tables registered with `query_mode: "remote"` live in
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
Tables you don't see in `data/parquet/` may still be queryable.
### Discovery first
```
da catalog --json | jq '.[] | {name, source_type, query_mode}' # see all tables + their modes
da schema <table> # columns + types
da describe <table> -n 5 # sample rows
```
For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.
### Three patterns for `query_mode: "remote"` tables
| Pattern | Tool | Use when |
|---------|------|----------|
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
| **`da query --remote`** | one-shot, server-side execution against BigQuery | single aggregate / cheap probe |
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
### Permission model + cost — important
- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
- filter via `--where` on the partition column (typically a date)
- list specific columns in `--select` — column-store BQ skips the rest, cheaper
- run `--estimate` first when unsure of the table size or partitioning
### `da fetch` discipline
```
# 1. ESTIMATE first — refuses to fetch without knowing the cost
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate
# 2. If reasonable, fetch as a named snapshot
da fetch <table> --select col1,col2 --where "..." --as my_recent
# 3. Query the local snapshot
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"
# 4. List + drop snapshots when done
da snapshot list
da snapshot drop my_recent
```
Rules of thumb:
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when the table is `partition_by` / `clustered_by`
per `da schema`, or could plausibly exceed 1 GB local bytes.
- Reuse snapshots across questions in the same conversation — `da snapshot list`
before fetching.
### Snapshot freshness — when to refresh
Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check `sync_schedule` per `da catalog`). For each new conversation:
```
da snapshot list # see existing snapshots + their ages
da snapshot drop my_recent # drop stale ones
da fetch <table> --select ... --where ... --as my_recent # re-fetch
```
If the question is time-sensitive (e.g. "today's orders"), assume any snapshot older than the table's `sync_schedule` is stale and refresh.
### Hybrid query example — local + remote in one query
`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), result registered as a DuckDB view, then the joined query runs locally.
```
da query \
--register-bq "traffic=SELECT date, country, SUM(views) AS views \
FROM \`prj.web_analytics.sessions\` \
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
GROUP BY 1, 2" \
--sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
FROM orders o \
JOIN traffic t ON o.date = t.date AND o.country = t.country \
ORDER BY 1 DESC"
```
The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can compose multiple BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).
### BigQuery SQL flavor for `--where`
Source-typed `bigquery` tables use BigQuery dialect, not DuckDB:
- Date literal: `DATE '2026-01-01'`
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- Cast: `CAST(x AS INT64)` (NOT `INT`)
### When the table you want isn't in `da catalog`
The table may exist in BigQuery but not be registered with Agnes yet. Two options:
1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed
if the agnes server SA has BQ access:
```
da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
--sql "SELECT * FROM live"
```
2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up
in `da catalog` and supports `da fetch` / `da query --remote`. This is the
right path for any table you'll query repeatedly.
### Deeper guidance
For the full protocol, including hybrid-query examples, snapshot hygiene, and
when NOT to use `da fetch`, run:
```
da skills show agnes-data-querying
```
## Corporate Memory
Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.
- `km_<id>.md` — mandatory rules (always enforced)
- `km_approved.md` — approved guidance (confidence × recency ranked)
Run `da sync` to refresh. Rules are pruned automatically when items are revoked.
## Directory Structure
- `data/` — read-only data downloaded from server
- `data/parquet/` — table data in Parquet format
- `data/duckdb/` — local analytics DuckDB database
- `data/metadata/` — profiles, schema, metrics cache
- `user/` — your workspace (persistent across syncs)
- `user/artifacts/` — analysis outputs, reports, charts
- `user/sessions/` — Claude Code session logs
- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).