* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template

Closes #153.

The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt) covered metrics, sync, corporate memory, and directory layout — but had ZERO mention of query_mode: "remote", da fetch, da query --remote, or --register-bq. Result: the AI analyst running in a freshly-bootstrapped workspace had no idea BigQuery-backed tables existed, no path to fetch unsynced data, and no fallback for tables not in the catalog.

Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md on 2026-05-01: section confirmed missing. Workspace-level (parent-dir) CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level file (which Claude reads as primary project context) had nothing.

## Changes

### config/claude_md_template.txt (+83)

Added a `## Remote Queries (BigQuery)` section covering:

- Discovery first — `da catalog --json | jq '...'` to see all tables with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
  - `da fetch` (preferred) — materialize a filtered subset locally, query the snapshot, drop when done.
  - `da query --remote` — one-shot server-side execution (cheap probes).
  - `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on --select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal, DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all, use ad-hoc `--register-bq` if the agnes server SA has BQ access, or ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.

### docs/setup/claude_md_template.txt (deleted)

Stale 359-line template that documented the deprecated SSH-heredoc remote_query.sh protocol. No code references it (verified via grep across .py / .sh / .yml / .md). Removing eliminates two failure modes:

1. A future refactor accidentally pulling it into a workspace and shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.

### CHANGELOG.md

`### Fixed` and `### Removed` entries under [Unreleased].

## Tested

- Manually walked the diff against `da skills show agnes-data-querying` output on a live VM (foundryai-development) — patterns + flags match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern is identical to existing template substitution path so render is not at risk.

## Out of scope

- The companion gap that data_description.md often only enumerates query_mode: "local" tables (no signal that other modes exist) — separate concern, fix likely belongs in the metadata generator on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as `query_mode: "remote"` in the registry — workflow improvement, not a code bug.

* chore(release): cut 0.28.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
163 lines
7.9 KiB
Text
# {instance_name} — AI Data Analyst

This workspace is connected to {server_url}.

## Rules

- Before computing any business metric: run `da metrics show <category>/<name>`
- **For the canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as the source of truth.
- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
- Save work output to `user/artifacts/`
- Sync data regularly with `da sync`
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.

## Metrics Workflow

1. `da metrics list` — find the relevant metric
2. `da metrics show revenue/mrr` — read SQL and business rules
3. Use the canonical SQL from the metric definition, adapt to the question
4. Never invent metric calculations — always check existing definitions first

## Data Sync

- `da sync` — download current data from server
- `da sync --docs-only` — just metadata and metrics (fast refresh)
- `da sync --upload-only` — upload sessions and local notes to server
- Data on the server refreshes every {sync_interval}

## Remote Queries (BigQuery) — when data isn't on the laptop

Not every table is synced. Tables registered with `query_mode: "remote"` live in
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
Tables you don't see in `data/parquet/` may still be queryable.

### Discovery first

```
da catalog --json | jq '.[] | {name, source_type, query_mode}'   # see all tables + their modes
da schema <table>                                                 # columns + types
da describe <table> -n 5                                          # sample rows
```

For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.

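When `jq` is not available, the same discovery step can be scripted. The sketch below groups table names by `query_mode`. It assumes `da catalog --json` emits a JSON array of objects with `name` and `query_mode` keys (inferred from the jq filter above, not from a documented schema). The function name, the sample rows, and the `example_source` value are made up for illustration.

```python
import json
from collections import defaultdict


def tables_by_query_mode(catalog_json: str) -> dict:
    """Group table names by query_mode.

    Assumption: the input is a JSON array of objects carrying "name" and
    "query_mode" keys, as suggested by the jq filter in this section.
    """
    grouped = defaultdict(list)
    for entry in json.loads(catalog_json):
        # Entries without a query_mode are bucketed under "unknown".
        grouped[entry.get("query_mode", "unknown")].append(entry["name"])
    return {mode: sorted(names) for mode, names in grouped.items()}


# Hypothetical sample standing in for real `da catalog --json` output.
sample = json.dumps([
    {"name": "orders", "source_type": "example_source", "query_mode": "local"},
    {"name": "web_events", "source_type": "bigquery", "query_mode": "remote"},
    {"name": "sessions", "source_type": "bigquery", "query_mode": "remote"},
])

modes = tables_by_query_mode(sample)
```

Anything listed under `"remote"` has no parquet on disk and needs one of the patterns described next.
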
### Three patterns for `query_mode: "remote"` tables

| Pattern | What it does | Use when |
|---------|--------------|----------|
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
| **`da query --remote`** | one-shot, server-side execution against BigQuery | single aggregate / cheap probe |
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |

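The table can be read as a small decision rule. The sketch below is one possible encoding, not part of the CLI: `choose_pattern` and its parameters are hypothetical, and the order in which the conditions are checked (join first, then reuse) is an assumption.

```python
def choose_pattern(query_mode: str, *, repeated: bool = False,
                   joins_local: bool = False) -> str:
    """Return the command family suggested by the table above.

    Assumption: a join with local data takes priority over snapshot reuse.
    """
    if query_mode == "local":
        # Local-mode tables are queried directly.
        return "da query"
    if joins_local:
        # Crossing local + remote data in one statement.
        return "da query --register-bq"
    if repeated:
        # Several questions on the same slice: materialize it once.
        return "da fetch"
    # Single aggregate or cheap probe.
    return "da query --remote"
```

For example, a one-off row count on a remote table maps to `da query --remote`, while several follow-up questions on the same 30-day slice map to `da fetch`.
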
### Permission model + cost — important

- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
  - filter via `--where` on the partition column (typically a date)
  - list specific columns in `--select`; BigQuery storage is columnar, so columns you do not list are not scanned and not billed
  - run `--estimate` first when unsure of the table size or partitioning

### `da fetch` discipline

```
# 1. ESTIMATE first — refuses to fetch without knowing the cost
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate

# 2. If reasonable, fetch as a named snapshot
da fetch <table> --select col1,col2 --where "..." --as my_recent

# 3. Query the local snapshot
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"

# 4. List + drop snapshots when done
da snapshot list
da snapshot drop my_recent
```

Rules of thumb:
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; if no filter applies, add `--limit` instead.
- ALWAYS run `--estimate` first when `da schema` shows the table is partitioned or clustered
  (`partition_by` / `clustered_by`), or when the result could plausibly exceed 1 GB of local bytes.
- Reuse snapshots across questions in the same conversation: run `da snapshot list`
  before fetching.

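The rules of thumb can also be enforced before a command is ever run. The sketch below assembles a `da fetch` argument list and rejects calls that break the first two rules. `build_fetch_args` is a hypothetical helper. It uses only flags that appear in this section, and it assumes `--limit` takes an integer row count, which this document does not confirm.

```python
def build_fetch_args(table, columns, where=None, limit=None,
                     snapshot=None, estimate=False):
    """Build an argv list for `da fetch`, enforcing the rules of thumb."""
    if not columns:
        raise ValueError("list specific columns; avoid implicit SELECT *")
    if where is None and limit is None:
        raise ValueError("remote fetches need --where, or --limit as a fallback")

    args = ["da", "fetch", table, "--select", ",".join(columns)]
    if where is not None:
        args += ["--where", where]
    if limit is not None:
        # Assumption: --limit accepts an integer row count.
        args += ["--limit", str(limit)]
    if estimate:
        # Estimate-only run: no snapshot is named.
        args.append("--estimate")
    elif snapshot is not None:
        args += ["--as", snapshot]
    return args


estimate_args = build_fetch_args(
    "events", ["date", "country"],
    where="date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
    estimate=True,
)
```

Passing the list to `subprocess.run(estimate_args)` instead of a shell string avoids quoting problems in the `--where` expression.
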
### Snapshot freshness — when to refresh

Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check `sync_schedule` in `da catalog`). For each new conversation:

```
da snapshot list                              # see existing snapshots + their ages
da snapshot drop my_recent                    # drop stale ones
da fetch <table> --select ... --where ... --as my_recent   # re-fetch
```

If the question is time-sensitive (e.g. "today's orders"), assume any snapshot older than the table's `sync_schedule` interval is stale and refresh it.

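The staleness rule reduces to a single comparison. The sketch below assumes the caller has already turned the snapshot's age (from `da snapshot list`) and the table's `sync_schedule` into Python `datetime` and `timedelta` values. The output formats of those commands are not specified here, so parsing is left out, and `is_stale` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(fetched_at: datetime, refresh_interval: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True when the snapshot is older than one refresh interval."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - fetched_at > refresh_interval


# Example: a daily-refreshed table, snapshot taken 30 hours ago.
fetched = datetime(2026, 5, 1, 0, 0, tzinfo=timezone.utc)
checked = datetime(2026, 5, 2, 6, 0, tzinfo=timezone.utc)
needs_refresh = is_stale(fetched, timedelta(days=1), now=checked)
```

Use timezone-aware timestamps on both sides. Subtracting a naive `datetime` from an aware one raises `TypeError`.
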
### Hybrid query example — local + remote in one query

`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), its result is registered as a DuckDB view, and then the joined query runs locally.

```
da query \
  --register-bq "traffic=SELECT date, country, SUM(views) AS views \
                 FROM \`prj.web_analytics.sessions\` \
                 WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
                 GROUP BY 1, 2" \
  --sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
         FROM orders o \
         JOIN traffic t ON o.date = t.date AND o.country = t.country \
         ORDER BY 1 DESC"
```

The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can be combined to register several BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).

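The `WHERE` / `GROUP BY` requirement can be checked before anything is sent to BigQuery. The sketch below builds the argument list for a hybrid query and rejects subqueries that contain neither keyword. `build_hybrid_args` is a hypothetical helper. The keyword check is a plain regex heuristic rather than a SQL parser, so it can be fooled by keywords inside string literals.

```python
import re

# Heuristic: look for WHERE or GROUP BY as whole words, case-insensitive.
_BOUNDED = re.compile(r"\b(WHERE|GROUP\s+BY)\b", re.IGNORECASE)


def build_hybrid_args(bq_views: dict, sql: str) -> list:
    """Build an argv list for `da query` with one --register-bq per alias."""
    args = ["da", "query"]
    for alias, subquery in bq_views.items():
        if not _BOUNDED.search(subquery):
            raise ValueError(
                f"BQ subquery for '{alias}' needs WHERE and/or GROUP BY"
            )
        args += ["--register-bq", f"{alias}={subquery}"]
    args += ["--sql", sql]
    return args


hybrid_args = build_hybrid_args(
    {"traffic": "SELECT date, SUM(views) AS views "
                "FROM `prj.web_analytics.sessions` "
                "WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) "
                "GROUP BY 1"},
    "SELECT o.date, o.revenue, t.views FROM orders o JOIN traffic t ON o.date = t.date",
)
```

Because the arguments are passed as a list, the backticks around the BigQuery table name need no shell escaping.
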
### BigQuery SQL flavor for `--where`

Tables whose `source_type` is `bigquery` use the BigQuery SQL dialect, not DuckDB's:

- Date literal: `DATE '2026-01-01'`
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- Cast: `CAST(x AS INT64)` (NOT `INT`)

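Generating these fragments from code keeps the dialect details in one place. The sketch below covers only the date cases from the list above. The helper names are hypothetical, and the column name is inserted as-is, so it must come from a trusted source such as `da schema` output.

```python
from datetime import date


def bq_date_literal(d: date) -> str:
    """Render a Python date as a BigQuery DATE literal."""
    return f"DATE '{d.isoformat()}'"


def bq_last_n_days(column: str, days: int) -> str:
    """Render a --where fragment selecting the last `days` days."""
    if days < 0:
        raise ValueError("days must be non-negative")
    return f"{column} >= DATE_SUB(CURRENT_DATE(), INTERVAL {int(days)} DAY)"


since_new_year = f"date >= {bq_date_literal(date(2026, 1, 1))}"
recent = bq_last_n_days("date", 30)
```

Either string can be passed as the value of `--where`.
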
### When the table you want isn't in `da catalog`

The table may exist in BigQuery but not be registered with Agnes yet. Two options:

1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed
   if the agnes server SA has BQ access:
   ```
   da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
     --sql "SELECT * FROM live"
   ```
2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up
   in `da catalog` and supports `da fetch` / `da query --remote`. This is the
   right path for any table you'll query repeatedly.

### Deeper guidance

For the full protocol, including hybrid-query examples, snapshot hygiene, and
when NOT to use `da fetch`, run:

```
da skills show agnes-data-querying
```

## Corporate Memory

Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.

- `km_<id>.md` — mandatory rules (always enforced)
- `km_approved.md` — approved guidance (confidence × recency ranked)

Run `da sync` to refresh. Rules are pruned automatically when items are revoked.

## Directory Structure
- `data/` — read-only data downloaded from server
- `data/parquet/` — table data in Parquet format
- `data/duckdb/` — local analytics DuckDB database
- `data/metadata/` — profiles, schema, metrics cache
- `user/` — your workspace (persistent across syncs)
- `user/artifacts/` — analysis outputs, reports, charts
- `user/sessions/` — Claude Code session logs
- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).