# {instance_name} — AI Data Analyst

This workspace is connected to {server_url}.

## Rules

- Before computing any business metric: run `da metrics show <category>/<name>`
- **For canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as source of truth.
- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
- Save work output to `user/artifacts/`
- Sync data regularly with `da sync`
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.

## Metrics Workflow

1. `da metrics list` — find the relevant metric
2. `da metrics show revenue/mrr` — read SQL and business rules
3. Use the canonical SQL from the metric definition, adapt to the question
4. Never invent metric calculations — always check existing definitions first

## Data Sync

- `da sync` — download current data from server
- `da sync --docs-only` — just metadata and metrics (fast refresh)
- `da sync --upload-only` — upload sessions and local notes to server
- Data on the server refreshes every {sync_interval}

## Remote Queries (BigQuery) — when data isn't on the laptop

Not every table is synced. Tables registered with `query_mode: "remote"` live in BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk. Tables you don't see in `data/parquet/` may still be queryable.

### Discovery first

```
da catalog --json | jq '.[] | {name, source_type, query_mode}'   # see all tables + their modes
da schema <table>                                                # columns + types
da describe <table> -n 5                                         # sample rows
```

For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.

### Three patterns for `query_mode: "remote"` tables

| Pattern | What it does | Use when |
|---------|--------------|----------|
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
| **`da query --remote`** | one-shot, server-side execution against BigQuery | single aggregate / cheap probe |
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |

### Permission model + cost — important

- BQ access goes through the **Agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <table>` can cost real money. ALWAYS:
  - filter via `--where` on the partition column (typically a date)
  - list specific columns in `--select`; BQ is a column store, so unlisted columns are not scanned (cheaper)
  - run `--estimate` first when unsure of the table size or partitioning

### `da fetch` discipline

```
# 1. ESTIMATE first — refuses to fetch without knowing the cost
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate

# 2. If reasonable, fetch as a named snapshot
da fetch <table> --select col1,col2 --where "..." --as my_recent

# 3. Query the local snapshot
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"

# 4. List + drop snapshots when done
da snapshot list
da snapshot drop my_recent
```

Rules of thumb:

- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when `da schema` shows the table has `partition_by` / `clustered_by` set, or when the fetch could plausibly exceed 1 GB of local bytes.
- Reuse snapshots across questions in the same conversation — run `da snapshot list` before fetching.

### Snapshot freshness — when to refresh

Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check `sync_schedule` per `da catalog`). For each new conversation:

```
da snapshot list                                             # see existing snapshots + their ages
da snapshot drop my_recent                                   # drop stale ones
da fetch <table> --select ... --where ... --as my_recent     # re-fetch
```

If the question is time-sensitive (e.g. "today's orders"), assume any snapshot older than the table's `sync_schedule` is stale and refresh.

### Hybrid query example — local + remote in one query

`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), its result is registered as a DuckDB view, then the joined query runs locally.

```
da query \
  --register-bq "traffic=SELECT date, country, SUM(views) AS views \
                 FROM \`prj.web_analytics.sessions\` \
                 WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
                 GROUP BY 1, 2" \
  --sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
         FROM orders o \
         JOIN traffic t ON o.date = t.date AND o.country = t.country \
         ORDER BY 1 DESC"
```

The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can compose multiple BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).

### BigQuery SQL flavor for `--where`

Tables whose `source_type` is `bigquery` use BigQuery dialect, not DuckDB:

- Date literal: `DATE '2026-01-01'`
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- Cast: `CAST(x AS INT64)` (NOT `INT`)

### When the table you want isn't in `da catalog`

The table may exist in BigQuery but not be registered with Agnes yet. Two options:

1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed if the Agnes server SA has BQ access:

   ```
   da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
     --sql "SELECT * FROM live"
   ```

2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up in `da catalog` and supports `da fetch` / `da query --remote`. This is the right path for any table you'll query repeatedly.

### Deeper guidance

For the full protocol, including hybrid-query examples, snapshot hygiene, and when NOT to use `da fetch`, run:

```
da skills show agnes-data-querying
```

## Corporate Memory

Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.

- `km_.md` — mandatory rules (always enforced)
- `km_approved.md` — approved guidance (confidence × recency ranked)

Run `da sync` to refresh. Rules are pruned automatically when items are revoked.

## Directory Structure

- `data/` — read-only data downloaded from server
  - `data/parquet/` — table data in Parquet format
  - `data/duckdb/` — local analytics DuckDB database
  - `data/metadata/` — profiles, schema, metrics cache
- `user/` — your workspace (persistent across syncs)
  - `user/artifacts/` — analysis outputs, reports, charts
  - `user/sessions/` — Claude Code session logs
- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).
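
## Sketch: guarding `da fetch` arguments

The `da fetch` rules of thumb in the Remote Queries section (explicit columns, a `--where` filter, an `--estimate` pass before the real fetch) could be enforced by a small wrapper when fetches are scripted. The following is a minimal Python sketch, not part of the `da` CLI. It assumes the table name is the first positional argument of `da fetch`; `build_fetch_args` and the table name `events` are hypothetical. It is stricter than the rule of thumb in that it does not model the `--limit` fallback, and it only builds the argument list without running anything.

```python
import shlex
from typing import List, Optional


def build_fetch_args(
    table: str,
    columns: List[str],
    where: str,
    snapshot: Optional[str] = None,
    estimate: bool = False,
) -> List[str]:
    """Build argv for `da fetch`, rejecting calls that break the rules of thumb.

    Assumption: the table name is the first positional argument of `da fetch`.
    """
    # Rule: always list specific columns, never an implicit SELECT *.
    if not columns or any(c.strip() in ("", "*") for c in columns):
        raise ValueError("list specific columns for --select; avoid SELECT *")
    # Rule: always filter remote tables (the --limit fallback is not modeled).
    if not where.strip():
        raise ValueError("remote tables need a --where filter")

    args = ["da", "fetch", table, "--select", ",".join(columns), "--where", where]
    if estimate:
        # Estimate pass: same filter and columns, no snapshot is created.
        return args + ["--estimate"]
    if not snapshot:
        raise ValueError("give a snapshot name for --as, or set estimate=True")
    return args + ["--as", snapshot]


if __name__ == "__main__":
    where = "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)"
    # Print the estimate command first, then the real fetch, shell-quoted.
    for estimate in (True, False):
        argv = build_fetch_args(
            "events", ["col1", "col2"], where,
            snapshot="my_recent", estimate=estimate,
        )
        print(shlex.join(argv))
```

Returning a list rather than a string keeps the BigQuery `--where` expression as a single argument, so it can be passed to `subprocess.run` without extra shell quoting.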