agnes-the-ai-analyst/config/claude_md_template.txt

{# Default analyst-onboarding workspace prompt for "da analyst setup".
   Rendered server-side by src/claude_md.py. Edit this file to change
   the OSS default; admins override per-instance via /admin/workspace-prompt.

   Available context (see docs/agent-workspace-prompt.md for the full reference):
     instance.name, instance.subtitle
     server.url, server.hostname
     sync_interval                — string from instance.yaml
     data_source.type             — keboola | bigquery | local
     tables                       — list of {name, description, query_mode}
     metrics.count, metrics.categories
     marketplaces                 — list of {slug, name, plugins:[{name}]}
     user.id, user.email, user.name, user.is_admin, user.groups
     now, today                   — datetime / date string
#}
# {{ instance.name }} — AI Data Analyst

This workspace is connected to {{ server.url }}.
{% if instance.subtitle %}Operated by **{{ instance.subtitle }}**.{% endif %}

## Rules
- Before computing any business metric: run `da metrics show <category>/<name>`
- **For canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as source of truth.
- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
- Save work output to `user/artifacts/`
- Sync data regularly with `da sync`
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.

## Metrics Workflow
1. `da metrics list` — find the relevant metric ({{ metrics.count }} available, categories: {{ metrics.categories | join(", ") or "none yet" }})
2. `da metrics show <category>/<name>` — read SQL and business rules
3. Use the canonical SQL from the metric definition, adapt to the question
4. Never invent metric calculations — always check existing definitions first

## Data Sync
- `da sync` — download current data from server
- `da sync --docs-only` — just metadata and metrics (fast refresh)
- `da sync --upload-only` — upload sessions and local notes to server
- Data on the server refreshes every {{ sync_interval }}

## Available Datasets
{% for t in tables -%}
- `{{ t.name }}`{% if t.description %} — {{ t.description }}{% endif %}{% if t.query_mode == "remote" %} *(remote, queried on demand)*{% endif %}
{% else -%}
- _No tables registered yet — ask an admin to register tables in the dashboard._
{% endfor %}

{% if marketplaces -%}
## Plugins available to you
{% for mp in marketplaces -%}
- **{{ mp.name }}** ({{ mp.slug }}): {{ mp.plugins | map(attribute="name") | join(", ") }}
{% endfor %}
{% endif -%}

## Remote Queries (BigQuery) — when data isn't on the laptop

Not every table is synced. Tables registered with `query_mode: "remote"` live in
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
Tables you don't see in `data/parquet/` may still be queryable.

### Discovery first

```
da catalog --json | jq '.[] | {name, source_type, query_mode}'   # see all tables + their modes
da schema <table>                                                # columns + types
da describe <table> -n 5                                         # sample rows
```

For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.

### Three patterns for `query_mode: "remote"` tables

| Pattern | How it works | Use when |
|---------|------|----------|
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
| **`da query --remote`** | one-shot, server-side execution against BigQuery (works for BASE TABLE rows directly + VIEW/MATERIALIZED_VIEW rows via the BQ jobs API; cost-guarded by a 5 GiB scan cap configurable in /admin/server-config) | single aggregate / cheap probe |
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |

### Permission model + cost — important

- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
  - filter via `--where` on the partition column (typically a date)
  - list specific columns in `--select`; BigQuery is columnar, so columns you leave out are not scanned or billed
  - run `--estimate` first when unsure of the table size or partitioning
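
As a rough sanity check, an estimated scan size can be compared with the 5 GiB scan cap noted for `da query --remote`. The sketch below is illustrative only: `within_scan_cap` is a hypothetical helper, not part of `da`, and it assumes you read the byte count off the estimate yourself, since the estimate's output format is not specified here.

```shell
# Hypothetical helper: compare an estimated scan size (in bytes) with the
# 5 GiB cap quoted for `da query --remote`. The cap is configurable, so pass
# your instance's value (in GiB) as the second argument if it differs.
within_scan_cap() {
  estimated_bytes=$1
  cap_gib=${2:-5}
  cap_bytes=$((cap_gib * 1024 * 1024 * 1024))
  if [ "$estimated_bytes" -le "$cap_bytes" ]; then
    echo "ok: $estimated_bytes bytes is within the $cap_gib GiB cap"
  else
    echo "too large: $estimated_bytes bytes exceeds the $cap_gib GiB cap ($cap_bytes bytes)"
    return 1
  fi
}

within_scan_cap 1200000000           # about 1.1 GiB, prints an "ok" line
within_scan_cap 8000000000 || true   # about 7.5 GiB, prints a "too large" line
```

If the check fails, tighten the partition filter or drop columns before running anything against BigQuery.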

### `da fetch` discipline

```
# 1. ESTIMATE first: do not fetch without knowing the cost
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate

# 2. If reasonable, fetch as a named snapshot
da fetch <table> --select col1,col2 --where "..." --as my_recent

# 3. Query the local snapshot
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"

# 4. List + drop snapshots when done
da snapshot list
da snapshot drop my_recent
```

Rules of thumb:
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when `da schema` shows the table is partitioned or
  clustered (`partition_by` / `clustered_by`), or when the result could plausibly
  exceed 1 GB on local disk.
- Reuse snapshots across questions in the same conversation — `da snapshot list`
  before fetching.
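
The rules of thumb above can be sketched as a small dry-run guard. This is a minimal illustration, not part of `da`: `build_fetch` is a hypothetical function that only prints a command, and it assumes `--limit` takes a row count, which the rules above do not spell out. The table and snapshot names are placeholders.

```shell
# Hypothetical dry-run helper: print a `da fetch` command only when the
# rules of thumb are met. It never runs `da` itself.
# Usage: build_fetch <table> <columns> <where> <snapshot_name> [limit]
build_fetch() {
  table=$1; columns=$2; where=$3; name=$4; limit=${5:-}
  # Rule: always list specific columns.
  if [ -z "$columns" ] || [ "$columns" = "*" ]; then
    echo "refusing: list specific columns in --select" >&2
    return 1
  fi
  # Rule: always filter with --where, otherwise fall back to --limit.
  if [ -z "$where" ] && [ -z "$limit" ]; then
    echo "refusing: add a --where filter or a --limit" >&2
    return 1
  fi
  cmd="da fetch $table --select $columns"
  if [ -n "$where" ]; then cmd="$cmd --where \"$where\""; fi
  if [ -n "$limit" ]; then cmd="$cmd --limit $limit"; fi
  echo "$cmd --as $name"
}

build_fetch orders "date,country,revenue" \
  "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" recent_orders
```

Review the printed command, run it with `--estimate` in place of `--as <name>` first, and only then fetch the snapshot.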

### Snapshot freshness — when to refresh

Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check each table's `sync_schedule` in `da catalog`). For each new conversation:

```
da snapshot list                            # see existing snapshots + their ages
da snapshot drop my_recent                  # drop stale ones
da fetch <table> --select ... --where ... --as my_recent   # re-fetch
```

If the question is time-sensitive (e.g. "today's orders"), assume any snapshot older than the table's `sync_schedule` is stale and refresh.
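
The staleness rule can be written down as a one-line comparison. The sketch below is a hypothetical helper, not a `da` command. It assumes you translate the snapshot age and the table's `sync_schedule` into whole hours yourself, because the output formats of `da snapshot list` and `da catalog` are not described here.

```shell
# Hypothetical helper: succeed (exit 0) when a snapshot is older than the
# table's refresh interval. Both arguments are whole hours supplied by hand.
snapshot_is_stale() {
  age_hours=$1
  refresh_interval_hours=$2
  [ "$age_hours" -gt "$refresh_interval_hours" ]
}

if snapshot_is_stale 30 24; then
  echo "stale: drop and re-fetch"
else
  echo "fresh: reuse the snapshot"
fi
```

With a 30-hour-old snapshot and a daily refresh, this prints the "stale" line.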

### Hybrid query example — local + remote in one query

`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), its result is registered as a DuckDB view, and then the joined query runs locally.

```
da query \
  --register-bq "traffic=SELECT date, country, SUM(views) AS views \
                 FROM \`prj.web_analytics.sessions\` \
                 WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
                 GROUP BY 1, 2" \
  --sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
         FROM orders o \
         JOIN traffic t ON o.date = t.date AND o.country = t.country \
         ORDER BY 1 DESC"
```

The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can compose multiple BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).

### BigQuery SQL flavor for `--where`

Tables with source type `bigquery` use the BigQuery SQL dialect, not DuckDB's:

- Date literal: `DATE '2026-01-01'`
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- Cast: `CAST(x AS INT64)` (NOT `INT`)
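
To avoid retyping the date arithmetic, a "last N days" predicate in BigQuery dialect can be generated with a small shell function. This is only a sketch: `bq_last_n_days` is a hypothetical helper, and the column name `date` is a placeholder for whatever the table's partition column is.

```shell
# Hypothetical helper: build a BigQuery-dialect "last N days" predicate
# suitable for --where. Column name and day count are caller-supplied.
bq_last_n_days() {
  column=$1
  days=$2
  # Reject anything that is not a plain non-negative integer.
  case $days in
    ''|*[!0-9]*) echo "days must be a whole number" >&2; return 1 ;;
  esac
  echo "$column >= DATE_SUB(CURRENT_DATE(), INTERVAL $days DAY)"
}

bq_last_n_days date 30
# prints: date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
```

The result can be passed straight to a filter, for example `--where "$(bq_last_n_days date 30)"`.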

### When the table you want isn't in `da catalog`

The table may exist in BigQuery but not be registered with Agnes yet. Two options:

1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed
   if the agnes server SA has BQ access:
   ```
   da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
            --sql "SELECT * FROM live"
   ```
2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up
   in `da catalog` and supports `da fetch` / `da query --remote`. This is the
   right path for any table you'll query repeatedly.

### Deeper guidance

For the full protocol, including hybrid-query examples, snapshot hygiene, and
when NOT to use `da fetch`, run:

```
da skills show agnes-data-querying
```

## Corporate Memory

Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.

- `km_<id>.md` — mandatory rules (always enforced)
- `km_approved.md` — approved guidance (confidence × recency ranked)

Run `da sync` to refresh. Rules are pruned automatically when items are revoked.

## Directory Structure
- `data/` — read-only data downloaded from server
  - `data/parquet/` — table data in Parquet format
  - `data/duckdb/` — local analytics DuckDB database
  - `data/metadata/` — profiles, schema, metrics cache
- `user/` — your workspace (persistent across syncs)
  - `user/artifacts/` — analysis outputs, reports, charts
  - `user/sessions/` — Claude Code session logs
- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).

_Hello {{ user.name or user.email }} — generated {{ today }}._