agnes-the-ai-analyst/config/claude_md_template.txt

{# Default analyst-onboarding welcome prompt for "da analyst setup".
Rendered server-side by src/welcome_template.py. Edit this file to change
the OSS default; admins override per-instance via /admin/welcome.
Available context (see docs/welcome-template.md for the full reference):
instance.name, instance.subtitle
server.url, server.hostname
sync_interval — string from instance.yaml
data_source.type — keboola | bigquery | local
tables — list of {name, description, query_mode}
metrics.count, metrics.categories
marketplaces — list of {slug, name, plugins:[name]}
user.email, user.name, user.is_admin, user.groups
now, today — datetime / date string
#}
# {{ instance.name }} — AI Data Analyst
This workspace is connected to {{ server.url }}.
{% if instance.subtitle %}Operated by **{{ instance.subtitle }}**.{% endif %}
## Rules
- Before computing any business metric: run `da metrics show <category>/<name>`
- **For canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as source of truth.
- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
- Save work output to `user/artifacts/`
- Sync data regularly with `da sync`
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.
## Metrics Workflow
1. `da metrics list` — find the relevant metric ({{ metrics.count }} available, categories: {{ metrics.categories | join(", ") or "none yet" }})
2. `da metrics show <category>/<name>` — read SQL and business rules
3. Use the canonical SQL from the metric definition, adapt to the question
4. Never invent metric calculations — always check existing definitions first
## Data Sync
- `da sync` — download current data from server
- `da sync --docs-only` — just metadata and metrics (fast refresh)
- `da sync --upload-only` — upload sessions and local notes to server
- Data on the server refreshes every {{ sync_interval }}
## Available Datasets
{% for t in tables -%}
- `{{ t.name }}`{% if t.description %} — {{ t.description }}{% endif %}{% if t.query_mode == "remote" %} *(remote, queried on demand)*{% endif %}
{% else -%}
- _No tables registered yet — ask an admin to register tables in the dashboard._
{% endfor %}
{% if marketplaces -%}
## Plugins available to you
{% for mp in marketplaces -%}
- **{{ mp.name }}** ({{ mp.slug }}): {{ mp.plugins | map(attribute="name") | join(", ") }}
{% endfor %}
{% endif -%}
## Remote Queries (BigQuery) — when data isn't on the laptop
Not every table is synced. Tables registered with `query_mode: "remote"` live in
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
Tables you don't see in `data/parquet/` may still be queryable.
### Discovery first
```
da catalog --json | jq '.[] | {name, source_type, query_mode}' # see all tables + their modes
da schema <table> # columns + types
da describe <table> -n 5 # sample rows
```
For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.
### Three patterns for `query_mode: "remote"` tables
| Pattern | What it does | Use when |
|---------|------|----------|
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
| **`da query --remote`** | one-shot, server-side execution against BigQuery | single aggregate / cheap probe |
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
### Permission model + cost — important
- BQ access goes through the **Agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
  - filter via `--where` on the partition column (typically a date)
  - list specific columns in `--select`; BigQuery storage is columnar, so columns you leave out are not scanned or billed
  - run `--estimate` first when unsure of the table size or partitioning
### `da fetch` discipline
```
# 1. ESTIMATE first — refuses to fetch without knowing the cost
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate
# 2. If reasonable, fetch as a named snapshot
da fetch <table> --select col1,col2 --where "..." --as my_recent
# 3. Query the local snapshot
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"
# 4. List + drop snapshots when done
da snapshot list
da snapshot drop my_recent
```
Rules of thumb:
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when `da schema` shows the table has `partition_by` / `clustered_by` set,
  or when the fetch could plausibly exceed 1 GB on local disk.
- Reuse snapshots across questions in the same conversation: run `da snapshot list`
  before fetching.
### Snapshot freshness — when to refresh
Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check `sync_schedule` per `da catalog`). For each new conversation:
```
da snapshot list # see existing snapshots + their ages
da snapshot drop my_recent # drop stale ones
da fetch <table> --select ... --where ... --as my_recent # re-fetch
```
If the question is time-sensitive (e.g. "today's orders"), assume any snapshot older than the refresh interval in the table's `sync_schedule` is stale and re-fetch it.
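The staleness rule above can be sketched as a small shell check. This is only an illustration: `snapshot_is_stale` is a hypothetical helper, not part of the `da` CLI, and it assumes you have already read the snapshot's fetch time from `da snapshot list` and converted the table's `sync_schedule` to a maximum age in seconds (the output format of both is not specified here).

```shell
# Hypothetical helper, not part of the da CLI.
# Usage: snapshot_is_stale <fetched_epoch> <max_age_seconds> [now_epoch]
# Exit status 0 means "stale, re-fetch"; 1 means "still fresh".
snapshot_is_stale() {
  _fetched=$1
  _max_age=$2
  _now=${3:-$(date +%s)}
  # Stale only when the snapshot is strictly older than the allowed age.
  [ $((_now - _fetched)) -gt "$_max_age" ]
}

# Example: a snapshot fetched 30 hours ago, checked against an assumed
# daily refresh (86400 seconds).
now=$(date +%s)
if snapshot_is_stale $((now - 30 * 3600)) 86400 "$now"; then
  echo "stale: drop and re-fetch"
else
  echo "fresh: reuse"
fi
```

With a 30-hour-old snapshot and a daily refresh the check reports the snapshot as stale, which is the cue to run `da snapshot drop` and `da fetch` again.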
### Hybrid query example — local + remote in one query
`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), its result is registered as a DuckDB view, and then the joined query runs locally.
```
da query \
--register-bq "traffic=SELECT date, country, SUM(views) AS views \
FROM \`prj.web_analytics.sessions\` \
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
GROUP BY 1, 2" \
--sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
FROM orders o \
JOIN traffic t ON o.date = t.date AND o.country = t.country \
ORDER BY 1 DESC"
```
The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can compose multiple BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).
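For the `--stdin` form, a quoted heredoc avoids the backslash-escaped backticks that the flag form needs. The sketch below is a guess at the payload shape: it assumes `register_bq` maps each view alias to its BigQuery SQL, mirroring `alias=SQL` in the flag form. The exact JSON schema is not spelled out here, so confirm it with `da skills show agnes-data-querying` before relying on it.

```shell
# Assumed payload shape: "register_bq" maps view alias -> BigQuery SQL.
# The quoted 'EOF' delimiter stops the shell from expanding backticks or $.
payload=$(cat <<'EOF'
{
  "register_bq": {
    "traffic": "SELECT date, country, SUM(views) AS views FROM `prj.web_analytics.sessions` WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1, 2"
  },
  "sql": "SELECT o.date, o.country, o.revenue, t.views FROM orders o JOIN traffic t ON o.date = t.date AND o.country = t.country ORDER BY 1 DESC"
}
EOF
)

# Inspect the payload first; to run it, pipe the same output into
# `da query --stdin`.
printf '%s\n' "$payload"
```

Keeping the payload in a variable makes it easy to review the BQ subquery (and its `WHERE` / `GROUP BY`) before anything is billed.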
### BigQuery SQL flavor for `--where`
Tables whose source type is `bigquery` take `--where` clauses in the BigQuery SQL dialect, not DuckDB's:
- Date literal: `DATE '2026-01-01'`
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- Cast: `CAST(x AS INT64)` (NOT `INT`)
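Several of these forms contain single quotes, so the shell quoting matters. A minimal sketch of a combined filter, where the column names `event_date`, `country`, and `quantity` are hypothetical:

```shell
# Hypothetical columns: event_date, country, quantity.
# Double quotes on the outside let the single-quoted BigQuery literals
# (DATE '...' and r'...') pass through the shell unchanged.
where="event_date >= DATE '2026-01-01' AND REGEXP_CONTAINS(country, r'^(US|CA)') AND CAST(quantity AS INT64) > 0"

# Print the clause to check it before spending a query on it.
printf '%s\n' "$where"

# It can then be passed as one argument, for example:
#   da fetch <table> --select event_date,country,quantity --where "$where" --estimate
```

If a regex needs `$` or a backslash, escape it for the shell or switch the outer quoting, since double quotes still expand those characters.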
### When the table you want isn't in `da catalog`
The table may exist in BigQuery but not be registered with Agnes yet. Two options:
1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed
   if the Agnes server SA has BQ access:
```
da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
--sql "SELECT * FROM live"
```
2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up
in `da catalog` and supports `da fetch` / `da query --remote`. This is the
right path for any table you'll query repeatedly.
### Deeper guidance
For the full protocol, including hybrid-query examples, snapshot hygiene, and
when NOT to use `da fetch`, run:
```
da skills show agnes-data-querying
```
## Corporate Memory
Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.
- `km_<id>.md` — mandatory rules (always enforced)
- `km_approved.md` — approved guidance (confidence × recency ranked)
Run `da sync` to refresh. Rules are pruned automatically when items are revoked.
## Directory Structure
- `data/` — read-only data downloaded from server
  - `data/parquet/` — table data in Parquet format
  - `data/duckdb/` — local analytics DuckDB database
  - `data/metadata/` — profiles, schema, metrics cache
- `user/` — your workspace (persistent across syncs)
  - `user/artifacts/` — analysis outputs, reports, charts
  - `user/sessions/` — Claude Code session logs
- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).
_Hello {{ user.name or user.email }} — generated {{ today }}._