Fixes the rails docs that PR #154 over-promised. The reporter (#160) tried `da query --remote` against a VIEW row and saw a catalog error; the previous version of the docs said this would work as a one-shot server-side execution. Now it actually does (see prior commits), but the docs also need to acknowledge the new cost guardrail and the registry-gated direct-bq path.

Touched files:

- **CLAUDE.md** (root, "Querying Agnes data — agent rails"): the `da query --remote` bullet under "Choose the right tool" now spells out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry + the 5 GiB scan cap + the registry-gating of direct bq.* paths. "When NOT to use `da fetch`" decision matrix updated with a separate row for VIEW aggregates so analysts see why the cap might trip.
- **config/claude_md_template.txt** (PR #154's analyst CLAUDE.md): three-patterns table caveat for the cost guardrail.
- **cli/skills/agnes-data-querying.md**: `When NOT to use da fetch` matrix updated with the same VIEW caveat + registry-gating note.
- **cli/skills/agnes-table-registration.md:121**: replaced the example that suggested raw `bq."<project>.<dataset>.<table>"` syntax (now blocked by the RBAC patch) with the registered-name form.
- **CHANGELOG.md**: full Unreleased entry with Added (Test Connection endpoint + cost-cap server-config knob + placeholder UI), Fixed (the five #160-class fixes: VIEW resolution, RBAC patch, blocklist, bigquery_query() blocking, CLI render, hybrid endpoint detail flattening), Changed (BREAKING legacy_wrap_views removal + quota relocation).

140 tests pass across the issue-affected files.
124 lines
5.5 KiB
Markdown
---
name: agnes-data-querying
description: Use when querying any data in Agnes — discovery first, estimate before fetch, materialize scoped subsets locally
---

# Querying Agnes data

When asked about ANY data in Agnes, follow this protocol: **discover → choose tool → fetch (with estimate) → query locally → clean up**.

## Discovery first

Before writing ANY query, understand what's available:

```bash
da catalog --json | jq <filter>     # know what's available
da schema <table>                   # learn columns + types
da describe <table> -n 5            # see real values for shape
```

**Never** write `SELECT * FROM <table>` blindly. For local-mode tables it's wasteful; for remote-mode tables it can blow up at 225M+ rows.

## Choose the right tool

Tables in `da catalog` have a `query_mode`:

| Mode | Means | How to query |
|------|-------|--------------|
| `local` | parquet synced on laptop | `da query "SELECT …"` directly |
| `remote` (BigQuery) | parquet NOT on laptop | `da fetch` subset → snapshot, OR `da query --remote` one-shot |

For **remote tables**, you MUST use one of:
1. `da fetch` a filtered subset → query the local snapshot (preferred), OR
2. `da query --remote` for one-shot server-side execution, OR
3. `da query --register-bq` for hybrid joins (rare; see docs)

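The table above amounts to a small dispatch on `query_mode`. The sketch below is illustrative only: `suggest_tool` is a hypothetical helper, not part of the `da` CLI, and it assumes the mode strings are exactly `local` and `remote` as shown in the table.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of da): map a catalog query_mode to the
# default approach recommended in the table above.
suggest_tool() {
  case "$1" in
    local)
      echo "da query directly against the synced parquet" ;;
    remote)
      echo "da fetch a filtered subset, then da query the snapshot" ;;
    *)
      # Anything else is treated as an error rather than guessed at.
      echo "unknown query_mode: $1" >&2
      return 1 ;;
  esac
}
```

For a remote table this prints only the preferred path; the `--remote` and `--register-bq` options remain judgment calls covered by the rest of this skill.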
## The `da fetch` workflow (preferred for remote tables)

### 1. Estimate first

Always estimate before fetching:

```bash
da fetch web_sessions_example \
  --select event_date,country_code,session_id \
  --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
           AND country_code = 'CZ'" \
  --estimate
```

Output tells you scan cost, expected rows, and local bytes, so you know if it's reasonable.

### 2. If reasonable, fetch to snapshot

```bash
da fetch web_sessions_example \
  --select event_date,country_code,session_id \
  --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
           AND country_code = 'CZ'" \
  --as cz_recent
```

### 3. Query the local snapshot

```bash
da query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
```

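Steps 1 and 2 repeat the same `--select` and `--where` scope, so one way to keep them from drifting apart is to hold the scope in variables. This is a minimal sketch, not an official wrapper: `estimate_scope` and `fetch_scope` are hypothetical names, and only the flags shown in the steps above are used.

```bash
#!/usr/bin/env bash
# Sketch: define the fetch scope once so the estimate and the real fetch
# are guaranteed to use identical columns and filters.
TABLE="web_sessions_example"
COLUMNS="event_date,country_code,session_id"
WHERE="event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND country_code = 'CZ'"

# Step 1: dry run that reports scan cost, expected rows, and local bytes.
estimate_scope() {
  da fetch "$TABLE" --select "$COLUMNS" --where "$WHERE" --estimate
}

# Step 2: real fetch into the snapshot named by the first argument.
fetch_scope() {
  da fetch "$TABLE" --select "$COLUMNS" --where "$WHERE" --as "$1"
}
```

A typical session would run `estimate_scope`, read the numbers, and only then run `fetch_scope cz_recent` followed by the `da query` from step 3.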
## Heuristics for `da fetch`

| Requirement | Why |
|-------------|-----|
| **Always `--select` specific columns** | Avoid implicit `SELECT *` on remote (expensive) |
| **Always `--where` for remote tables** | Keeps the result bounded; if no filter applies, add `--limit` instead |
| **Always `--estimate` first if unsure** | Partitioning, clustering, and data shape all affect cost; dry runs are free |
| **Reuse snapshots across questions** | Run `da snapshot list` before fetching; if a suitable snapshot exists, skip the fetch |

## BigQuery SQL flavor for `--where`

For `source_type=bigquery` (per `da catalog`), use BigQuery SQL syntax:

| Syntax | Example |
|--------|---------|
| Date literal | `DATE '2026-01-01'` (NOT `'2026-01-01'::date`) |
| Timestamp literal | `TIMESTAMP '2026-01-01 00:00:00 UTC'` |
| Now | `CURRENT_DATE()`, `CURRENT_TIMESTAMP()` |
| Date arithmetic | `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)` |
| Regex | `REGEXP_CONTAINS(col, r'pattern')` (raw string!) |
| NULL check | `col IS NOT NULL` (standard) |
| Cast | `CAST(x AS INT64)` (NOT `INT`) |

For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in your `da query` calls; there's no `--where` on local since fetch is implicit.

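To make the dialect split concrete, the sketch below builds the same "last N days" predicate in either flavor depending on `source_type`. `recent_days_predicate` is a hypothetical helper, not part of `da`, and the DuckDB form (`CURRENT_DATE - INTERVAL n DAY`) is one common spelling chosen for illustration rather than something this skill prescribes. Column names are interpolated as-is, so it is only suitable for trusted input.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of da): emit a "last N days" predicate in
# the SQL dialect that matches the table's source_type.
recent_days_predicate() {
  local source_type="$1" column="$2" days="$3"
  # Reject anything that is not a plain non-negative integer.
  case "$days" in
    ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
  esac
  case "$source_type" in
    bigquery)
      # BigQuery SQL, for use in da fetch --where on remote tables.
      printf '%s >= DATE_SUB(CURRENT_DATE(), INTERVAL %s DAY)\n' "$column" "$days" ;;
    keboola|jira)
      # DuckDB SQL, for use inside da query on local tables.
      printf '%s >= CURRENT_DATE - INTERVAL %s DAY\n' "$column" "$days" ;;
    *)
      echo "unknown source_type: $source_type" >&2
      return 1 ;;
  esac
}
```

For example, `recent_days_predicate bigquery event_date 30` yields the date-arithmetic form shown in the table, ready to pass to `--where`.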
## Snapshot hygiene

- Reuse snapshots across questions in the same conversation
- Use descriptive names: `cz_recent`, `orders_q1_us`, `sessions_today`
- Drop with `da snapshot drop <name>` when done with a topic
- Check total cache size with `da disk-info`

## When NOT to use `da fetch`

| Scenario | Use instead |
|----------|------------|
| Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `da query --remote "SELECT COUNT(*) FROM web_sessions_example"`: cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160), but the BQ jobs API can't push WHERE/COUNT into the view body. The cost guardrail (default 5 GiB) rejects expensive scans with 400 `remote_scan_too_large` and a `da fetch` suggestion. If rejected, pivot to `da fetch <id> --where '<predicate>'`. |
| Throwaway exploration with raw BQ syntax | `da query --remote "SELECT … FROM <registered_id>"`. Direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. |
| Cross-table JOIN with both remote | Use `da fetch` for one side + `da query --remote` for the other; full cross-remote JOIN needs design (see #101) |

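The two rejection codes in the table each imply a different next step. The sketch below maps error text to that step; `next_step_for_remote_error` is a hypothetical helper, not part of `da`, and it assumes the error code appears verbatim somewhere in the text the CLI prints, which this document does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of da): given the error text from a failed
# da query --remote call, print the next step suggested in the table above.
# Assumption: the error code appears verbatim in that text.
next_step_for_remote_error() {
  local error_text="$1"
  case "$error_text" in
    *remote_scan_too_large*)
      echo "Scan exceeds the cost guardrail: narrow it with da fetch <id> --where '<predicate>'" ;;
    *bq_path_not_registered*)
      echo "Direct bq path is not registered: register the table or query by catalog id" ;;
    *)
      echo "Unrecognized error: read the full message before retrying" ;;
  esac
}
```

If the CLI exits non-zero on these rejections (also an assumption), the captured output of the failed call can be passed straight to the helper.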
## When the table you need isn't in `da catalog`

The catalog reads from `system.duckdb::table_registry`; entries land there only via admin registration, not auto-discovery. If `da catalog` doesn't show what the user is asking about:

1. Tell the user the table isn't registered
2. Hand off to an admin (or, if you have admin role yourself, follow the **agnes-table-registration** skill)
3. Don't `da query --remote` your way around it: the catalog gap means the registry doesn't track this dataset, RBAC can't gate it, and quotas don't apply

## Protocol summary

1. **Discover**: `da catalog`, `da schema`, `da describe`
2. **Check query_mode**: local (direct) or remote (fetch or --remote)?
3. **For remote**: `--estimate` first, then `da fetch` with `--select` + `--where`
4. **Snapshot name**: descriptive (`cz_recent`), reuse across questions
5. **Query**: `da query` against snapshot; DuckDB SQL syntax
6. **Cleanup**: `da snapshot drop` when done; `da disk-info` to check size