agnes-the-ai-analyst/cli/skills/agnes-data-querying.md
ZdenekSrotyr b2d54126dc docs(query): #160 align CLAUDE.md/skills/CHANGELOG with new --remote behavior + cost guardrail
Fixes the rails docs that PR #154 over-promised. The reporter (#160)
tried `da query --remote` against a VIEW row and saw a catalog error;
the previous version of the docs said this would work as a one-shot
server-side execution. Now it actually does (see prior commits), but
the docs also need to acknowledge the new cost guardrail and the
registry-gated direct-bq path.

Touched files:

- **CLAUDE.md** (root, "Querying Agnes data — agent rails"): the
  `da query --remote` bullet under "Choose the right tool" now spells
  out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry +
  the 5 GiB scan cap + the registry-gating of direct bq.* paths.
  "When NOT to use `da fetch`" decision matrix updated with a separate
  row for VIEW aggregates so analysts see why the cap might trip.
- **config/claude_md_template.txt** (PR #154's analyst CLAUDE.md):
  three-patterns table caveat for the cost guardrail.
- **cli/skills/agnes-data-querying.md**: `When NOT to use da fetch`
  matrix updated with the same VIEW caveat + registry-gating note.
- **cli/skills/agnes-table-registration.md:121**: replaced the
  example that suggested raw `bq."<project>.<dataset>.<table>"` syntax
  (now blocked by the RBAC patch) with the registered-name form.
- **CHANGELOG.md**: full Unreleased entry with Added (Test Connection
  endpoint + cost-cap server-config knob + placeholder UI), Fixed (the
  five #160-class fixes: VIEW resolution, RBAC patch, blocklist,
  bigquery_query() blocking, CLI render, hybrid endpoint detail
  flattening), Changed (BREAKING legacy_wrap_views removal + quota
  relocation).

140 tests pass across the issue-affected files.
2026-05-04 10:33:06 +02:00


name: agnes-data-querying
description: Use when querying any data in Agnes — discovery first, estimate before fetch, materialize scoped subsets locally

Querying Agnes data

When asked about ANY data in Agnes, follow this protocol: discover → choose tool → fetch (with estimate) → query locally → clean up.

Discovery first

Before writing ANY query, understand what's available:

da catalog --json | jq <filter>     # know what's available
da schema <table>                    # learn columns + types
da describe <table> -n 5             # see real values for shape

Never write SELECT * FROM <table> blindly. On local-mode tables it is wasteful; on remote-mode tables it can mean scanning 225M+ rows.
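
The discovery commands above can be bundled into a small loop when several tables are in play. This is only a sketch: `discover_tables` is a hypothetical helper name, and it assumes `da` is on the PATH and behaves as shown above.

```shell
# Sketch: inspect several tables before writing any SQL.
# discover_tables is a hypothetical helper, not part of the da CLI.
discover_tables() {
    local table
    for table in "$@"; do
        printf '== %s ==\n' "$table"
        da schema "$table"           # columns + types
        da describe "$table" -n 5    # five sample rows for shape
    done
}
```

Usage: `discover_tables web_sessions_example` prints the schema and five sample rows before any query is written.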

Choose the right tool

Tables in da catalog have a query_mode:

| Mode | Means | How to query |
| --- | --- | --- |
| local | parquet synced on laptop | da query "SELECT …" directly |
| remote (BigQuery) | parquet NOT on laptop | da fetch subset → snapshot, OR da query --remote one-shot |

For remote tables, you MUST do one of the following:

  1. da fetch a filtered subset → query the local snapshot (preferred), OR
  2. da query --remote for one-shot server-side execution, OR
  3. da query --register-bq for hybrid joins (rare; see docs)
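
As a rough illustration of the decision above, the mapping from query_mode to the preferred approach could be written as a lookup. `suggest_tool` is a hypothetical helper for illustration only; it prints a recommendation and does not call `da`.

```shell
# Sketch: map a table's query_mode to the approach recommended above.
# suggest_tool is a hypothetical helper, not part of the da CLI.
suggest_tool() {
    case "$1" in
        local)  echo "da query directly" ;;
        remote) echo "da fetch a filtered subset, then da query the snapshot" ;;
        *)      echo "unknown query_mode: $1" >&2; return 1 ;;
    esac
}
```

Anything other than `local` or `remote` is treated as an error rather than guessed at, which mirrors the "discover first" rule.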

The da fetch workflow (preferred for remote tables)

1. Estimate first

Always estimate before fetching:

da fetch web_sessions_example \
    --select event_date,country_code,session_id \
    --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 
             AND country_code = 'CZ'" \
    --estimate

Output tells you scan cost, expected rows, and local bytes — so you know if it's reasonable.

2. If reasonable, fetch to snapshot

da fetch web_sessions_example \
    --select event_date,country_code,session_id \
    --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 
             AND country_code = 'CZ'" \
    --as cz_recent

3. Query the local snapshot

da query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
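
The three steps can be chained so the estimate is always seen before the fetch runs. The wrapper below is a minimal sketch: `fetch_with_estimate` is a hypothetical name, the confirmation prompt is an addition for illustration, and only the `--select`, `--where`, `--estimate`, and `--as` flags shown above are used.

```shell
# Sketch: estimate, ask for confirmation, then fetch into a named snapshot.
# fetch_with_estimate is a hypothetical wrapper, not part of the da CLI.
fetch_with_estimate() {
    local table="$1" columns="$2" where="$3" snapshot="$4" answer

    # Step 1: dry run; stop if the estimate itself fails.
    da fetch "$table" --select "$columns" --where "$where" --estimate || return 1

    # Pause so the reported scan cost can be judged before any data moves.
    printf 'Fetch into snapshot "%s"? [y/N] ' "$snapshot"
    read -r answer
    [ "$answer" = "y" ] || { echo "skipped"; return 0; }

    # Step 2: materialize the filtered subset locally.
    da fetch "$table" --select "$columns" --where "$where" --as "$snapshot"
}
```

Usage: `fetch_with_estimate web_sessions_example event_date,country_code,session_id "country_code = 'CZ'" cz_recent`, followed by a plain `da query` against `cz_recent`.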

Heuristics for da fetch

| Requirement | Why |
| --- | --- |
| Always --select specific columns | Avoid implicit SELECT * on remote (expensive) |
| Always --where for remote tables | Keeps the scan bounded; if no filter applies, add --limit instead |
| Run --estimate first when unsure | Partition/clustering metadata and result shape both affect cost; dry runs are free |
| Reuse snapshots across questions | Run da snapshot list before fetching; if a suitable snapshot exists, skip the fetch |
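
These heuristics can be enforced mechanically before a fetch is sent. The check below is a sketch under stated assumptions: `check_fetch_args` is a hypothetical pre-flight helper, and it only recognizes flags passed as separate words (`--select cols`, not `--select=cols`).

```shell
# Sketch: refuse an unbounded fetch before it reaches the server.
# check_fetch_args is a hypothetical pre-flight check, not a da feature.
check_fetch_args() {
    local arg has_select=0 has_bound=0
    for arg in "$@"; do
        case "$arg" in
            --select)        has_select=1 ;;
            --where|--limit) has_bound=1 ;;
        esac
    done
    [ "$has_select" -eq 1 ] || { echo "missing --select" >&2; return 1; }
    [ "$has_bound" -eq 1 ]  || { echo "missing --where or --limit" >&2; return 1; }
}
```

Usage: `check_fetch_args "$@" && da fetch "$@"`, so a fetch without column selection or a bound never runs.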

BigQuery SQL flavor for --where

For source_type=bigquery (per da catalog), use BigQuery SQL syntax:

| Syntax | Example |
| --- | --- |
| Date literal | DATE '2026-01-01' (NOT '2026-01-01'::date) |
| Timestamp literal | TIMESTAMP '2026-01-01 00:00:00 UTC' |
| Now | CURRENT_DATE(), CURRENT_TIMESTAMP() |
| Date arithmetic | DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) |
| Regex | REGEXP_CONTAINS(col, r'pattern') (raw string!) |
| NULL check | col IS NOT NULL (standard) |
| Cast | CAST(x AS INT64) (NOT INT) |

For source_type=keboola or source_type=jira (local mode), use DuckDB SQL in your da query calls. There is no --where step for local tables because the parquet is already synced; filter in the query itself.
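
To avoid slipping DuckDB idioms into a remote filter, a BigQuery-flavored predicate can be generated rather than typed by hand. This is a sketch: `bq_last_n_days` is a hypothetical helper, and the SQL it emits uses only the DATE_SUB form listed above.

```shell
# Sketch: build a BigQuery-flavored "last N days" predicate for --where.
# bq_last_n_days is a hypothetical helper, not part of the da CLI.
bq_last_n_days() {
    local column="$1" days="$2"
    # Reject non-numeric input so nothing unexpected is spliced into SQL.
    case "$days" in
        ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
    esac
    printf '%s >= DATE_SUB(CURRENT_DATE(), INTERVAL %s DAY)' "$column" "$days"
}
```

Usage: `da fetch web_sessions_example --select event_date,session_id --where "$(bq_last_n_days event_date 30)" --estimate`.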

Snapshot hygiene

  • Reuse snapshots across questions in the same conversation
  • Use descriptive names: cz_recent, orders_q1_us, sessions_today
  • Drop with da snapshot drop <name> when done with a topic
  • Check total cache size with da disk-info
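
Snapshot reuse can be made the default with a fetch-if-missing wrapper. The sketch below rests on an assumption about output format: it expects `da snapshot list` to print each snapshot name as a whitespace-delimited word, which the source does not confirm. `ensure_snapshot` is a hypothetical helper name.

```shell
# Sketch: fetch only when no snapshot with this name exists yet.
# Assumption: da snapshot list prints each snapshot name as a separate
# whitespace-delimited word; adjust the match if the real output differs.
ensure_snapshot() {
    local name="$1"
    shift
    if da snapshot list | grep -qw -- "$name"; then
        echo "reusing snapshot $name"
    else
        # Remaining arguments are passed through to da fetch unchanged.
        da fetch "$@" --as "$name"
    fi
}
```

Usage: `ensure_snapshot cz_recent web_sessions_example --select event_date,session_id --where "country_code = 'CZ'"`.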

When NOT to use da fetch

| Scenario | Use instead |
| --- | --- |
| Single aggregate on remote BASE TABLE (SELECT COUNT(*)) | da query --remote "SELECT COUNT(*) FROM web_sessions_example". Cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160), but the BQ jobs API can't push WHERE/COUNT into the view body. The cost guardrail (default 5 GiB) rejects expensive scans with 400 remote_scan_too_large and a da fetch suggestion. If rejected, pivot to da fetch <id> --where '<predicate>'. |
| Throwaway exploration with raw BQ syntax | da query --remote "SELECT … FROM <registered_id>". Direct bq."<dataset>"."<table>" paths are now registry-gated (403 bq_path_not_registered if not registered). Register first or use the catalog id. |
| Cross-table JOIN with both sides remote | Use da fetch for one side + da query --remote for the other; a full cross-remote JOIN needs design (see #101) |
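
The two rejection cases in the table can be handled in one place. The wrapper below is a sketch with a loud assumption: it expects the CLI to exit non-zero and to include the error code (`remote_scan_too_large` or `bq_path_not_registered`) somewhere in its output, which the source does not spell out. `remote_or_hint` is a hypothetical helper name.

```shell
# Sketch: try a one-shot remote query; on a known rejection, print the
# recommended next step instead of the raw error.
# Assumption: da exits non-zero and echoes the error code on failure.
remote_or_hint() {
    local sql="$1" output
    if output="$(da query --remote "$sql" 2>&1)"; then
        printf '%s\n' "$output"
        return 0
    fi
    case "$output" in
        *remote_scan_too_large*)
            echo "scan cap hit: narrow it with da fetch <id> --where '<predicate>'" >&2 ;;
        *bq_path_not_registered*)
            echo "path not registered: use the catalog id or register the table" >&2 ;;
        *)
            printf '%s\n' "$output" >&2 ;;
    esac
    return 1
}
```

Usage: `remote_or_hint "SELECT COUNT(*) FROM web_sessions_example"`; a non-zero return means the query did not run and the hint on stderr says where to go next.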

When the table you need isn't in da catalog

The catalog reads from system.duckdb::table_registry — entries land there only via admin registration, not auto-discovery. If da catalog doesn't show what the user is asking about:

  1. Tell the user the table isn't registered
  2. Hand off to an admin (or, if you have admin role yourself, follow the agnes-table-registration skill)
  3. Don't da query --remote your way around it — the catalog gap means the registry doesn't track this dataset, RBAC can't gate it, and quotas don't apply

Protocol summary

  1. Discover: da catalog, da schema, da describe
  2. Check query_mode: local (direct) or remote (fetch or --remote)?
  3. For remote: --estimate first, then da fetch with --select + --where
  4. Snapshot name: descriptive (cz_recent), reuse across questions
  5. Query: da query against snapshot; DuckDB SQL syntax
  6. Cleanup: da snapshot drop when done; da disk-info to check size