13 Devin findings across 10 files:

🔴 Critical:
- app/api/v2_catalog.py:42 - `_fetch_hint` returns `da fetch` in /api/v2/catalog responses (user-visible in every catalog list)
- cli/skills/agnes-data-querying.md - 11 stale `da fetch`/`da sync` refs in the bundled skill markdown
- config/claude_md_template.txt:38 - referenced `agnes pull --docs-only` flag that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/--dry-run)

🟡 Important:
- app/api/admin.py:252 - `da fetch` in bq_max_scan_bytes hint
- cli/commands/auth.py:119 - `da sync` in import-token docstring (--help text)
- cli/commands/tokens.py:48 - "Export it so `da` can use it" prose
- ARCHITECTURE.md - 4 stale rows in CLI commands table
- README.md - stale paragraphs for analysts (da sync, da analyst setup)

🚩 Substantive observations addressed:
- app/api/query.py:249,302,489 - server-side error/help strings still said `da sync`/`da fetch` (returned in API responses to clients)
- cli/commands/snapshot.py:235-241 - DuckDB existence guard incorrectly blocked `--estimate` (server-side dry-run that never opens local DB). Added test ensuring estimate path skips the guard.

Skipped (intentionally historical):
- app/api/admin.py:2377,2429,2437 - historical comments describing past manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.
| name | description |
|---|---|
| agnes-data-querying | Use when querying any data in Agnes — discovery first, estimate before fetch, materialize scoped subsets locally |
Querying Agnes data
When asked about ANY data in Agnes, follow this protocol: discover → choose tool → fetch (with estimate) → query locally → clean up.
Discovery first
Before writing ANY query, understand what's available:
agnes catalog --json | jq <filter> # know what's available
agnes schema <table> # learn columns + types
agnes describe <table> -n 5 # see real values for shape
Never write SELECT * FROM <table> blindly. For local-mode tables it's wasteful; for remote-mode tables it can blow up at 225M+ rows.
Choose the right tool
Tables in agnes catalog have a query_mode:
| Mode | Means | How to query |
|---|---|---|
| `local` | parquet synced on laptop | `agnes query "SELECT …"` directly |
| `remote` (BigQuery) | parquet NOT on laptop | `agnes snapshot create` subset → snapshot, OR `agnes query --remote` one-shot |
For remote tables, you MUST either:
- `agnes snapshot create` a filtered subset → query the local snapshot (preferred), OR
- `agnes query --remote` for one-shot server-side execution, OR
- `agnes query --register-bq` for hybrid joins (rare; see docs)
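The choice above can be sketched as a small shell helper. `pick_tool` is a hypothetical name, not an agnes command; it only maps a table's query_mode (as shown by `agnes catalog`) to the approach this guide prefers, and it deliberately ignores the rarer `--remote` and `--register-bq` paths.

```shell
# Hypothetical helper (not part of the agnes CLI): map a table's
# query_mode to the command family recommended in this guide.
pick_tool() {
  case "$1" in
    local)  echo 'agnes query' ;;            # parquet already on the laptop
    remote) echo 'agnes snapshot create' ;;  # materialize a scoped subset first
    *)      echo "unknown query_mode: $1" >&2; return 1 ;;
  esac
}

pick_tool remote
```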
The agnes snapshot create workflow (preferred for remote tables)
1. Estimate first
Always estimate before fetching:
agnes snapshot create web_sessions_example \
--select event_date,country_code,session_id \
--where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND country_code = 'CZ'" \
--estimate
Output tells you scan cost, expected rows, and local bytes — so you know if it's reasonable.
2. If reasonable, fetch to snapshot
agnes snapshot create web_sessions_example \
--select event_date,country_code,session_id \
--where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND country_code = 'CZ'" \
--as cz_recent
3. Query the local snapshot
agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
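The three steps can be chained so the estimate is never skipped. The wrapper below is a minimal sketch, not part of agnes: `snapshot_checked` and the `AGNES_BIN` variable are names assumed for this example, and only the flags shown in the steps above are used.

```shell
# AGNES_BIN is an assumption of this sketch: it lets the flow be tried
# without the real CLI (for example AGNES_BIN=echo).
AGNES_BIN="${AGNES_BIN:-agnes}"

# snapshot_checked <table> <columns> <where> <snapshot_name>
# Runs the --estimate dry run, asks for confirmation, then fetches.
snapshot_checked() {
  "$AGNES_BIN" snapshot create "$1" --select "$2" --where "$3" --estimate || return 1
  printf 'Proceed with fetch into "%s"? [y/N] ' "$4" >&2
  read -r answer || answer=n
  [ "$answer" = "y" ] || { echo 'fetch skipped' >&2; return 0; }
  "$AGNES_BIN" snapshot create "$1" --select "$2" --where "$3" --as "$4"
}
```

A call such as `snapshot_checked web_sessions_example event_date,country_code,session_id "country_code = 'CZ'" cz_recent` would then print the estimate and fetch only after a `y` answer.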
Heuristics for agnes snapshot create
| Requirement | Why |
|---|---|
| Always `--select` specific columns | Avoid implicit SELECT * on remote (expensive) |
| Always `--where` for remote tables | Otherwise add `--limit` to keep result bounded |
| Always `--estimate` first if unsure | Partition/clustering metadata + shape matters; dry runs are free |
| Reuse snapshots across questions | agnes snapshot list before fetching — existing snapshot? Skip the fetch |
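The first two heuristics are mechanical enough to check before a command is run. The function below is a hypothetical pre-flight check, not something agnes ships; it assumes flags are passed as separate arguments (`--select cols`, not `--select=cols`).

```shell
# Hypothetical guard (not part of agnes): reject a remote
# `snapshot create` argument list that has no --select, or that has
# neither --where nor --limit to bound the result.
check_snapshot_args() {
  has_select=0
  has_bound=0
  for arg in "$@"; do
    case "$arg" in
      --select)        has_select=1 ;;
      --where|--limit) has_bound=1 ;;
    esac
  done
  [ "$has_select" -eq 1 ] || { echo 'missing --select: implicit SELECT * on remote' >&2; return 1; }
  [ "$has_bound" -eq 1 ]  || { echo 'missing --where or --limit: unbounded result' >&2; return 1; }
}
```

For example, `check_snapshot_args web_sessions_example --select session_id --limit 100` succeeds, while the same call without `--limit` or `--where` fails with a message on stderr.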
BigQuery SQL flavor for --where
For source_type=bigquery (per agnes catalog), use BigQuery SQL syntax:
| Syntax | Example |
|---|---|
| Date literal | DATE '2026-01-01' (NOT '2026-01-01'::date) |
| Timestamp literal | TIMESTAMP '2026-01-01 00:00:00 UTC' |
| Now | CURRENT_DATE(), CURRENT_TIMESTAMP() |
| Date arithmetic | DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) |
| Regex | REGEXP_CONTAINS(col, r'pattern') (raw string!) |
| NULL check | col IS NOT NULL (standard) |
| Cast | CAST(x AS INT64) (NOT INT) |
For source_type=keboola / source_type=jira (local), use DuckDB SQL in your agnes query calls — there's no --where on local since fetch is implicit.
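BigQuery predicates mix single quotes, raw strings, and `$` anchors, all of which the shell can mangle. One way to keep them intact, sketched here, is a here-document with a quoted delimiter. `page_path` is a hypothetical column used only for illustration; `event_date` and `session_id` come from the examples above.

```shell
# The quoted delimiter ('SQL') disables shell expansion, so the regex's
# `$` and `\d` and the SQL single quotes reach BigQuery unchanged.
where=$(cat <<'SQL'
event_date >= DATE '2026-01-01'
  AND REGEXP_CONTAINS(page_path, r'^/checkout/\d+$')
  AND session_id IS NOT NULL
SQL
)

printf '%s\n' "$where"
```

The variable can then be passed as `--where "$where"`; the double quotes keep the multi-line predicate together as one argument.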
Snapshot hygiene
- Reuse snapshots across questions in the same conversation
- Use descriptive names: `cz_recent`, `orders_q1_us`, `sessions_today`
- Drop with `agnes snapshot drop <name>` when done with a topic
- Check total cache size with `agnes disk-info`
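Reuse can be checked before every fetch. The sketch below assumes that `agnes snapshot list` prints snapshot names as plain text; the real output format is not documented here, so the match may need adjusting. `snapshot_exists` and `AGNES_BIN` are hypothetical names introduced for this example.

```shell
# AGNES_BIN is an assumption of this sketch so the check can be tried
# against a stand-in for the real CLI.
AGNES_BIN="${AGNES_BIN:-agnes}"

# snapshot_exists <name>: succeeds if <name> appears as a whole word in
# the listing. ASSUMPTION: names are printed as plain text.
snapshot_exists() {
  "$AGNES_BIN" snapshot list | grep -qw -- "$1"
}

# Usage idea: fetch only when the snapshot is not there yet.
# snapshot_exists cz_recent || agnes snapshot create ... --as cz_recent
```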
When NOT to use agnes snapshot create
| Scenario | Use instead |
|---|---|
| Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"`: cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 remote_scan_too_large with agnes snapshot create suggestion. Pivot to agnes snapshot create <id> --where '<predicate>' if rejected. |
| Throwaway exploration with raw BQ syntax | agnes query --remote "SELECT … FROM <registered_id>" — direct bq."<dataset>"."<table>" paths are now registry-gated (403 bq_path_not_registered if not registered). Register first or use the catalog id. |
| Cross-table JOIN with both remote | Use agnes snapshot create for one side + agnes query --remote for the other; full cross-remote JOIN needs design (see #101) |
When the table you need isn't in agnes catalog
The catalog reads from system.duckdb::table_registry — entries land there only via admin registration, not auto-discovery. If agnes catalog doesn't show what the user is asking about:
- Tell the user the table isn't registered
- Hand off to an admin (or, if you have admin role yourself, follow the agnes-table-registration skill)
- Don't `agnes query --remote` your way around it: the catalog gap means the registry doesn't track this dataset, RBAC can't gate it, and quotas don't apply
Protocol summary
- Discover: `agnes catalog`, `agnes schema`, `agnes describe`
- Check query_mode: local (direct) or remote (fetch or `--remote`)?
- For remote: `--estimate` first, then `agnes snapshot create` with `--select` + `--where`
- Snapshot name: descriptive (`cz_recent`), reuse across questions
- Query: `agnes query` against snapshot; DuckDB SQL syntax
- Cleanup: `agnes snapshot drop` when done; `agnes disk-info` to check size