ZdenekSrotyr 3d58768143 fix: address Devin Review findings — incomplete renames + estimate guard

13 Devin findings across 10 files:

🔴 Critical:
- app/api/v2_catalog.py:42 — `_fetch_hint` returns `da fetch` in /api/v2/catalog
  responses (user-visible in every catalog list)
- cli/skills/agnes-data-querying.md — 11 stale `da fetch`/`da sync` refs in the
  bundled skill markdown
- config/claude_md_template.txt:38 — referenced `agnes pull --docs-only` flag
  that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/
  --dry-run)

🟡 Important:
- app/api/admin.py:252 — `da fetch` in bq_max_scan_bytes hint
- cli/commands/auth.py:119 — `da sync` in import-token docstring (--help text)
- cli/commands/tokens.py:48 — "Export it so `da` can use it" prose
- ARCHITECTURE.md — 4 stale rows in CLI commands table
- README.md — stale paragraphs for analysts (da sync, da analyst setup)

🚩 Substantive observations addressed:
- app/api/query.py:249,302,489 — server-side error/help strings still said
  `da sync`/`da fetch` (returned in API responses to clients)
- cli/commands/snapshot.py:235-241 — DuckDB existence guard incorrectly
  blocked `--estimate` (server-side dry-run that never opens local DB).
  Added test ensuring estimate path skips the guard.

Skipped (intentionally historical):
- app/api/admin.py:2377,2429,2437 — historical comments describing past
  manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.

2026-05-04 20:05:06 +02:00

5.7 KiB


name	description
agnes-data-querying	Use when querying any data in Agnes — discovery first, estimate before fetch, materialize scoped subsets locally

Querying Agnes data

When asked about ANY data in Agnes, follow this protocol: discover → choose tool → fetch (with estimate) → query locally → clean up.

Discovery first

Before writing ANY query, understand what's available:

agnes catalog --json | jq <filter>     # know what's available
agnes schema <table>                    # learn columns + types
agnes describe <table> -n 5             # see real values for shape
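The per-table part of discovery can be wrapped in one helper so that both checks always run together. This is a minimal sketch: the function name discover_table is hypothetical, and it assumes the two commands exit non-zero on failure, which this document does not confirm.

```sh
# Hypothetical helper: inspect one table before writing any query.
# Assumption: `agnes schema` and `agnes describe` exit non-zero on failure.
discover_table() {
    table="$1"
    if [ -z "$table" ]; then
        echo "usage: discover_table <table>" >&2
        return 2
    fi
    agnes schema "$table" || return 1    # columns + types
    agnes describe "$table" -n 5         # five sample rows to see the shape
}
```

Calling discover_table web_sessions_example prints the schema followed by five sample rows, and stops early if the schema lookup fails.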

Never write SELECT * FROM <table> blindly. For local-mode tables it's wasteful; for remote-mode tables it can mean scanning 225M+ rows.

Choose the right tool

Tables in agnes catalog have a query_mode:

Mode	Means	How to query
`local`	parquet synced on laptop	`agnes query "SELECT …"` directly
`remote` (BigQuery)	parquet NOT on laptop	`agnes snapshot create` subset → snapshot, OR `agnes query --remote` one-shot

For remote tables, you MUST use one of the following:

agnes snapshot create a filtered subset → query the local snapshot (preferred), OR
agnes query --remote for one-shot server-side execution, OR
agnes query --register-bq for hybrid joins (rare; see docs)
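The mode-to-tool mapping can be expressed as a small lookup. The sketch below only prints the recommended first command for a given query_mode; suggest_approach is a hypothetical name, and the placeholders in angle brackets are left for the caller to fill in.

```sh
# Hypothetical helper: map a catalog query_mode to the recommended first command.
# It prints a command template; it does not run anything.
suggest_approach() {
    case "$1" in
        local)
            echo 'agnes query "SELECT ..."' ;;
        remote)
            echo 'agnes snapshot create <table> --select <cols> --where <predicate> --estimate' ;;
        *)
            echo "unknown query_mode: $1" >&2
            return 1 ;;
    esac
}
```

For remote tables the helper deliberately suggests the estimate step rather than the fetch itself, matching the preferred path described here.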

The `agnes snapshot create` workflow (preferred for remote tables)

1. Estimate first

Always estimate before fetching:

agnes snapshot create web_sessions_example \
    --select event_date,country_code,session_id \
    --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 
             AND country_code = 'CZ'" \
    --estimate

Output tells you scan cost, expected rows, and local bytes — so you know if it's reasonable.

2. If reasonable, fetch to snapshot

agnes snapshot create web_sessions_example \
    --select event_date,country_code,session_id \
    --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) 
             AND country_code = 'CZ'" \
    --as cz_recent

3. Query the local snapshot

agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
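Steps 1 and 2 can be combined so that the fetch never runs without a preceding estimate. The following is a sketch only: estimate_then_fetch is a hypothetical wrapper, it uses just the flags shown in the steps above, and it assumes the estimate command exits non-zero on failure.

```sh
# Hypothetical wrapper: estimate a remote fetch, confirm, then materialize it.
# Usage: estimate_then_fetch <table> <columns> <where> <snapshot_name>
estimate_then_fetch() {
    if [ "$#" -ne 4 ]; then
        echo "usage: estimate_then_fetch <table> <columns> <where> <snapshot_name>" >&2
        return 2
    fi
    table="$1"
    cols="$2"
    where="$3"
    name="$4"

    # Step 1: server-side dry run; nothing is downloaded yet.
    agnes snapshot create "$table" --select "$cols" --where "$where" --estimate || return 1

    # Step 2: fetch only after an explicit "y"; anything else skips the fetch.
    printf 'Fetch into snapshot "%s"? [y/N] ' "$name"
    read -r answer
    case "$answer" in
        y|Y) agnes snapshot create "$table" --select "$cols" --where "$where" --as "$name" ;;
        *)   echo "skipped" ;;
    esac
}
```

For example, estimate_then_fetch web_sessions_example event_date,country_code,session_id "country_code = 'CZ'" cz_recent shows the estimate and then waits for confirmation before creating cz_recent.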

Heuristics for `agnes snapshot create`

Requirement	Why
Always `--select` specific columns	Avoid implicit `SELECT *` on remote (expensive)
Always `--where` for remote tables	Keeps the result bounded; if no filter applies, add `--limit` instead
Always `--estimate` first if unsure	Partition/clustering metadata and result shape determine cost; dry runs are free
Reuse snapshots across questions	`agnes snapshot list` before fetching — existing snapshot? Skip the fetch

BigQuery SQL flavor for `--where`

For source_type=bigquery (per agnes catalog), use BigQuery SQL syntax:

Syntax	Example
Date literal	`DATE '2026-01-01'` (NOT `'2026-01-01'::date`)
Timestamp literal	`TIMESTAMP '2026-01-01 00:00:00 UTC'`
Now	`CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
Date arithmetic	`DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
Regex	`REGEXP_CONTAINS(col, r'pattern')` (raw string!)
NULL check	`col IS NOT NULL` (standard)
Cast	`CAST(x AS INT64)` (NOT `INT`)

For source_type=keboola / source_type=jira (local), use DuckDB SQL in your agnes query calls. Local tables take no --where because they are synced implicitly, with no explicit fetch step.
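Two small helpers can keep --where predicates in BigQuery syntax. This is a sketch under stated assumptions: both function names are hypothetical, and the cast check is a plain substring test for `::`, not a SQL parser.

```sh
# Hypothetical helper: build a BigQuery "last N days" predicate for --where.
bq_last_n_days() {
    printf "%s >= DATE_SUB(CURRENT_DATE(), INTERVAL %d DAY)\n" "$1" "$2"
}

# Hypothetical check: succeed if the predicate contains a DuckDB/Postgres-style
# `::` cast, which BigQuery SQL does not support.
has_duckdb_cast() {
    case "$1" in
        *::*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

A predicate such as "$(bq_last_n_days event_date 30) AND country_code = 'CZ'" can then be passed to --where, and has_duckdb_cast can guard against a stray `'2026-01-01'::date` before the estimate is requested.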

Snapshot hygiene

Reuse snapshots across questions in the same conversation
Use descriptive names: cz_recent, orders_q1_us, sessions_today
Drop with agnes snapshot drop <name> when done with a topic
Check total cache size with agnes disk-info
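The cleanup step can be scripted once a topic is finished. The sketch below drops each named snapshot and then reports cache size; cleanup_snapshots is a hypothetical name, and it assumes `agnes snapshot drop` exits non-zero when a drop fails.

```sh
# Hypothetical helper: drop the named snapshots, then show remaining cache size.
# Usage: cleanup_snapshots <name> [<name> ...]
cleanup_snapshots() {
    for name in "$@"; do
        # Keep going if one drop fails so the remaining snapshots are still removed.
        agnes snapshot drop "$name" || echo "could not drop $name" >&2
    done
    agnes disk-info
}
```

For example, cleanup_snapshots cz_recent orders_q1_us removes both snapshots and finishes with the disk-info report.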

When NOT to use `agnes snapshot create`

Scenario	Use instead
Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`)	`agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ)
Single aggregate on remote VIEW/MATERIALIZED_VIEW	Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to `agnes snapshot create <id> --where '<predicate>'` if rejected.
Throwaway exploration with raw BQ syntax	`agnes query --remote "SELECT … FROM <registered_id>"` — direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id.
Cross-table JOIN where both tables are remote	Use `agnes snapshot create` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101)
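The rejection cases in the table can be handled by a thin wrapper around `agnes query --remote`. This is a sketch with loud assumptions: run_remote is a hypothetical name, and it assumes the CLI exits non-zero on a rejected query and includes the error code (remote_scan_too_large or bq_path_not_registered) in its output. Neither behavior is confirmed by this document.

```sh
# Hypothetical wrapper: run a one-shot remote query and translate known rejections.
# Assumption: the CLI exits non-zero and prints the server's error code on failure.
run_remote() {
    sql="$1"
    if out=$(agnes query --remote "$sql" 2>&1); then
        printf '%s\n' "$out"
        return 0
    fi
    case "$out" in
        *remote_scan_too_large*)
            echo "Scan exceeds the cost guardrail; narrow it with: agnes snapshot create <id> --where '<predicate>'" >&2 ;;
        *bq_path_not_registered*)
            echo "Path is not registered; use the catalog id or ask an admin to register it" >&2 ;;
        *)
            printf '%s\n' "$out" >&2 ;;
    esac
    return 1
}
```

On success the query output passes through unchanged; on a known rejection the caller gets the next step instead of a raw error.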

When the table you need isn't in `agnes catalog`

The catalog reads from system.duckdb::table_registry — entries land there only via admin registration, not auto-discovery. If agnes catalog doesn't show what the user is asking about:

Tell the user the table isn't registered
Hand off to an admin (or, if you have admin role yourself, follow the agnes-table-registration skill)
Don't agnes query --remote your way around it — the catalog gap means the registry doesn't track this dataset, RBAC can't gate it, and quotas don't apply
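A guard can enforce the "stop and hand off" rule before any query is attempted. The sketch below is deliberately crude: require_registered is a hypothetical name, and it assumes the catalog JSON contains each table id as a quoted string. It does a fixed-string match rather than parsing the JSON, so a description or column that contains the same quoted name would also match.

```sh
# Hypothetical guard: fail unless the table id appears in the catalog output.
# Assumption: `agnes catalog --json` lists each table id as a quoted JSON string.
require_registered() {
    table="$1"
    if agnes catalog --json | grep -Fq "\"$table\""; then
        return 0
    fi
    echo "Table '$table' is not registered in the catalog; ask an admin to register it." >&2
    return 1
}
```

A typical use is require_registered web_sessions_example && agnes schema web_sessions_example, so that an unregistered table ends in a hand-off message rather than a workaround.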

Protocol summary

Discover: agnes catalog, agnes schema, agnes describe
Check query_mode: local (query directly) or remote (snapshot or --remote)?
For remote: --estimate first, then agnes snapshot create with --select + --where
Snapshot name: descriptive (cz_recent), reuse across questions
Query: agnes query against snapshot; DuckDB SQL syntax
Cleanup: agnes snapshot drop when done; agnes disk-info to check size
