agnes-the-ai-analyst/docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md

# Claude-Driven Fetch Primitives + Discovery + Agent Rails — Design

> **Goal:** Replace the broken "wrap a BQ view in a DuckDB master view" approach (issue #101) with a clean primitives-based model where Claude (the LLM agent) plans the work, and Agnes provides discovery + scoped fetch + local query primitives. No client-side SQL parsing. No GCP creds on the analyst laptop.

**Status:** Design — awaiting code review and implementation plan.

**Author:** ZS (with the in-house Claude agent)

**Related issues:** #101 (BQ view-wrapping doesn't push down outer queries), #91 (admin server-config), #96 (project_id validation, already shipped), #98 (token cache, already shipped)

---

## 1. Motivation

The current BigQuery view pipeline (shipped in branch `zs/test-bq-e2e`, PR #102) wraps each registered BQ view as:

```sql
CREATE VIEW "web_sessions_example" AS
  SELECT * FROM bigquery_query('proj', 'SELECT * FROM `proj.ds.web_sessions_example`')
```

This is correct in principle, but **fails at query time** for any non-trivial view:

```sql
SELECT COUNT(*) FROM web_sessions_example
-- DuckDB rewrites to:
-- SELECT COUNT(*) FROM (SELECT * FROM bigquery_query(...))
-- BigQuery sees the inner SELECT * as the literal job and tries to materialize 225M rows.
-- → "Response too large to return"
```

DuckDB's optimizer cannot push the outer `COUNT(*)` / `WHERE` / `LIMIT` into the opaque `bigquery_query()` table function. The wrap is therefore a near-zero-utility abstraction for any BQ view of meaningful size.

We considered four mitigations (issue #101 lists them: detect-attach for views, predicate templates, pre-materialize to BQ tables, drop-the-wrap). None of them is fully satisfying as a closed system, because **the agent (Claude) is already the smart planner in the loop**. The right answer is to expose primitive operations Claude can compose, with strong railsy in CLAUDE.md, instead of trying to make DuckDB look transparent through the wrong abstraction.

## 2. Architecture

### 2.1 Two-tier query model (unchanged)

```
┌─ analyst laptop ─────────────────┐    ┌─ Agnes server ───────────┐    ┌─ BigQuery
│                                  │    │                          │    │
│  Claude (agent) ── da CLI ──┐    │    │  FastAPI                 │    │
│                              │   │    │   ├─ /api/v2/catalog     │    │
│                              │   │    │   ├─ /api/v2/schema      │    │
│              ┌───────────────┴─┐ │    │   ├─ /api/v2/sample      │    │
│              ▼                 │ │    │   ├─ /api/v2/scan        │ ──►│
│       local DuckDB             │ │    │   └─ /api/v2/scan/estimate│   │
│       ~/agnes-data/.../        │ │    │                          │    │
│       user/duckdb/             │ │    │  server DuckDB           │    │
│       analytics.duckdb         │ │    │   + BQ secret            │    │
│         + parquet views        │ │    │   + RBAC                 │    │
│         + snapshot views ◄─────┼─┘    │   + safelist             │    │
│                                  │    │                          │    │
└──────────────────────────────────┘    └──────────────────────────┘    └─
```

Local DuckDB stays the analyst's interactive SQL surface. Server-side DuckDB is the BQ entrypoint — secrets stay there. The two are joined by **fetch operations** that materialize filtered subsets onto the laptop as DuckDB views over local parquet snapshots.

### 2.2 What changes vs today

- **Drop the `bigquery_query()` wrap view** in `connectors/bigquery/extractor.py`. BQ views still get registered in `_meta` for catalog purposes, but no master view is created in `analytics.duckdb`.
- **Add server endpoints** for catalog / schema / sample / scan / scan-estimate.
- **Add CLI primitives** for fetch + snapshot management + discovery.
- **Add CLAUDE.md instructions** that teach the agent the workflow.
- **Add a standalone skill** so the agent rails load automatically when working with Agnes.

`/api/query` and `/api/query/hybrid` stay; they remain useful for one-shot server-side aggregations and existing `da query --remote` flows.

## 3. Server endpoints

### 3.0 Identifier conventions (applies to all v2 endpoints)

`table_id` is the **registry primary key** (`table_registry.id`) verbatim — lowercase ASCII, alphanumeric + underscore, ≤64 chars, validated by `src/sql_safe.py::validate_identifier`. The display name (`table_registry.name`) may differ in case but is NOT a query key. CLI commands accept `table_id` only. The registry `register-table` endpoint already lowercases id at insert time, which is the canonical normalization point.

### 3.1 `GET /api/v2/catalog`

Returns the user-visible table catalog. Filtered by RBAC (`can_access_table`, table-grain). The user must have an explicit `dataset_permissions` row OR the table must be `is_public=true` OR the user must be `admin`.

Response shape:

```json
{
  "tables": [
    {
      "id": "web_sessions_example",
      "name": "web_sessions_example",
      "description": "Session landings event view",
      "source_type": "bigquery",
      "query_mode": "remote",
      "sql_flavor": "bigquery",
      "where_examples": [
        "event_date > DATE '2026-01-01'",
        "country_code = 'CZ' AND platform = 'web'"
      ],
      "fetch_via": "da fetch web_sessions_example --select <cols> --where '<BQ predicate>' --limit <N>",
      "rough_size_hint": null
    },
    {
      "id": "orders",
      "name": "orders",
      "source_type": "keboola",
      "query_mode": "local",
      "sql_flavor": "duckdb",
      "fetch_via": "already local — query directly via `da query`",
      "rough_size_hint": "1.2k rows / 180 KB"
    }
  ],
  "server_time": "2026-04-27T17:30:00Z"
}
```

Cached server-side per user (TTL 5 min) since the catalog rarely changes mid-session.

### 3.2 `GET /api/v2/schema/{table_id}`

Returns column metadata + BQ flavor hints (when applicable).

Response shape:

```json
{
  "table_id": "web_sessions_example",
  "source_type": "bigquery",
  "sql_flavor": "bigquery",
  "columns": [
    {"name": "event_date", "type": "DATE", "nullable": false, "description": "partition column"},
    {"name": "session_id", "type": "STRING", "nullable": false},
    {"name": "country_code", "type": "STRING", "nullable": true, "description": "ISO 3166-1 alpha-2"}
  ],
  "partition_by": "event_date",
  "clustered_by": ["country_code"],
  "where_dialect_hints": {
    "date_literal": "DATE '2026-01-01'",
    "timestamp_literal": "TIMESTAMP '2026-01-01 00:00:00 UTC'",
    "interval_subtract": "DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
    "regex": "REGEXP_CONTAINS(field, r'pattern')",
    "cast": "CAST(x AS INT64)"
  }
}
```

Source for BQ tables: `bigquery_query()` against `INFORMATION_SCHEMA.COLUMNS` + `INFORMATION_SCHEMA.TABLE_OPTIONS` + dataset query. No data scan, sub-second.

Cached server-side per `table_id` (TTL 1 h, manual invalidate via `da catalog --refresh`).

### 3.3 `GET /api/v2/sample/{table_id}?n=5`

Returns N sample rows (default 5, max 100). For BQ: `bigquery_query('proj', 'SELECT * FROM ds.t LIMIT N')`. For local: read from parquet directly.

Response shape:

```json
{
  "table_id": "web_sessions_example",
  "rows": [
    {"event_date": "2026-04-27", "session_id": "...", "country_code": "CZ"},
    ...
  ],
  "source": "bigquery"
}
```

Cached server-side TTL 1 h, invalidated on table re-extract or admin force-refresh.

### 3.4 `POST /api/v2/scan`

The work primitive. Takes a single-table filtered fetch request, returns Arrow IPC stream.

Request shape:

```json
{
  "table_id": "web_sessions_example",
  "select": ["event_date", "country_code", "session_id"],
  "where": "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND country_code = 'CZ'",
  "limit": 1000000,
  "order_by": ["event_date DESC"]
}
```

Response: Arrow IPC stream (HTTP body), schema in headers.

**RBAC scope (v1):** **table-grain parity with `/api/query`** — same `can_access_table(user, table_id, conn)` check. **No column-level or row-level access control in v1.** A user who can read the table can fetch any subset of columns and rows from it. Column/row-level RBAC is deferred to a follow-up; if added, it would extend `dataset_permissions` with `column_allowlist` and `row_predicate` fields and the validator would augment user-supplied `where` with a server-pinned predicate.

Server-side flow:

1. Auth: PAT or session → resolved user.
2. RBAC: `can_access_table(user, table_id)` — same gate as `/api/query`. 403 on deny.
3. Validate `where` with the focused validator in §3.7 (sqlglot-backed). Reject malformed → 400 with structured error.
4. Validate `select` columns: each must exist in the table's schema (cross-checked against cached schema endpoint result). 400 on unknown column.
5. Validate `limit` against `instance.yaml: api.scan.max_limit` (hard cap, default 10_000_000). 400 if exceeded.
6. Quota check (§3.8). 429 if exceeded.
7. Build target SQL:
   - For `source_type=bigquery`: `SELECT [select] FROM \`{project}.{dataset}.{source_table}\` WHERE [where] ORDER BY [order_by] LIMIT [limit]`. Pass to `bigquery_query()` with the metadata token (#98 cache helps).
   - For `source_type=keboola` / `source_type=jira`: query the local parquet via DuckDB.
8. Enforce **`max_result_bytes`** guard (`instance.yaml: api.scan.max_result_bytes`, default 2 GB). If the cumulative Arrow stream exceeds this, abort and return partial result with `X-Agnes-Truncated: true` header + warning log. Prevents a single fetch from OOMing the server worker.
9. **Stream Arrow IPC** back over HTTP. Server emits chunks as BQ delivers them; client buffers entire stream into a parquet file before exposing as DuckDB view (no streaming on the client side in v1 — see §7 deferred). Content-Type: `application/vnd.apache.arrow.stream`.
10. Append `audit_log` row per request (§10.1).

### 3.5 `POST /api/v2/scan/estimate`

Same request shape as `/api/v2/scan`, but doesn't actually run the query. Uses BQ's `dryRun: true` flag to get scan size without paying for it.

Response shape:

```json
{
  "table_id": "web_sessions_example",
  "estimated_scan_bytes": 4400000000,
  "estimated_result_rows": 245000,
  "estimated_result_bytes": 12000000,
  "bq_cost_estimate_usd": 0.022
}
```

`estimated_scan_bytes` comes directly from BQ dry-run. `estimated_result_rows` is rough — BQ doesn't provide it on dry runs, so we estimate from `bytes_processed × selectivity_factor`. `estimated_result_bytes` derives from `result_rows × avg_row_bytes_from_schema`.

For `source_type` other than BQ, return zero/unknown for cost fields.

### 3.6 Caching layer

Server uses an in-process LRU + TTL cache for catalog/schema/sample. Cache invalidation:
- `POST /api/admin/catalog/invalidate` — admin force-refresh
- Auto-invalidate on `table_registry` mutations (after `register-table` / `unregister-table`)
- TTL: catalog 5 min, schema 1 h, sample 1 h

### 3.7 Server-side WHERE validator (sqlglot)

A focused module: `app/api/where_validator.py`. The **load-bearing security perimeter** of `/api/v2/scan`. Targeting ~250 LOC + adversarial test corpus.

#### Parser

Parse with `sqlglot.parse_one(f"WHERE {predicate}", into=exp.Where, dialect="bigquery")`. Reject if parse fails.

#### Structural rejects

Walk AST and reject on any of:
- `exp.Subquery`, `exp.Select` — no nested SELECTs (prevents `WHERE x IN (SELECT ... FROM other_table)` exfiltration)
- Multiple statements (semicolon chaining)
- DDL/DML nodes: `Insert`, `Update`, `Delete`, `Drop`, `Truncate`, `Alter`, `Create`, `Copy`, `Merge`
- `exp.Column` references where the qualifier is anything other than the target `table_id` or unqualified
- Star expressions (`*`) outside aggregates
- Bytes/binary literals raw embedding
- Comments (`--` or `/* */`) — strip in pre-processing or reject

#### Function allow-list (v1, BigQuery dialect)

Allowed function categories. The list is the **explicit** v1 contract; expanding it requires a spec amendment.

| Category | Functions |
|----------|-----------|
| Comparison | `=`, `!=`, `<`, `<=`, `>`, `>=`, `IS NULL`, `IS NOT NULL`, `IN`, `NOT IN`, `BETWEEN`, `LIKE`, `NOT LIKE` |
| Boolean | `AND`, `OR`, `NOT`, `XOR` |
| Date/Time | `CURRENT_DATE`, `CURRENT_TIMESTAMP`, `CURRENT_TIME`, `DATE`, `DATETIME`, `TIMESTAMP`, `TIME`, `DATE_ADD`, `DATE_SUB`, `DATE_DIFF`, `DATE_TRUNC`, `EXTRACT`, `FORMAT_DATE`, `FORMAT_TIMESTAMP`, `PARSE_DATE`, `PARSE_TIMESTAMP`, `UNIX_SECONDS`, `UNIX_MILLIS` |
| String | `CONCAT`, `LENGTH`, `LOWER`, `UPPER`, `SUBSTR`, `SUBSTRING`, `TRIM`, `LTRIM`, `RTRIM`, `REPLACE`, `STARTS_WITH`, `ENDS_WITH`, `CONTAINS_SUBSTR`, `REGEXP_CONTAINS`, `REGEXP_EXTRACT`, `SAFE_CAST` |
| Math | `ABS`, `CEIL`, `FLOOR`, `ROUND`, `MOD`, `POWER`, `SQRT`, `LOG`, `LN`, `EXP`, `SIGN`, `GREATEST`, `LEAST` |
| Casts | `CAST` (target types: `INT64`, `FLOAT64`, `NUMERIC`, `STRING`, `BYTES`, `BOOL`, `DATE`, `DATETIME`, `TIMESTAMP`, `TIME`, `DECIMAL`, `BIGNUMERIC`) |
| Conditional | `IF`, `IFNULL`, `COALESCE`, `NULLIF`, `CASE` |

Any function not on this list is rejected with `unknown_function` error including the function name. We avoid:
- `EXTERNAL_QUERY` (data exfiltration)
- `SESSION_USER`, `CURRENT_USER`, `IS_MEMBER` (impersonation surface)
- `ML.*` (cost surprise — ML predictions are billed by row)
- `ARRAY_AGG`, `STRING_AGG` and all aggregates (predicate context, not aggregate context)
- User-defined functions and table-valued functions
- `ROW_NUMBER`, window functions (predicate context)
- BQ scripting (`BEGIN`, `LOOP`, etc.)

#### Identifier-path validation

Column references in BigQuery can be dotted (`record.subfield.leaf`) or indexed (`array[OFFSET(0)]`). The validator must:
- Walk every `exp.Column` reference
- For each path segment, validate against the cached schema (paths must be present in `INFORMATION_SCHEMA.COLUMNS` field-shape data, not just top-level columns)
- Reject array subscripts containing function calls (e.g. `array[OFFSET(SAFE_CAST(x AS INT64))]` — too clever, overrun risk)

#### Adversarial test corpus

Mandatory test cases the implementer must add (`tests/test_where_validator.py`):
- 20+ accepted predicates (typical analyst-written WHEREs across all function categories)
- 30+ rejected predicates with explicit rejection codes:
  - `nested_select`: `x IN (SELECT y FROM t)`
  - `multi_statement`: `x = 1; DROP TABLE t`
  - `ddl_in_predicate`: `x = (CREATE TABLE t (id INT))`
  - `external_query`: `x = EXTERNAL_QUERY('...')`
  - `unknown_function`: `x = OBSCURE_BUILTIN(y)`
  - `comment_inject`: `x = 1 -- AND y > 0`
  - `wildcard_expansion`: `* = 5`
  - `cross_table_ref`: `other_table.id = 1`
  - `bytes_literal_raw`: `x = b'\\x00...'`
  - And 20+ more permutations

This is the only place sqlglot lives in the codebase. Constrained, testable, single responsibility. **All decisions are explicit and listed**; no "trust sqlglot's defaults".

### 3.8 Quota architecture (v1: process-local)

`/api/v2/scan` quotas live in **process-local memory** for v1. This is a **deliberate trade-off** documented here:

- Per-user concurrent scan: in-memory dict keyed by user_id, value is `set[request_id]`. Default cap: 5. Configurable via `instance.yaml: api.scan.max_concurrent_per_user`.
- Per-user daily byte cap: same dict, value also tracks `bytes_today` + `last_reset_utc`. Reset at UTC midnight. Default: 50 GB. Configurable via `instance.yaml: api.scan.max_daily_bytes_per_user`.

**Multi-replica caveat:** if Agnes is deployed with N FastAPI replicas, each tracks quotas independently — effectively **N× the cap** is the enforced ceiling per user. **Document this in §9 risks and CHANGELOG.** A future v2 with horizontal scale must move quotas to durable storage (recommend: a `quota_state` row in `system.duckdb` mutated under `BEGIN; UPDATE … RETURNING; COMMIT;` per request — or shared Redis if Agnes ever takes a Redis dependency).

429 response shape:

```json
{
  "error": "quota_exceeded",
  "kind": "concurrent_scans" | "daily_bytes",
  "current": 5,
  "limit": 5,
  "retry_after_seconds": 0      // for daily_bytes: seconds until UTC midnight
}
```

CLI translates 429 to exit code 3 with a clear message (§10.3).

## 4. CLI commands

### 4.1 Discovery

```
da catalog [--json] [--refresh]
da schema <table_id> [--json]
da describe <table_id> [-n N] [--json]
```

`da catalog` lists tables in a human-readable table by default. With `--json`, emits the API response verbatim — Claude reads this to understand what's available.

`da schema` shows columns + types + BQ flavor hints (when applicable).

`da describe` = schema + sample rows in one shot.

Client-side cache at `~/agnes-data/user/cache/`:
- `catalog.json` (5 min TTL, invalidated on `da sync` and `--refresh`)
- `schema/<table_id>.json` (1 h TTL)
- `samples/<table_id>.json` (1 h TTL)

### 4.2 Fetch + snapshot management

```
da fetch <table> \
    [--select <cols>] \
    [--where <predicate>] \
    [--limit <N>] \
    [--order-by <cols>] \
    [--as <name>] \
    [--estimate] \
    [--no-estimate] \
    [--force]
```

Materializes a filtered subset locally as `~/agnes-data/user/snapshots/<name>.parquet`, registers `<name>` as a DuckDB view in `analytics.duckdb`, writes metadata to `~/agnes-data/user/snapshots/<name>.meta.json`.

**`--as <name>` semantics (no interactive prompts ever):**
- Default `<name>` is `<table_id>`.
- If snapshot `<name>` already exists: **fail with exit code 6** (`snapshot_exists`) and a clear message naming the existing snapshot's `fetched_at` / `rows`.
- `--force` overwrites unconditionally. No confirmation prompt; agents can't answer prompts reliably.
- `--no-confirm` is unnecessary — there are no prompts.

**Snapshot install is file-locked.** The write transaction (move parquet into place + `CREATE OR REPLACE VIEW` + write meta sidecar) acquires an exclusive `flock(2)` on `~/agnes-data/user/snapshots/.lock` for the duration. Concurrent `da fetch` invocations queue. Concurrent reads (`da query`) take a shared lock on the analytics.duckdb file via DuckDB's own concurrency model — they're not blocked by snapshot install (DuckDB allows concurrent readers, and `CREATE OR REPLACE VIEW` is metadata-only fast).

`--estimate` runs only the dry-run estimate, doesn't fetch. Prints scan bytes + result row/byte estimate + cost. Always shown before fetch unless `--no-estimate` is set.

```
da snapshot list [--json]            # name | rows | size | age | table_id | where
da snapshot refresh <name> [--where <new>]  # re-fetch with stored params
da snapshot drop <name>
da snapshot prune [--older-than 7d] [--larger-than 1g]
```

The metadata sidecar (`<name>.meta.json`) is the source of truth for `refresh`:

```json
{
  "name": "cz_recent",
  "table_id": "web_sessions_example",
  "select": ["event_date", "country_code", "session_id"],
  "where": "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND country_code = 'CZ'",
  "limit": 1000000,
  "order_by": null,
  "fetched_at": "2026-04-27T17:30:00Z",
  "effective_as_of": "2026-04-27T17:30:00Z",   // server eval time of CURRENT_DATE() etc.
  "rows": 245832,
  "bytes_local": 8400000,
  "estimated_scan_bytes_at_fetch": 4400000000,
  "result_hash_md5": "abc123..."                // for refresh diff detection
}
```

**Refresh staleness UX:**

`da snapshot refresh <name>` re-runs the stored fetch with the same `where`. Behavior:

1. WHERE may contain time-relative constructs (`CURRENT_DATE()`, `INTERVAL N DAY`). Server re-evaluates them at refresh time. The new sidecar gets a fresh `effective_as_of`.
2. After refresh, CLI prints a **diff summary**:
   ```
   Refreshed cz_recent
     rows:           245 832  →  248 401   (+2 569)
     bytes_local:    8.4 MB   →  8.5 MB
     effective_as_of: 2026-04-27 17:30 UTC  →  2026-04-28 09:00 UTC
     identical:      no
   ```
3. If `result_hash_md5` matches (rows + content didn't change), print `identical: yes` and skip the parquet swap.
4. If snapshot is older than `~/.agnes/config: snapshot_stale_warn_days` (default 7), `da query` prints a one-line warning when the snapshot is referenced: `WARN: snapshot 'cz_recent' is 12 days old; consider 'da snapshot refresh cz_recent'`.

### 4.3 Disk awareness

```
da disk-info [--json]
```

Output:

```
Snapshots dir:    ~/agnes-data/user/snapshots/
Used by Agnes:    2.4 GB across 7 snapshots
Free disk:        38.2 GB
Configured cap:   10 GB (~/.agnes/config: snapshot_quota_gb)
```

`snapshot_quota_gb` is a soft cap — `da fetch` warns if exceeded but doesn't hard-fail (analyst can override). `da snapshot prune --auto` honors the cap.

### 4.4 Existing commands stay

- `da query "..."` — local DuckDB query, fast, offline-capable. Works on local-mode tables and snapshots.
- `da query --remote "..."` — passthrough to `/api/query`. For one-shot aggregates, ad-hoc raw BQ-flavor SQL.
- `da sync` — refreshes local-mode parquets. Snapshot files don't get touched.

**v1 keeps `da query --remote` as-is.** A future rename to `da query-remote` (subcommand instead of flag, for clarity) is OUT OF SCOPE for this spec; track separately if desired.

`da catalog --refresh` clears the **client-side** cache only (forces next call to fetch fresh from server). It does NOT call the admin invalidate endpoint — that requires admin role (separate `da admin catalog-refresh` for admins).

## 5. Claude rails (CLAUDE.md + skill)

### 5.1 CLAUDE.md addendum

A new section in the repo's CLAUDE.md:

```markdown
## Querying Agnes data — agent rails

When asked about ANY data in Agnes, follow this protocol.

### Discovery first

Before writing ANY query against a table, run:

    da catalog --json | jq <filter>     # know what's available
    da schema <table>                   # learn columns + types
    da describe <table> -n 5            # see real values for shape

NEVER write `SELECT * FROM <table>` blindly. For local-mode tables it's
wasteful; for remote-mode tables it can blow up at 225M rows.

### Choose the right tool

Tables in `da catalog` have a `query_mode`:

- **`local`**: data is on the laptop as parquet (synced via `da sync`).
  Query directly with `da query "SELECT … FROM <table>"`.

- **`remote`** (typically BigQuery): the parquet does NOT exist on the laptop.
  You MUST either:
  1. **`da fetch`** a filtered subset → query the local snapshot, OR
  2. **`da query --remote`** for one-shot server-side execution, OR
  3. **`da query --register-bq`** for hybrid joins (rarely needed).

### `da fetch` workflow (preferred for remote tables)

    # 1. estimate first
    da fetch web_sessions_example \
        --select event_date,country_code,session_id \
        --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
                 AND country_code = 'CZ'" \
        --estimate
    # → "estimated_scan_bytes: 4.2 GB, result: ~250k rows, 12 MB locally"

    # 2. if reasonable, fetch
    da fetch web_sessions_example ... --as cz_recent

    # 3. query the local snapshot
    da query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"

### Heuristics for `da fetch`

- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
- ALWAYS run `--estimate` first when:
  - You're not sure of the data shape
  - The table has `partition_by` or `clustered_by` set (per `da schema`)
  - The fetch could plausibly exceed 1 GB local bytes
- Reuse `da snapshot list` before fetching — if a snapshot covers your
  query already, skip the fetch.

### BigQuery SQL flavor for `--where`

For `source_type=bigquery` (per `da catalog`):

- Date literal: `DATE '2026-01-01'` (NOT `'2026-01-01'::date`)
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
- NULL: `col IS NOT NULL` (standard)
- Cast: `CAST(x AS INT64)` (NOT `INT`)

For `source_type=keboola` / `source_type=jira` (local), use DuckDB SQL flavor
in your `da query` calls — there's no `--where` on local since fetch is implicit.

### Snapshot hygiene

- Reuse snapshots across questions in the same conversation.
- Use descriptive names: `cz_recent`, `orders_q1_us`, `sessions_today`.
- Drop with `da snapshot drop <name>` when done with a topic.
- `da disk-info` to see total cache size.

### When NOT to use `da fetch`

- Single aggregate on remote table (`SELECT COUNT(*) FROM remote`):
  use `da query --remote "SELECT COUNT(*) FROM web_sessions_example"`.
  No materialization needed; cheap.
- Throwaway exploration with raw BQ syntax: `da query --remote`.
- Cross-table JOIN with both tables remote: combine `da fetch` for one
  side + `da query --remote` for the other; full cross-remote JOIN
  requires more thought (see #101 for design space).
```

### 5.2 Skill file

Standalone skill `agnes-data-querying` at `skills/agnes-data-querying.md` (loadable via the superpowers skill mechanism), which auto-activates when the user is in an Agnes-flavored project and asks data questions. Contents mirror the CLAUDE.md addendum but framed as a runnable workflow.

The skill is short — under 200 lines — and has a quick reference table of common BQ syntax gotchas.

## 6. Migration

### 6.1 Drop the wrap view

`connectors/bigquery/extractor.py::init_extract` currently emits:

```sql
CREATE OR REPLACE VIEW "<table_name>" AS
  SELECT * FROM bigquery_query('<project>', 'SELECT * FROM `<project>.<dataset>.<source_table>`')
```

Change: **don't emit any wrap view for VIEW-type entities**. The `_meta` row still gets written (so the orchestrator catalog has a record), and `_remote_attach` still gets the BQ entry (so the master DB can query via the secret), but no master-side view exists.

For BASE TABLE entities, keep the existing direct-ref view template — Storage Read API handles those fine.

Result: `analytics.duckdb` only has master views for source-type=keboola / source-type=jira / BQ-base-tables. BQ views are **not** queryable directly through `da query --remote "SELECT * FROM web_sessions_example"`. They MUST be either fetched or queried via `bigquery_query()` explicitly.

### 6.2 Backwards compatibility

Existing PRs against `zs/test-bq-e2e` ship the wrap-view code. This design replaces that. The migration:

- One commit drops the wrap-view code path in the extractor.
- One commit removes the orchestrator's `_attach_remote_extensions` BQ-secret refresh in cases where no BQ-typed view exists (it's still needed for BASE TABLE refs).
- Tests updated.

`/api/sync/manifest` already filters out `query_mode='remote'` tables for `da sync` (Task 6/7). Snapshot views are not in the manifest — they're laptop-local only.

### 6.3 Data already on dev VM

The dev VM has `web_sessions_example` registered as a remote-mode view. Post-migration:
- `analytics.duckdb` won't have a master view for it (existing wrap view will be dropped on next orchestrator rebuild).
- Claude is expected to use `da fetch` instead.

User's existing test workflow: `da fetch web_sessions_example --where ...` → snapshot → `da query`.

## 7. Out of scope

These are real concerns but explicitly NOT addressed in this design:

- **Cross-remote JOINs**: A query joining two remote BQ views directly. Workaround: fetch one side as a snapshot, then `da query --remote` with `bigquery_query()` for the other side. Long-term: see #101 follow-up "predicate templates" or "hosted Postgres bridge" alternatives.
- **Streaming results**: `da fetch` materializes the full Arrow buffer before writing to disk. For multi-GB fetches this can pause for tens of seconds. Future optimization: chunked Arrow stream → parquet writer pipe.
- **Async fetches**: `da fetch` is synchronous. No background mode. If fetch times out (default 5 min), user must retry.
- **Cross-org BQ**: assume one BQ project per Agnes deployment. Multi-project fan-out is a separate spec.
- **Custom DuckDB extension** (option A from brainstorming): not pursued because the primitives-based approach delivers 80% of the UX at 10% of the engineering cost. Revisit if production pain demands it.

## 8. Effort estimate

| Component | Owner | Days |
|-----------|-------|------|
| `/api/v2/scan` endpoint + RBAC + quota wiring | server | 1 |
| WHERE validator (§3.7) + adversarial test corpus (50+ cases) | server | 2 |
| `/api/v2/scan/estimate` (BQ dryRun via `google.cloud.bigquery` client) | server | 1.5 |
| `/api/v2/catalog` + `/api/v2/schema` + `/api/v2/sample` + caching | server | 2 |
| Audit log shape + `audit_log` migration if needed | server | 0.5 |
| `da fetch` + snapshot metadata + file-locked install | client | 1.5 |
| `da snapshot list/refresh/drop/prune` + diff summary + stale warn | client | 1.5 |
| `da catalog/schema/describe/disk-info` (with SQL flavor info) | client | 1 |
| Arrow streaming server-side, parquet write client-side | shared | 1 |
| Client-side cache at `~/agnes-data/user/cache/` | client | 0.5 |
| Drop wrap-view code path + migrate existing tests | server | 0.5 |
| CLAUDE.md instructions + skill file (with BQ flavor table + recovery prompts) | docs | 1 |
| Tests — unit (validator, quotas, RBAC) + integration (snapshot lifecycle, real BQ) | shared | 3 |
| **Total** | | **~16.5** |

Realistic timelines:
- **Two developers in parallel:** 8-9 calendar days (server+CLI tracks).
- **One developer:** ~3 weeks.

The estimate **revised upward from 10.5** based on review feedback (validator alone is ~2 d not 1; tests ~3 d not 1.5; estimate dryRun is more involved than `bigquery_query()` can do directly — needs `google.cloud.bigquery` client path).

## 9. Risks & open questions

1. **WHERE validator coverage**: the v1 allow-list (§3.7) is finite; legitimate analyst predicates may be rejected on first cut. Mitigation: explicit `unknown_function` error names the function so the analyst (or Claude) immediately sees what to drop. Allow-list expanded in follow-ups based on production rejection logs.
2. **Snapshot refresh staleness**: `CURRENT_DATE()` re-evaluates at refresh time → data shifts. Fixed literals → refresh is a content no-op (caught by `result_hash_md5` comparison per §4.2). Documented in CLI output (`identical: yes/no`).
3. **BQ dry-run accuracy for scan estimate**: `totalBytesProcessed` is accurate. `estimated_result_rows` is heuristic — worst case under-estimate → user fetches more than expected → max_result_bytes guard truncates (§3.4 step 8) with `X-Agnes-Truncated` header.
4. **Multi-replica quota**: process-local quotas (§3.8) mean N replicas → effective N×cap per user. **Documented caveat for v1.** Single-replica deployments (today's default) unaffected. Horizontal scale upgrade path: durable counter in `system.duckdb`. Captured as a follow-up issue when scale demand emerges.
5. **Multi-conversation snapshots collision**: per §4.2 file-locked install + exit-code-6-on-exists semantics make this safe — concurrent `da fetch --as same_name` causes the second to fail-fast with a clear error rather than corrupt state.
6. **BREAKING change for `da query --remote`**: dropping the wrap view (§6.1) means `da query --remote "SELECT * FROM <bq_view>"` no longer works. Existing automation scripts may break. **Must be flagged as `**BREAKING**` in CHANGELOG** per the project's changelog discipline. Mitigation: optional `--legacy-wrap-views` flag in `connectors/bigquery/extractor.py` for one release cycle to ease rollout (operator-controlled via `instance.yaml: bigquery.legacy_wrap_views: true`). Document in §6.

## 10. Implementation contracts

These are the concrete artifacts the implementer must produce; spec requirements distilled into checkable shapes.

### 10.1 Audit log shape

Every `/api/v2/scan` and `/api/v2/scan/estimate` request appends one row to the existing `audit_log` table:

```
event_type     = 'fetch_scan' | 'fetch_estimate'
user_email     = <from session/PAT>
table_id       = <request.table_id>
event_data     = JSON: {
                   select: [...],
                   where_hash: md5(where || ''),  -- not full text, can be sensitive
                   limit: ...,
                   estimated_scan_bytes: ... | null,
                   actual_result_rows: ... | null,
                   actual_result_bytes: ... | null,
                   latency_ms: ...,
                   status: 'ok' | 'rejected' | 'quota_exceeded' | 'truncated' | 'error',
                   error_kind: 'validator' | 'rbac' | 'bq' | null
                 }
```

Why `where_hash` instead of full text: WHERE clauses can include sensitive constants (user emails, IDs). Hash + structure remains debuggable from the validator's per-request log lines if needed, without persistent disclosure.

### 10.2 CLI exit codes

`da fetch`, `da snapshot *`, `da catalog`, `da schema`, `da describe`, `da disk-info` follow this exit-code contract (used by the agent to branch):

| Code | Meaning |
|------|---------|
| 0 | Success |
| 2 | Validation failed (bad WHERE, unknown column, malformed args) |
| 3 | Quota exceeded (concurrent or daily) |
| 4 | Disk full (snapshot file write failed; soft quota only emits a warning, not an exit) |
| 5 | Server error (5xx; transient) |
| 6 | Snapshot already exists (use `--force`) |
| 7 | Auth failed (no PAT, expired) |
| 8 | RBAC denied (table not accessible) |
| 9 | Network unreachable (server down) |

Each non-zero exit also writes a structured error to stderr (§10.3).

### 10.3 Error UX

CLI error format on stderr (single line + optional next-step hint):

```
Error: <one-line summary>. <Hint about how to recover.>
```

Examples:

```
Error: WHERE validator rejected 'unknown_function'. Function 'OBSCURE_FN' not in v1 allow-list.
       See: da catalog --json | jq '.tables[].sql_flavor' for the supported dialect.

Error: quota exceeded (daily_bytes). Used 51.2 GB of 50 GB cap (resets at 00:00 UTC).
       Hint: 'da snapshot list' to find oversized snapshots, 'da snapshot prune'.

Error: snapshot 'cz_recent' already exists (fetched 2 days ago, 245k rows).
       Pass --force to overwrite, or 'da snapshot refresh cz_recent' to update in place.
```

Server `/api/v2/*` errors return JSON:

```json
{
  "error": "validator_rejected",
  "kind": "unknown_function",
  "details": { "function": "OBSCURE_FN" },
  "request_id": "..."
}
```

`request_id` lets server-side log correlation work without exposing internal stack traces to clients.

### 10.4 Server config knobs (`instance.yaml`)

New section:

```yaml
api:
  scan:
    max_limit: 10000000           # rows
    max_result_bytes: 2147483648  # 2 GB
    max_concurrent_per_user: 5
    max_daily_bytes_per_user: 53687091200  # 50 GB
    bq_cost_per_tb_usd: 5.00      # for cost estimate output
    request_timeout_seconds: 300
  catalog_cache_ttl_seconds: 300  # 5 min
  schema_cache_ttl_seconds: 3600  # 1 h
  sample_cache_ttl_seconds: 3600  # 1 h; admin force-refresh path per §3.6
```

All optional; defaults applied if missing. Documented in `config/instance.yaml.example`.

### 10.5 Client config (`~/.agnes/config`)

New keys:

```yaml
snapshot_quota_gb: 10
snapshot_stale_warn_days: 7
fetch_default_estimate: true   # whether `da fetch` runs --estimate first by default
```

### 10.6 Schema drift handling

When `da snapshot refresh <name>` is called and the upstream BQ schema has changed since the snapshot was taken:

- **New column added** in BQ (not in original `--select`): no-op for refresh (we only re-fetch what's in `select`).
- **Column from `--select` was removed** in BQ: refresh fails with exit code 2 (`schema_drift`) and message `Column 'X' no longer exists in <table_id>. Drop snapshot and re-fetch with updated --select.` — leave the existing snapshot file untouched.
- **Column type changed**: re-fetch proceeds; new parquet has new type. CLI prints `WARN: column 'X' type changed STRING → INT64; downstream queries may break.`

### 10.7 Telemetry / observability

Server emits Prometheus-compatible metrics (`/metrics` endpoint, gated by admin):

- `agnes_v2_scan_request_total{status,user}` counter — request count by status
- `agnes_v2_scan_bytes_total{user}` counter — bytes returned per user
- `agnes_v2_scan_latency_seconds{quantile}` summary — request latency
- `agnes_v2_scan_concurrent_gauge{user}` gauge — current concurrent scans

Wired into existing observability stack (TBD per deploy — minimum: log lines structured for grep).

## 11. Success criteria

CI-verifiable (must pass automatically on the PR):

- [ ] **CI** — All existing tests still green on `zs/test-bq-e2e`.
- [ ] **CI** — `tests/test_where_validator.py` 50+ adversarial cases pass (§3.7 corpus).
- [ ] **CI** — Quota state correctly enforced in unit tests (concurrent + daily byte cap, 429 shape per §3.8).
- [ ] **CI** — `da catalog --json` output is machine-readable and includes `sql_flavor` per table (output-shape test).
- [ ] **CI** — `da fetch --estimate` outputs both BQ scan bytes and local result bytes (output-shape test).
- [ ] **CI** — `da snapshot list/refresh/drop/prune` lifecycle round-trip test.
- [ ] **CI** — Exit codes per §10.2 verified for every documented failure mode.
- [ ] **CI** — Vendor-token scan on touched files: empty.

Manual gates (release-time, signed off by author):

- [ ] **Manual** — On the dev VM, Claude (with the new skill loaded) answers "show me web_sessions_example for last 30 days" and produces an aggregated result in <30 s without "Response too large" errors. Verify the agent followed `da catalog → da schema → da fetch → da query` rather than direct `da query --remote`.
- [ ] **Manual** — 3 different fresh Claude sessions (without explicit prompting) follow the discovery-first protocol when asked about Agnes data. (Manual replay; document transcripts in PR.)
- [ ] **Manual** — End-to-end demo on dev VM: full discover → estimate → fetch → query loop in <2 min wall time, recorded in the PR description.
- [ ] **Manual** — Audit log inspection after demo run shows expected `event_data` shape per §10.1.