fix: address Devin Review findings — incomplete renames + estimate guard

13 Devin findings across 10 files:

🔴 Critical:
- app/api/v2_catalog.py:42 — `_fetch_hint` returns `da fetch` in /api/v2/catalog
  responses (user-visible in every catalog list)
- cli/skills/agnes-data-querying.md — 11 stale `da fetch`/`da sync` refs in the
  bundled skill markdown
- config/claude_md_template.txt:38 — referenced `agnes pull --docs-only` flag
  that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/
  --dry-run)

🟡 Important:
- app/api/admin.py:252 — `da fetch` in bq_max_scan_bytes hint
- cli/commands/auth.py:119 — `da sync` in import-token docstring (--help text)
- cli/commands/tokens.py:48 — "Export it so `da` can use it" prose
- ARCHITECTURE.md — 4 stale rows in CLI commands table
- README.md — stale paragraphs for analysts (da sync, da analyst setup)

🚩 Substantive observations addressed:
- app/api/query.py:249,302,489 — server-side error/help strings still said
  `da sync`/`da fetch` (returned in API responses to clients)
- cli/commands/snapshot.py:235-241 — DuckDB existence guard incorrectly
  blocked `--estimate` (server-side dry-run that never opens local DB).
  Added test ensuring estimate path skips the guard.

Skipped (intentionally historical):
- app/api/admin.py:2377,2429,2437 — historical comments describing past
  manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.
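
A sweep for leftover references like the ones listed above can be approximated with a small scanner. This is a minimal sketch, not the tool Devin Review uses: the regex, the function names `find_stale_refs` and `scan_tree`, and the suffix list are all assumptions made for illustration. Intentionally historical mentions (such as the skipped admin.py comments) would still be reported and need manual triage.

```python
import re
from pathlib import Path

# Assumed pattern: the legacy `da <subcommand>` spellings named in this commit.
STALE = re.compile(r"\bda (?:fetch|sync|analyst setup|tokens|metrics)\b")


def find_stale_refs(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_text) for each legacy command mention."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in STALE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_tree(root: Path, suffixes: tuple[str, ...] = (".py", ".md", ".txt")) -> dict[str, list[tuple[int, str]]]:
    """Scan files under root and map each path with hits to its matches."""
    report = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            hits = find_stale_refs(path.read_text(encoding="utf-8", errors="replace"))
            if hits:
                report[str(path)] = hits
    return report
```

Running `scan_tree(Path("."))` from the repository root would list candidate files; an empty dict means no legacy spellings matched the assumed pattern.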
ZdenekSrotyr 2026-05-04 20:05:06 +02:00
parent cd8dd9508c
commit 3d58768143
12 changed files with 76 additions and 44 deletions


@ -21,7 +21,7 @@
┌──────────┼──────────┐ ┌──────────┼──────────┐
▼ ▼ ▼ ▼ ▼ ▼
FastAPI CLI FastAPI CLI
(serve) (da sync) (serve) (agnes pull)
``` ```
Three source types: Three source types:
@ -115,14 +115,14 @@ Command-line tool `da` for sync, query, and admin operations.
| Command | Role | | Command | Role |
|---------|------| |---------|------|
| `da sync` | Trigger data sync | | `agnes pull` | Trigger data sync |
| `agnes query` | Run SQL against analytics.duckdb | | `agnes query` | Run SQL against analytics.duckdb |
| `agnes admin group *` | Manage user groups | | `agnes admin group *` | Manage user groups |
| `agnes admin grant *` | Manage resource grants | | `agnes admin grant *` | Manage resource grants |
| `agnes admin register-table` | Register tables in table_registry | | `agnes admin register-table` | Register tables in table_registry |
| `agnes admin break-glass <user>` | Emergency admin access recovery | | `agnes admin break-glass <user>` | Emergency admin access recovery |
| `da tokens *` | Manage personal access tokens | | `agnes auth token *` | Manage personal access tokens |
| `da metrics *` | Business metric definitions | | `agnes admin metrics *` | Business metric definitions |
| `agnes skills *` | List/show bundled skills | | `agnes skills *` | List/show bundled skills |
### 5. Authentication (`app/auth/`) ### 5. Authentication (`app/auth/`)


@ -35,7 +35,7 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
┌──────────┼──────────┐ ┌──────────┼──────────┐
▼ ▼ ▼ ▼ ▼ ▼
FastAPI CLI FastAPI CLI
(serve) (da sync) (serve) (agnes pull)
``` ```
## Supported Data Sources ## Supported Data Sources
@ -47,11 +47,11 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
| **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable | | **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
| **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness | | **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets. The first three modes are what `agnes pull` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `agnes pull`-distributed parquets.
Admins manage per-source registrations through the `/admin/tables` UI (per-connector tabs for BigQuery / Keboola / Jira) or the `agnes admin register-table` CLI; per-row "Manage access" deep-links to `/admin/access` for granting tables to user groups via `resource_grants(group, ResourceType.TABLE, table_id)`. Admins manage per-source registrations through the `/admin/tables` UI (per-connector tabs for BigQuery / Keboola / Jira) or the `agnes admin register-table` CLI; per-row "Manage access" deep-links to `/admin/access` for granting tables to user groups via `resource_grants(group, ResourceType.TABLE, table_id)`.
Analysts get a closed loop with Claude Code: `da analyst setup` writes `<workspace>/.claude/settings.json` with SessionStart (`da sync --quiet`) and SessionEnd (`da sync --upload-only --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back. Analysts get a closed loop with Claude Code: `agnes init` writes `<workspace>/.claude/settings.json` with SessionStart (`agnes pull --quiet`) and SessionEnd (`agnes push --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.
Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically. Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
@ -86,18 +86,18 @@ curl -X POST http://localhost:8000/api/sync/trigger
## Local sync & auto-update ## Local sync & auto-update
Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path: Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `agnes pull` is the distribution path:
```bash ```bash
da sync # delta-pull: manifest → MD5 compare → download changed → rebuild views agnes pull # delta-pull: manifest → MD5 compare → download changed → rebuild views
da sync --quiet # same, no progress output (for hooks/cron) agnes pull --quiet # same, no progress output (for hooks/cron)
da sync --upload-only # push session jsonl + CLAUDE.local.md back to the server agnes push # push session jsonl + CLAUDE.local.md back to the server
``` ```
`da analyst setup` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`: `agnes init` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
- `SessionStart` → `da sync --quiet` — fresh data on every session - `SessionStart` → `agnes pull --quiet` — fresh data on every session
- `SessionEnd` → `da sync --upload-only --quiet` — uploads notes and session log - `SessionEnd` → `agnes push --quiet` — uploads notes and session log
Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine. Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
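
The delta-pull described in this hunk (manifest, MD5 compare, download changed) can be sketched as follows. This is an illustration under stated assumptions, not the `agnes pull` implementation: the manifest shape (file name mapped to an MD5 hex digest) and the names `md5_of` and `files_to_download` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Stream a file through MD5 so large parquets are not loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_download(manifest: dict[str, str], local_dir: Path) -> list[str]:
    """Return manifest entries that are missing locally or whose MD5 differs.

    `manifest` maps a relative file name to its expected MD5 hex digest
    (assumed shape; the real manifest format is not shown in this diff).
    """
    changed = []
    for name, expected in sorted(manifest.items()):
        local = local_dir / name
        if not local.exists() or md5_of(local) != expected:
            changed.append(name)
    return changed
```

Only the names returned by `files_to_download` would need to be fetched before views are rebuilt; unchanged files are skipped.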
@ -108,7 +108,7 @@ The auto-sync set per analyst is the intersection of:
1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest 1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md)) 2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`. To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `agnes pull`.
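
The intersection rule above can be expressed as a short set computation. This is a sketch with assumed data shapes (table rows as dicts with `id` and `query_mode`, grants as `(group, table_id)` pairs); the function name `auto_sync_set` is hypothetical and the real lookup goes through `resource_grants`.

```python
def auto_sync_set(tables: list[dict], grants: list[tuple[str, str]], analyst_groups: set[str]) -> set[str]:
    """Tables that are both parquet-backed and granted to one of the analyst's groups."""
    # Condition 1: only local/materialized tables have parquets on disk.
    syncable = {t["id"] for t in tables if t["query_mode"] in ("local", "materialized")}
    # Condition 2: the table is granted to at least one of the analyst's groups.
    granted = {table_id for group, table_id in grants if group in analyst_groups}
    return syncable & granted
```

A `remote` table never appears in the result even when granted, which matches the statement that remote tables are not distributed as parquets.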
For BigQuery, register a `query_mode='materialized'` table with a SQL body: For BigQuery, register a `query_mode='materialized'` table with a SQL body:
@ -120,7 +120,7 @@ agnes admin register-table orders_90d \
--schedule "every 6h" --schedule "every 6h"
``` ```
The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped. The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `agnes pull`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.
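
The cost guardrail can be read as: a materialization is skipped when its BigQuery dry-run estimate is larger than the configured limit. A minimal sketch of that check follows; the function name `partition_by_budget`, the input shape, and the inclusive boundary (an estimate equal to the limit still runs) are assumptions, not confirmed scheduler behavior.

```python
GIB = 1024 ** 3
# Mirrors the documented default for data_source.bigquery.max_bytes_per_materialize.
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB


def partition_by_budget(
    estimates: dict[str, int],
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> tuple[list[str], list[str]]:
    """Split table ids into (run, skipped) based on dry-run byte estimates."""
    run, skipped = [], []
    for table_id, estimated_bytes in sorted(estimates.items()):
        if estimated_bytes <= max_bytes:
            run.append(table_id)
        else:
            skipped.append(table_id)
    return run, skipped
```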
## Development Setup ## Development Setup
@ -156,7 +156,7 @@ pytest tests/ -v
│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback) │ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension) │ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb │ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
├── cli/ # CLI tool (`da sync`, `agnes query`, `agnes admin`) ├── cli/ # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.) ├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/ # Utility + migration scripts ├── scripts/ # Utility + migration scripts
├── config/ # Configuration templates (instance.yaml.example) ├── config/ # Configuration templates (instance.yaml.example)


@ -249,7 +249,7 @@ _KNOWN_FIELDS: dict[str, dict[str, dict]] = {
"Cost guardrail for `agnes query --remote` against query_mode='remote' " "Cost guardrail for `agnes query --remote` against query_mode='remote' "
"BQ rows (dry-run check on the underlying SELECT before execute). " "BQ rows (dry-run check on the underlying SELECT before execute). "
"Bytes processed; exceeds → 400 remote_scan_too_large with a " "Bytes processed; exceeds → 400 remote_scan_too_large with a "
"`da fetch` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB." "`agnes snapshot create` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB."
), ),
}, },
}, },


@ -246,7 +246,7 @@ def _materialized_hint_for_query_error(
text; a hit means the operator picked a name that exists in the text; a hit means the operator picked a name that exists in the
registry but isn't queryable in this instance. The hint is the same registry but isn't queryable in this instance. The hint is the same
in both arms of the OR it tells them what the table needs and what in both arms of the OR it tells them what the table needs and what
they can do today (`da sync` or query `bq."dataset"."table"` they can do today (`agnes pull` or query `bq."dataset"."table"`
directly using the bucket/source_table from the registry row). directly using the bucket/source_table from the registry row).
""" """
# Cheap fast-path — only inspect the registry when DuckDB's error # Cheap fast-path — only inspect the registry when DuckDB's error
@ -299,7 +299,7 @@ def _build_materialized_hint(row: dict) -> str:
return ( return (
f"Table {tid!r} is registered as query_mode='materialized' but is " f"Table {tid!r} is registered as query_mode='materialized' but is "
f"not yet materialized in this instance's analytics views. Run " f"not yet materialized in this instance's analytics views. Run "
f"`da sync` (or wait for the scheduler tick / hit POST " f"`agnes pull` (or wait for the scheduler tick / hit POST "
f"/api/sync/trigger) to materialize the parquet" f"/api/sync/trigger) to materialize the parquet"
f"{direct_hint}." f"{direct_hint}."
) )
@ -486,7 +486,7 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
"limit_bytes": cap_bytes, "limit_bytes": cap_bytes,
"tables": tables, "tables": tables,
"suggestion": ( "suggestion": (
"Use `da fetch <id> --select <cols> --where <predicate> " "Use `agnes snapshot create <id> --select <cols> --where <predicate> "
"--estimate` to materialize a filtered subset, then query " "--estimate` to materialize a filtered subset, then query "
"the snapshot locally." "the snapshot locally."
), ),


@ -39,7 +39,7 @@ def _examples_for(source_type: str) -> list[str]:
def _fetch_hint(table_id: str, source_type: str) -> str: def _fetch_hint(table_id: str, source_type: str) -> str:
if source_type == "bigquery": if source_type == "bigquery":
return f"da fetch {table_id} --select <cols> --where '<BQ predicate>' --limit <N>" return f"agnes snapshot create {table_id} --select <cols> --where '<BQ predicate>' --limit <N>"
return "already local — query directly via `agnes query`" return "already local — query directly via `agnes query`"


@ -116,7 +116,7 @@ def import_token(
Decodes the JWT locally to extract the email claim, verifies it Decodes the JWT locally to extract the email claim, verifies it
against the server, and writes it to ~/.config/agnes/token.json using the against the server, and writes it to ~/.config/agnes/token.json using the
canonical format so subsequent `agnes auth whoami` / `da sync` calls canonical format so subsequent `agnes auth whoami` / `agnes pull` calls
authenticate cleanly. authenticate cleanly.
Example: Example:


@ -235,10 +235,16 @@ def create_cmd(
# Guard: refuse to create snapshots before `agnes pull` has bootstrapped # Guard: refuse to create snapshots before `agnes pull` has bootstrapped
# the local DuckDB. Otherwise we'd open an empty DB and confuse later # the local DuckDB. Otherwise we'd open an empty DB and confuse later
# `agnes pull` runs. # `agnes pull` runs.
local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb" #
if not local_db.exists(): # `--estimate` is exempt: it's a server-side dry-run cost check that
typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True) # never touches the local DuckDB, so it doesn't need the DB to exist
raise typer.Exit(1) # (and analysts use it pre-bootstrap to scope a fetch before deciding
# to materialize).
if not estimate:
local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
if not local_db.exists():
typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True)
raise typer.Exit(1)
snap_dir = _local_dir() / "user" / "snapshots" snap_dir = _local_dir() / "user" / "snapshots"
snap_dir.mkdir(parents=True, exist_ok=True) snap_dir.mkdir(parents=True, exist_ok=True)
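
The guard change in this hunk can be isolated into a pure function for reasoning and unit testing. This is a sketch only: `guard_error` is a hypothetical helper, not part of `cli/commands/snapshot.py`, and it returns the message instead of calling `typer.echo` and `typer.Exit`.

```python
from pathlib import Path
from typing import Optional


def guard_error(local_dir: Path, estimate: bool) -> Optional[str]:
    """Return the refusal message, or None when the command may proceed."""
    if estimate:
        # Server-side dry run: never touches the local DuckDB, so no check.
        return None
    local_db = local_dir / "user" / "duckdb" / "analytics.duckdb"
    if not local_db.exists():
        return "Local DuckDB not found. Run: agnes pull first."
    return None
```

Keeping the decision separate from the CLI side effects makes both branches testable without a `CliRunner` or a stubbed network call.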


@ -45,7 +45,7 @@ def create(
typer.echo(f"name: {data['name']}") typer.echo(f"name: {data['name']}")
typer.echo(f"expires: {data.get('expires_at') or 'never'}") typer.echo(f"expires: {data.get('expires_at') or 'never'}")
typer.echo("") typer.echo("")
typer.echo("Export it so `da` can use it:") typer.echo("Export it so `agnes` can use it:")
typer.echo(f" export AGNES_TOKEN={data['token']}") typer.echo(f" export AGNES_TOKEN={data['token']}")


@ -26,21 +26,21 @@ Tables in `agnes catalog` have a `query_mode`:
| Mode | Means | How to query | | Mode | Means | How to query |
|------|-------|--------------| |------|-------|--------------|
| `local` | parquet synced on laptop | `agnes query "SELECT …"` directly | | `local` | parquet synced on laptop | `agnes query "SELECT …"` directly |
| `remote` (BigQuery) | parquet NOT on laptop | `da fetch` subset → snapshot, OR `agnes query --remote` one-shot | | `remote` (BigQuery) | parquet NOT on laptop | `agnes snapshot create` subset → snapshot, OR `agnes query --remote` one-shot |
For **remote tables**, you MUST either: For **remote tables**, you MUST either:
1. `da fetch` a filtered subset → query the local snapshot (preferred), OR 1. `agnes snapshot create` a filtered subset → query the local snapshot (preferred), OR
2. `agnes query --remote` for one-shot server-side execution, OR 2. `agnes query --remote` for one-shot server-side execution, OR
3. `agnes query --register-bq` for hybrid joins (rare; see docs) 3. `agnes query --register-bq` for hybrid joins (rare; see docs)
## The `da fetch` workflow (preferred for remote tables) ## The `agnes snapshot create` workflow (preferred for remote tables)
### 1. Estimate first ### 1. Estimate first
Always estimate before fetching: Always estimate before fetching:
```bash ```bash
da fetch web_sessions_example \ agnes snapshot create web_sessions_example \
--select event_date,country_code,session_id \ --select event_date,country_code,session_id \
--where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND country_code = 'CZ'" \ AND country_code = 'CZ'" \
@ -52,7 +52,7 @@ Output tells you scan cost, expected rows, and local bytes — so you know if it
### 2. If reasonable, fetch to snapshot ### 2. If reasonable, fetch to snapshot
```bash ```bash
da fetch web_sessions_example \ agnes snapshot create web_sessions_example \
--select event_date,country_code,session_id \ --select event_date,country_code,session_id \
--where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND country_code = 'CZ'" \ AND country_code = 'CZ'" \
@ -65,7 +65,7 @@ da fetch web_sessions_example \
agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1" agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
``` ```
## Heuristics for `da fetch` ## Heuristics for `agnes snapshot create`
| Requirement | Why | | Requirement | Why |
|-------------|-----| |-------------|-----|
@ -97,14 +97,14 @@ For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in yo
- Drop with `agnes snapshot drop <name>` when done with a topic - Drop with `agnes snapshot drop <name>` when done with a topic
- Check total cache size with `agnes disk-info` - Check total cache size with `agnes disk-info`
## When NOT to use `da fetch` ## When NOT to use `agnes snapshot create`
| Scenario | Use instead | | Scenario | Use instead |
|----------|------------| |----------|------------|
| Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) | | Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `da fetch` suggestion. Pivot to `da fetch <id> --where '<predicate>'` if rejected. | | Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to `agnes snapshot create <id> --where '<predicate>'` if rejected. |
| Throwaway exploration with raw BQ syntax | `agnes query --remote "SELECT … FROM <registered_id>"` — direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. | | Throwaway exploration with raw BQ syntax | `agnes query --remote "SELECT … FROM <registered_id>"` — direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. |
| Cross-table JOIN with both remote | Use `da fetch` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) | | Cross-table JOIN with both remote | Use `agnes snapshot create` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) |
## When the table you need isn't in `agnes catalog` ## When the table you need isn't in `agnes catalog`
@ -118,7 +118,7 @@ The catalog reads from `system.duckdb::table_registry` — entries land there on
1. **Discover**: `agnes catalog`, `agnes schema`, `agnes describe` 1. **Discover**: `agnes catalog`, `agnes schema`, `agnes describe`
2. **Check query_mode**: local (direct) or remote (fetch or --remote)? 2. **Check query_mode**: local (direct) or remote (fetch or --remote)?
3. **For remote**: `--estimate` first, then `da fetch` with `--select` + `--where` 3. **For remote**: `--estimate` first, then `agnes snapshot create` with `--select` + `--where`
4. **Snapshot name**: descriptive (`cz_recent`), reuse across questions 4. **Snapshot name**: descriptive (`cz_recent`), reuse across questions
5. **Query**: `agnes query` against snapshot; DuckDB SQL syntax 5. **Query**: `agnes query` against snapshot; DuckDB SQL syntax
6. **Cleanup**: `agnes snapshot drop` when done; `agnes disk-info` to check size 6. **Cleanup**: `agnes snapshot drop` when done; `agnes disk-info` to check size


@ -5,7 +5,7 @@ description: Use when adding tables to the Agnes catalog so analysts can query t
# Registering tables in Agnes # Registering tables in Agnes
`agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `da fetch` exists in that registry. This skill is the protocol for getting tables into and out of it. `agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `agnes snapshot create` exists in that registry. This skill is the protocol for getting tables into and out of it.
**Auth:** every command here requires admin role. The CLI sends the active PAT (`agnes auth import-token`); REST examples use `Authorization: Bearer $PAT` against the configured server. **Auth:** every command here requires admin role. The CLI sends the active PAT (`agnes auth import-token`); REST examples use `Authorization: Bearer $PAT` against the configured server.
@ -21,7 +21,7 @@ user wants to add tables
## Before you register — verify the source exists ## Before you register — verify the source exists
Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `da fetch` / `agnes query` against it 404s or 500s with an opaque message. Always verify first. Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `agnes snapshot create` / `agnes query` against it 404s or 500s with an opaque message. Always verify first.
For BigQuery (`source-type=bigquery`): For BigQuery (`source-type=bigquery`):
@ -107,7 +107,7 @@ curl -sS -X DELETE \
"$AGNES_SERVER_URL/api/admin/registry/<table_id>" "$AGNES_SERVER_URL/api/admin/registry/<table_id>"
``` ```
Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `da fetch` also remain on the analyst's laptop until they `agnes snapshot drop` them. Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `agnes snapshot create` also remain on the analyst's laptop until they `agnes snapshot drop` them.
## Heuristics ## Heuristics
@ -120,7 +120,7 @@ Returns `204 No Content` on success, `404` if the id doesn't exist. **The underl
- The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `agnes query --remote "SELECT … FROM <registered_id>"`. Direct `bq."<dataset>"."<table>"` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass). - The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `agnes query --remote "SELECT … FROM <registered_id>"`. Direct `bq."<dataset>"."<table>"` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass).
- The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register. - The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register.
- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `da fetch --where`. - The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `agnes snapshot create --where`.
## Confirmation flow ## Confirmation flow


@ -35,7 +35,6 @@ This workspace is connected to {{ server.url }}.
## Data Sync ## Data Sync
- `agnes pull` — download current data from server - `agnes pull` — download current data from server
- `agnes pull --docs-only` — just metadata and metrics (fast refresh)
- `agnes push` — upload sessions and local notes to server - `agnes push` — upload sessions and local notes to server
- Data on the server refreshes every {{ sync_interval }} - Data on the server refreshes every {{ sync_interval }}


@ -28,9 +28,36 @@ def test_snapshot_create_help():
def test_snapshot_create_no_duckdb_friendly_exit(tmp_path, monkeypatch): def test_snapshot_create_no_duckdb_friendly_exit(tmp_path, monkeypatch):
"""Real-fetch path (no --estimate) refuses without a local DuckDB."""
monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path)) monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
runner = CliRunner() runner = CliRunner()
result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"]) result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x"])
assert result.exit_code == 1 assert result.exit_code == 1
out = result.output + (result.stderr or "") out = result.output + (result.stderr or "")
assert "Run: agnes pull" in out assert "Run: agnes pull" in out
def test_snapshot_create_estimate_skips_duckdb_guard(tmp_path, monkeypatch):
"""--estimate is server-side dry-run only; doesn't need local DuckDB.
Analysts use it pre-bootstrap to scope a fetch before committing to
materialize, so the local-DB guard would block the use case it's most
useful for. Per Devin review finding ANALYSIS_0004.
"""
monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
# Stub api_post so we don't actually hit the network — what we care about
# is that the guard doesn't fire BEFORE the API call.
from unittest.mock import MagicMock
fake_resp = MagicMock()
fake_resp.status_code = 200
fake_resp.json.return_value = {"estimated_scan_bytes": 0, "estimated_rows": 0,
"estimated_local_bytes": 0, "table_id": "any_table"}
monkeypatch.setattr("cli.commands.snapshot.api_post", lambda *a, **kw: fake_resp,
raising=False)
runner = CliRunner()
result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"])
# Should NOT exit 1 with "Run: agnes pull" — that hint is for the fetch path.
out = result.output + (result.stderr or "")
assert "Run: agnes pull" not in out, \
"--estimate must not be blocked by the local-DuckDB guard"