fix: address Devin Review findings — incomplete renames + estimate guard
13 Devin findings across 10 files:

🔴 Critical:
- app/api/v2_catalog.py:42 — `_fetch_hint` returns `da fetch` in /api/v2/catalog responses (user-visible in every catalog list)
- cli/skills/agnes-data-querying.md — 11 stale `da fetch`/`da sync` refs in the bundled skill markdown
- config/claude_md_template.txt:38 — referenced `agnes pull --docs-only` flag that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/--dry-run)

🟡 Important:
- app/api/admin.py:252 — `da fetch` in bq_max_scan_bytes hint
- cli/commands/auth.py:119 — `da sync` in import-token docstring (--help text)
- cli/commands/tokens.py:48 — "Export it so `da` can use it" prose
- ARCHITECTURE.md — 4 stale rows in CLI commands table
- README.md — stale paragraphs for analysts (da sync, da analyst setup)

🚩 Substantive observations addressed:
- app/api/query.py:249,302,489 — server-side error/help strings still said `da sync`/`da fetch` (returned in API responses to clients)
- cli/commands/snapshot.py:235-241 — DuckDB existence guard incorrectly blocked `--estimate` (server-side dry-run that never opens local DB). Added test ensuring estimate path skips the guard.

Skipped (intentionally historical):
- app/api/admin.py:2377,2429,2437 — historical comments describing past manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.
parent cd8dd9508c
commit 3d58768143
12 changed files with 76 additions and 44 deletions
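The rename leftovers listed in the commit message are the kind of thing a small scanner can surface before review. Below is a minimal sketch, not part of this commit: `RENAMES` and `find_stale_refs` are hypothetical names, and the old-to-new mapping is only what this diff itself shows. Hits still need a human look, since some `da sync` mentions were kept on purpose as historical comments.

```python
import re

# Old `da` spellings and their `agnes` replacements, as they appear in this diff.
# (Illustrative mapping; the real CLI may have renames not visible here.)
RENAMES = {
    "da sync --upload-only": "agnes push",
    "da sync": "agnes pull",
    "da fetch": "agnes snapshot create",
    "da analyst setup": "agnes init",
    "da tokens": "agnes auth token",
    "da metrics": "agnes admin metrics",
}

# Longest spelling first so `da sync --upload-only` wins over plain `da sync`.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(RENAMES, key=len, reverse=True))
    + r")\b"
)


def find_stale_refs(text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, old_command, suggested_replacement) for each hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```

The word boundary in front of `da` keeps words such as "panda sync" from matching; whether a hit is a genuine leftover or an intentional historical mention is left to the reviewer.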
@@ -21,7 +21,7 @@
 ┌──────────┼──────────┐
 ▼ ▼ ▼
 FastAPI CLI
-(serve) (da sync)
+(serve) (agnes pull)
 ```
 
 Three source types:
@@ -115,14 +115,14 @@ Command-line tool `da` for sync, query, and admin operations.
 
 | Command | Role |
 |---------|------|
-| `da sync` | Trigger data sync |
+| `agnes pull` | Trigger data sync |
 | `agnes query` | Run SQL against analytics.duckdb |
 | `agnes admin group *` | Manage user groups |
 | `agnes admin grant *` | Manage resource grants |
 | `agnes admin register-table` | Register tables in table_registry |
 | `agnes admin break-glass <user>` | Emergency admin access recovery |
-| `da tokens *` | Manage personal access tokens |
+| `agnes auth token *` | Manage personal access tokens |
-| `da metrics *` | Business metric definitions |
+| `agnes admin metrics *` | Business metric definitions |
 | `agnes skills *` | List/show bundled skills |
 
 ### 5. Authentication (`app/auth/`)
26 README.md
@@ -35,7 +35,7 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
 ┌──────────┼──────────┐
 ▼ ▼ ▼
 FastAPI CLI
-(serve) (da sync)
+(serve) (agnes pull)
 ```
 
 ## Supported Data Sources
@@ -47,11 +47,11 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
 | **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
 | **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
 
-The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets.
+The first three modes are what `agnes pull` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `agnes pull`-distributed parquets.
 
 Admins manage per-source registrations through the `/admin/tables` UI (per-connector tabs for BigQuery / Keboola / Jira) or the `agnes admin register-table` CLI; per-row "Manage access" deep-links to `/admin/access` for granting tables to user groups via `resource_grants(group, ResourceType.TABLE, table_id)`.
 
-Analysts get a closed loop with Claude Code: `da analyst setup` writes `<workspace>/.claude/settings.json` with SessionStart (`da sync --quiet`) and SessionEnd (`da sync --upload-only --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.
+Analysts get a closed loop with Claude Code: `agnes init` writes `<workspace>/.claude/settings.json` with SessionStart (`agnes pull --quiet`) and SessionEnd (`agnes push --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.
 
 Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
 
@@ -86,18 +86,18 @@ curl -X POST http://localhost:8000/api/sync/trigger
 
 ## Local sync & auto-update
 
-Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path:
+Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `agnes pull` is the distribution path:
 
 ```bash
-da sync # delta-pull: manifest → MD5 compare → download changed → rebuild views
-da sync --quiet # same, no progress output (for hooks/cron)
-da sync --upload-only # push session jsonl + CLAUDE.local.md back to the server
+agnes pull # delta-pull: manifest → MD5 compare → download changed → rebuild views
+agnes pull --quiet # same, no progress output (for hooks/cron)
+agnes push # push session jsonl + CLAUDE.local.md back to the server
 ```
 
-`da analyst setup` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
+`agnes init` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
 
-- `SessionStart` → `da sync --quiet` — fresh data on every session
-- `SessionEnd` → `da sync --upload-only --quiet` — uploads notes and session log
+- `SessionStart` → `agnes pull --quiet` — fresh data on every session
+- `SessionEnd` → `agnes push --quiet` — uploads notes and session log
 
 Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
 
@@ -108,7 +108,7 @@ The auto-sync set per analyst is the intersection of:
 1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
 2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
 
-To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`.
+To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `agnes pull`.
 
 For BigQuery, register a `query_mode='materialized'` table with a SQL body:
 
@@ -120,7 +120,7 @@ agnes admin register-table orders_90d \
 --schedule "every 6h"
 ```
 
-The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.
+The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `agnes pull`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.
 
 ## Development Setup
 
@@ -156,7 +156,7 @@ pytest tests/ -v
 │ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
 │ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
 │ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
-├── cli/ # CLI tool (`da sync`, `agnes query`, `agnes admin`)
+├── cli/ # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
 ├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
 ├── scripts/ # Utility + migration scripts
 ├── config/ # Configuration templates (instance.yaml.example)
@@ -249,7 +249,7 @@ _KNOWN_FIELDS: dict[str, dict[str, dict]] = {
 "Cost guardrail for `agnes query --remote` against query_mode='remote' "
 "BQ rows (dry-run check on the underlying SELECT before execute). "
 "Bytes processed; exceeds → 400 remote_scan_too_large with a "
-"`da fetch` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB."
+"`agnes snapshot create` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB."
 ),
 },
 },
@@ -246,7 +246,7 @@ def _materialized_hint_for_query_error(
 text; a hit means the operator picked a name that exists in the
 registry but isn't queryable in this instance. The hint is the same
 in both arms of the OR — it tells them what the table needs and what
-they can do today (`da sync` or query `bq."dataset"."table"`
+they can do today (`agnes pull` or query `bq."dataset"."table"`
 directly using the bucket/source_table from the registry row).
 """
 # Cheap fast-path — only inspect the registry when DuckDB's error
@@ -299,7 +299,7 @@ def _build_materialized_hint(row: dict) -> str:
     return (
         f"Table {tid!r} is registered as query_mode='materialized' but is "
         f"not yet materialized in this instance's analytics views. Run "
-        f"`da sync` (or wait for the scheduler tick / hit POST "
+        f"`agnes pull` (or wait for the scheduler tick / hit POST "
         f"/api/sync/trigger) to materialize the parquet"
         f"{direct_hint}."
     )
@@ -486,7 +486,7 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
 "limit_bytes": cap_bytes,
 "tables": tables,
 "suggestion": (
-"Use `da fetch <id> --select <cols> --where <predicate> "
+"Use `agnes snapshot create <id> --select <cols> --where <predicate> "
 "--estimate` to materialize a filtered subset, then query "
 "the snapshot locally."
 ),
@@ -39,7 +39,7 @@ def _examples_for(source_type: str) -> list[str]:
 
 def _fetch_hint(table_id: str, source_type: str) -> str:
     if source_type == "bigquery":
-        return f"da fetch {table_id} --select <cols> --where '<BQ predicate>' --limit <N>"
+        return f"agnes snapshot create {table_id} --select <cols> --where '<BQ predicate>' --limit <N>"
     return "already local — query directly via `agnes query`"
 
 
@@ -116,7 +116,7 @@ def import_token(
 
 Decodes the JWT locally to extract the email claim, verifies it
 against the server, and writes it to ~/.config/agnes/token.json using the
-canonical format so subsequent `agnes auth whoami` / `da sync` calls
+canonical format so subsequent `agnes auth whoami` / `agnes pull` calls
 authenticate cleanly.
 
 Example:
@@ -235,10 +235,16 @@ def create_cmd(
     # Guard: refuse to create snapshots before `agnes pull` has bootstrapped
     # the local DuckDB. Otherwise we'd open an empty DB and confuse later
     # `agnes pull` runs.
-    local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
-    if not local_db.exists():
-        typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True)
-        raise typer.Exit(1)
+    #
+    # `--estimate` is exempt: it's a server-side dry-run cost check that
+    # never touches the local DuckDB, so it doesn't need the DB to exist
+    # (and analysts use it pre-bootstrap to scope a fetch before deciding
+    # to materialize).
+    if not estimate:
+        local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
+        if not local_db.exists():
+            typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True)
+            raise typer.Exit(1)
 
     snap_dir = _local_dir() / "user" / "snapshots"
     snap_dir.mkdir(parents=True, exist_ok=True)
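The guard change in `create_cmd` can be read in isolation as a small pure function. This is a sketch under assumptions, not the code in `cli/commands/snapshot.py`: `local_db_guard` is a hypothetical helper that returns the error message instead of calling `typer.echo` / `typer.Exit`, and it takes the local directory as an argument where the real command calls `_local_dir()`.

```python
from pathlib import Path
from typing import Optional


def local_db_guard(local_dir: Path, estimate: bool) -> Optional[str]:
    """Return an error message if the command must be refused, else None.

    Hypothetical stand-in for the inline guard in `create_cmd`.
    """
    if estimate:
        # Server-side dry-run: never opens the local DuckDB, so skip the check.
        return None
    local_db = local_dir / "user" / "duckdb" / "analytics.duckdb"
    if not local_db.exists():
        return "Local DuckDB not found. Run: agnes pull first."
    return None
```

Checking `estimate` first means the filesystem is not touched at all on the dry-run path, which is the behavior the new test further down asserts from the CLI side.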
@@ -45,7 +45,7 @@ def create(
 typer.echo(f"name: {data['name']}")
 typer.echo(f"expires: {data.get('expires_at') or 'never'}")
 typer.echo("")
-typer.echo("Export it so `da` can use it:")
+typer.echo("Export it so `agnes` can use it:")
 typer.echo(f" export AGNES_TOKEN={data['token']}")
 
 
@@ -26,21 +26,21 @@ Tables in `agnes catalog` have a `query_mode`:
 | Mode | Means | How to query |
 |------|-------|--------------|
 | `local` | parquet synced on laptop | `agnes query "SELECT …"` directly |
-| `remote` (BigQuery) | parquet NOT on laptop | `da fetch` subset → snapshot, OR `agnes query --remote` one-shot |
+| `remote` (BigQuery) | parquet NOT on laptop | `agnes snapshot create` subset → snapshot, OR `agnes query --remote` one-shot |
 
 For **remote tables**, you MUST either:
-1. `da fetch` a filtered subset → query the local snapshot (preferred), OR
+1. `agnes snapshot create` a filtered subset → query the local snapshot (preferred), OR
 2. `agnes query --remote` for one-shot server-side execution, OR
 3. `agnes query --register-bq` for hybrid joins (rare; see docs)
 
-## The `da fetch` workflow (preferred for remote tables)
+## The `agnes snapshot create` workflow (preferred for remote tables)
 
 ### 1. Estimate first
 
 Always estimate before fetching:
 
 ```bash
-da fetch web_sessions_example \
+agnes snapshot create web_sessions_example \
 --select event_date,country_code,session_id \
 --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
 AND country_code = 'CZ'" \
@@ -52,7 +52,7 @@ Output tells you scan cost, expected rows, and local bytes — so you know if it
 ### 2. If reasonable, fetch to snapshot
 
 ```bash
-da fetch web_sessions_example \
+agnes snapshot create web_sessions_example \
 --select event_date,country_code,session_id \
 --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
 AND country_code = 'CZ'" \
@@ -65,7 +65,7 @@ da fetch web_sessions_example \
 agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
 ```
 
-## Heuristics for `da fetch`
+## Heuristics for `agnes snapshot create`
 
 | Requirement | Why |
 |-------------|-----|
@@ -97,14 +97,14 @@ For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in yo
 - Drop with `agnes snapshot drop <name>` when done with a topic
 - Check total cache size with `agnes disk-info`
 
-## When NOT to use `da fetch`
+## When NOT to use `agnes snapshot create`
 
 | Scenario | Use instead |
 |----------|------------|
 | Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
-| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `da fetch` suggestion. Pivot to `da fetch <id> --where '<predicate>'` if rejected. |
+| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to `agnes snapshot create <id> --where '<predicate>'` if rejected. |
 | Throwaway exploration with raw BQ syntax | `agnes query --remote "SELECT … FROM <registered_id>"` — direct `bq."<dataset>"."<table>"` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. |
-| Cross-table JOIN with both remote | Use `da fetch` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) |
+| Cross-table JOIN with both remote | Use `agnes snapshot create` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) |
 
 ## When the table you need isn't in `agnes catalog`
 
@@ -118,7 +118,7 @@ The catalog reads from `system.duckdb::table_registry` — entries land there on
 
 1. **Discover**: `agnes catalog`, `agnes schema`, `agnes describe`
 2. **Check query_mode**: local (direct) or remote (fetch or --remote)?
-3. **For remote**: `--estimate` first, then `da fetch` with `--select` + `--where`
+3. **For remote**: `--estimate` first, then `agnes snapshot create` with `--select` + `--where`
 4. **Snapshot name**: descriptive (`cz_recent`), reuse across questions
 5. **Query**: `agnes query` against snapshot; DuckDB SQL syntax
 6. **Cleanup**: `agnes snapshot drop` when done; `agnes disk-info` to check size
@@ -5,7 +5,7 @@ description: Use when adding tables to the Agnes catalog so analysts can query t
 
 # Registering tables in Agnes
 
-`agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `da fetch` exists in that registry. This skill is the protocol for getting tables into and out of it.
+`agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `agnes snapshot create` exists in that registry. This skill is the protocol for getting tables into and out of it.
 
 **Auth:** every command here requires admin role. The CLI sends the active PAT (`agnes auth import-token`); REST examples use `Authorization: Bearer $PAT` against the configured server.
 
@@ -21,7 +21,7 @@ user wants to add tables
 
 ## Before you register — verify the source exists
 
-Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `da fetch` / `agnes query` against it 404s or 500s with an opaque message. Always verify first.
+Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `agnes snapshot create` / `agnes query` against it 404s or 500s with an opaque message. Always verify first.
 
 For BigQuery (`source-type=bigquery`):
 
@@ -107,7 +107,7 @@ curl -sS -X DELETE \
 "$AGNES_SERVER_URL/api/admin/registry/<table_id>"
 ```
 
-Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `da fetch` also remain on the analyst's laptop until they `agnes snapshot drop` them.
+Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `agnes snapshot create` also remain on the analyst's laptop until they `agnes snapshot drop` them.
 
 ## Heuristics
 
@@ -120,7 +120,7 @@ Returns `204 No Content` on success, `404` if the id doesn't exist. **The underl
 
 - The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `agnes query --remote "SELECT … FROM <registered_id>"`. Direct `bq."<dataset>"."<table>"` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass).
 - The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register.
-- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `da fetch --where`.
+- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `agnes snapshot create --where`.
 
 ## Confirmation flow
 
@@ -35,7 +35,6 @@ This workspace is connected to {{ server.url }}.
 
 ## Data Sync
 - `agnes pull` — download current data from server
-- `agnes pull --docs-only` — just metadata and metrics (fast refresh)
 - `agnes push` — upload sessions and local notes to server
 - Data on the server refreshes every {{ sync_interval }}
 
@@ -28,9 +28,36 @@ def test_snapshot_create_help():
 
 
 def test_snapshot_create_no_duckdb_friendly_exit(tmp_path, monkeypatch):
+    """Real-fetch path (no --estimate) refuses without a local DuckDB."""
     monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
     runner = CliRunner()
-    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"])
+    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x"])
     assert result.exit_code == 1
     out = result.output + (result.stderr or "")
     assert "Run: agnes pull" in out
+
+
+def test_snapshot_create_estimate_skips_duckdb_guard(tmp_path, monkeypatch):
+    """--estimate is server-side dry-run only; doesn't need local DuckDB.
+
+    Analysts use it pre-bootstrap to scope a fetch before committing to
+    materialize, so the local-DB guard would block the use case it's most
+    useful for. Per Devin review finding ANALYSIS_0004.
+    """
+    monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
+    # Stub api_post so we don't actually hit the network — what we care about
+    # is that the guard doesn't fire BEFORE the API call.
+    from unittest.mock import MagicMock
+    fake_resp = MagicMock()
+    fake_resp.status_code = 200
+    fake_resp.json.return_value = {"estimated_scan_bytes": 0, "estimated_rows": 0,
+                                   "estimated_local_bytes": 0, "table_id": "any_table"}
+    monkeypatch.setattr("cli.commands.snapshot.api_post", lambda *a, **kw: fake_resp,
+                        raising=False)
+
+    runner = CliRunner()
+    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"])
+    # Should NOT exit 1 with "Run: agnes pull" — that hint is for the fetch path.
+    out = result.output + (result.stderr or "")
+    assert "Run: agnes pull" not in out, \
+        "--estimate must not be blocked by the local-DuckDB guard"