diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index a2c4c12..53c7e37 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -21,7 +21,7 @@
            ┌──────────┼──────────┐
            ▼          ▼          ▼
         FastAPI      CLI
-        (serve)    (da sync)
+        (serve)    (agnes pull)
 ```
 
 Three source types:
@@ -115,14 +115,14 @@ Command-line tool `da` for sync, query, and admin operations.
 
 | Command | Role |
 |---------|------|
-| `da sync` | Trigger data sync |
+| `agnes pull` | Trigger data sync |
 | `agnes query` | Run SQL against analytics.duckdb |
 | `agnes admin group *` | Manage user groups |
 | `agnes admin grant *` | Manage resource grants |
 | `agnes admin register-table` | Register tables in table_registry |
 | `agnes admin break-glass ` | Emergency admin access recovery |
-| `da tokens *` | Manage personal access tokens |
-| `da metrics *` | Business metric definitions |
+| `agnes auth token *` | Manage personal access tokens |
+| `agnes admin metrics *` | Business metric definitions |
 | `agnes skills *` | List/show bundled skills |
 
 ### 5. Authentication (`app/auth/`)
diff --git a/README.md b/README.md
index e8aea1d..500aecd 100644
--- a/README.md
+++ b/README.md
@@ -35,7 +35,7 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
            ┌──────────┼──────────┐
            ▼          ▼          ▼
         FastAPI      CLI
-        (serve)    (da sync)
+        (serve)    (agnes pull)
 ```
 
 ## Supported Data Sources
@@ -47,11 +47,11 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
 | **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
 | **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
 
-The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets.
+The first three modes are what `agnes pull` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `agnes pull`-distributed parquets.
 
 Admins manage per-source registrations through the `/admin/tables` UI (per-connector tabs for BigQuery / Keboola / Jira) or the `agnes admin register-table` CLI; per-row "Manage access" deep-links to `/admin/access` for granting tables to user groups via `resource_grants(group, ResourceType.TABLE, table_id)`.
 
-Analysts get a closed loop with Claude Code: `da analyst setup` writes `/.claude/settings.json` with SessionStart (`da sync --quiet`) and SessionEnd (`da sync --upload-only --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.
+Analysts get a closed loop with Claude Code: `agnes init` writes `/.claude/settings.json` with SessionStart (`agnes pull --quiet`) and SessionEnd (`agnes push --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.
 
 Adding a new source means creating `connectors//extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
 
@@ -86,18 +86,18 @@ curl -X POST http://localhost:8000/api/sync/trigger
 
 ## Local sync & auto-update
 
-Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path:
+Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `agnes pull` is the distribution path:
 
 ```bash
-da sync # delta-pull: manifest → MD5 compare → download changed → rebuild views
-da sync --quiet # same, no progress output (for hooks/cron)
-da sync --upload-only # push session jsonl + CLAUDE.local.md back to the server
+agnes pull # delta-pull: manifest → MD5 compare → download changed → rebuild views
+agnes pull --quiet # same, no progress output (for hooks/cron)
+agnes push # push session jsonl + CLAUDE.local.md back to the server
 ```
 
-`da analyst setup` writes Claude Code lifecycle hooks into `/.claude/settings.json`:
+`agnes init` writes Claude Code lifecycle hooks into `/.claude/settings.json`:
 
-- `SessionStart` → `da sync --quiet` — fresh data on every session
-- `SessionEnd` → `da sync --upload-only --quiet` — uploads notes and session log
+- `SessionStart` → `agnes pull --quiet` — fresh data on every session
+- `SessionEnd` → `agnes push --quiet` — uploads notes and session log
 
 Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
 
@@ -108,7 +108,7 @@ The auto-sync set per analyst is the intersection of:
 1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
 2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
 
-To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`.
+To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `agnes pull`.
 
 For BigQuery, register a `query_mode='materialized'` table with a SQL body:
 
@@ -120,7 +120,7 @@ agnes admin register-table orders_90d \
   --schedule "every 6h"
 ```
 
-The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.
+The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `agnes pull`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.
 
 ## Development Setup
 
@@ -156,7 +156,7 @@ pytest tests/ -v
 │ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
 │ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
 │ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
-├── cli/ # CLI tool (`da sync`, `agnes query`, `agnes admin`)
+├── cli/ # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
 ├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
 ├── scripts/ # Utility + migration scripts
 ├── config/ # Configuration templates (instance.yaml.example)
diff --git a/app/api/admin.py b/app/api/admin.py
index 973ed78..83d38d2 100644
--- a/app/api/admin.py
+++ b/app/api/admin.py
@@ -249,7 +249,7 @@ _KNOWN_FIELDS: dict[str, dict[str, dict]] = {
                 "Cost guardrail for `agnes query --remote` against query_mode='remote' "
                 "BQ rows (dry-run check on the underlying SELECT before execute). "
                 "Bytes processed; exceeds → 400 remote_scan_too_large with a "
-                "`da fetch` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB."
+                "`agnes snapshot create` suggestion. 0 disables the gate. Default 5368709120 = 5 GiB."
             ),
         },
     },
diff --git a/app/api/query.py b/app/api/query.py
index 3389ba9..af97653 100644
--- a/app/api/query.py
+++ b/app/api/query.py
@@ -246,7 +246,7 @@ def _materialized_hint_for_query_error(
     text; a hit means the operator picked a name that exists in the
     registry but isn't queryable in this instance. The hint is the same in
     both arms of the OR — it tells them what the table needs and what
-    they can do today (`da sync` or query `bq."dataset"."table"`
+    they can do today (`agnes pull` or query `bq."dataset"."table"`
     directly using the bucket/source_table from the registry row).
     """
     # Cheap fast-path — only inspect the registry when DuckDB's error
@@ -299,7 +299,7 @@ def _build_materialized_hint(row: dict) -> str:
     return (
         f"Table {tid!r} is registered as query_mode='materialized' but is "
         f"not yet materialized in this instance's analytics views. Run "
-        f"`da sync` (or wait for the scheduler tick / hit POST "
+        f"`agnes pull` (or wait for the scheduler tick / hit POST "
         f"/api/sync/trigger) to materialize the parquet"
         f"{direct_hint}."
     )
@@ -486,7 +486,7 @@ def _bq_quota_and_cap_guard(*, user_id: str, dry_run_set: list, sql: str):
                 "limit_bytes": cap_bytes,
                 "tables": tables,
                 "suggestion": (
-                    "Use `da fetch --select --where "
+                    "Use `agnes snapshot create --select --where "
                     "--estimate` to materialize a filtered subset, then query "
                     "the snapshot locally."
                 ),
diff --git a/app/api/v2_catalog.py b/app/api/v2_catalog.py
index 2eb4533..426b320 100644
--- a/app/api/v2_catalog.py
+++ b/app/api/v2_catalog.py
@@ -39,7 +39,7 @@ def _examples_for(source_type: str) -> list[str]:
 
 def _fetch_hint(table_id: str, source_type: str) -> str:
     if source_type == "bigquery":
-        return f"da fetch {table_id} --select --where '' --limit "
+        return f"agnes snapshot create {table_id} --select --where '' --limit "
     return "already local — query directly via `agnes query`"
 
 
diff --git a/cli/commands/auth.py b/cli/commands/auth.py
index 5297d61..22590a8 100644
--- a/cli/commands/auth.py
+++ b/cli/commands/auth.py
@@ -116,7 +116,7 @@ def import_token(
 
     Decodes the JWT locally to extract the email claim, verifies it against
     the server, and writes it to ~/.config/agnes/token.json using the
-    canonical format so subsequent `agnes auth whoami` / `da sync` calls
+    canonical format so subsequent `agnes auth whoami` / `agnes pull` calls
     authenticate cleanly.
 
     Example:
diff --git a/cli/commands/snapshot.py b/cli/commands/snapshot.py
index f56d427..d2a73c9 100644
--- a/cli/commands/snapshot.py
+++ b/cli/commands/snapshot.py
@@ -235,10 +235,16 @@ def create_cmd(
     # Guard: refuse to create snapshots before `agnes pull` has bootstrapped
     # the local DuckDB. Otherwise we'd open an empty DB and confuse later
     # `agnes pull` runs.
-    local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
-    if not local_db.exists():
-        typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True)
-        raise typer.Exit(1)
+    #
+    # `--estimate` is exempt: it's a server-side dry-run cost check that
+    # never touches the local DuckDB, so it doesn't need the DB to exist
+    # (and analysts use it pre-bootstrap to scope a fetch before deciding
+    # to materialize).
+    if not estimate:
+        local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
+        if not local_db.exists():
+            typer.echo("Local DuckDB not found. Run: agnes pull first.", err=True)
+            raise typer.Exit(1)
 
     snap_dir = _local_dir() / "user" / "snapshots"
     snap_dir.mkdir(parents=True, exist_ok=True)
diff --git a/cli/commands/tokens.py b/cli/commands/tokens.py
index 4b3bda3..e508b04 100644
--- a/cli/commands/tokens.py
+++ b/cli/commands/tokens.py
@@ -45,7 +45,7 @@ def create(
     typer.echo(f"name: {data['name']}")
     typer.echo(f"expires: {data.get('expires_at') or 'never'}")
     typer.echo("")
-    typer.echo("Export it so `da` can use it:")
+    typer.echo("Export it so `agnes` can use it:")
     typer.echo(f" export AGNES_TOKEN={data['token']}")
 
 
diff --git a/cli/skills/agnes-data-querying.md b/cli/skills/agnes-data-querying.md
index 13959ef..01dd546 100644
--- a/cli/skills/agnes-data-querying.md
+++ b/cli/skills/agnes-data-querying.md
@@ -26,21 +26,21 @@ Tables in `agnes catalog` have a `query_mode`:
 | Mode | Means | How to query |
 |------|-------|--------------|
 | `local` | parquet synced on laptop | `agnes query "SELECT …"` directly |
-| `remote` (BigQuery) | parquet NOT on laptop | `da fetch` subset → snapshot, OR `agnes query --remote` one-shot |
+| `remote` (BigQuery) | parquet NOT on laptop | `agnes snapshot create` subset → snapshot, OR `agnes query --remote` one-shot |
 
 For **remote tables**, you MUST either:
-1. `da fetch` a filtered subset → query the local snapshot (preferred), OR
+1. `agnes snapshot create` a filtered subset → query the local snapshot (preferred), OR
 2. `agnes query --remote` for one-shot server-side execution, OR
 3. `agnes query --register-bq` for hybrid joins (rare; see docs)
 
-## The `da fetch` workflow (preferred for remote tables)
+## The `agnes snapshot create` workflow (preferred for remote tables)
 
 ### 1. Estimate first
 
 Always estimate before fetching:
 
 ```bash
-da fetch web_sessions_example \
+agnes snapshot create web_sessions_example \
   --select event_date,country_code,session_id \
   --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
            AND country_code = 'CZ'" \
@@ -52,7 +52,7 @@ Output tells you scan cost, expected rows, and local bytes — so you know if it
 ### 2. If reasonable, fetch to snapshot
 
 ```bash
-da fetch web_sessions_example \
+agnes snapshot create web_sessions_example \
   --select event_date,country_code,session_id \
   --where "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
            AND country_code = 'CZ'" \
@@ -65,7 +65,7 @@ da fetch web_sessions_example \
 agnes query "SELECT event_date, COUNT(*) FROM cz_recent GROUP BY 1 ORDER BY 1"
 ```
 
-## Heuristics for `da fetch`
+## Heuristics for `agnes snapshot create`
 
 | Requirement | Why |
 |-------------|-----|
@@ -97,14 +97,14 @@ For `source_type=keboola` / `source_type=jira` (local), use **DuckDB SQL** in yo
 - Drop with `agnes snapshot drop ` when done with a topic
 - Check total cache size with `agnes disk-info`
 
-## When NOT to use `da fetch`
+## When NOT to use `agnes snapshot create`
 
 | Scenario | Use instead |
 |----------|------------|
 | Single aggregate on remote BASE TABLE (`SELECT COUNT(*)`) | `agnes query --remote "SELECT COUNT(*) FROM web_sessions_example"` — cheap, no fetch needed (Storage Read API pushes the COUNT into BQ) |
-| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `da fetch` suggestion. Pivot to `da fetch --where ''` if rejected. |
+| Single aggregate on remote VIEW/MATERIALIZED_VIEW | Same syntax works (#160) but the BQ jobs API can't push WHERE/COUNT into the view body. Cost guardrail (default 5 GiB) catches expensive scans → 400 `remote_scan_too_large` with `agnes snapshot create` suggestion. Pivot to `agnes snapshot create --where ''` if rejected. |
 | Throwaway exploration with raw BQ syntax | `agnes query --remote "SELECT … FROM "` — direct `bq."".""` paths are now registry-gated (403 `bq_path_not_registered` if not registered). Register first or use the catalog id. |
-| Cross-table JOIN with both remote | Use `da fetch` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) |
+| Cross-table JOIN with both remote | Use `agnes snapshot create` for one side + `agnes query --remote` for the other; full cross-remote JOIN needs design (see #101) |
 
 ## When the table you need isn't in `agnes catalog`
 
@@ -118,7 +118,7 @@ The catalog reads from `system.duckdb::table_registry` — entries land there on
 
 1. **Discover**: `agnes catalog`, `agnes schema`, `agnes describe`
 2. **Check query_mode**: local (direct) or remote (fetch or --remote)?
-3. **For remote**: `--estimate` first, then `da fetch` with `--select` + `--where`
+3. **For remote**: `--estimate` first, then `agnes snapshot create` with `--select` + `--where`
 4. **Snapshot name**: descriptive (`cz_recent`), reuse across questions
 5. **Query**: `agnes query` against snapshot; DuckDB SQL syntax
 6. **Cleanup**: `agnes snapshot drop` when done; `agnes disk-info` to check size
diff --git a/cli/skills/agnes-table-registration.md b/cli/skills/agnes-table-registration.md
index d91cb51..7ee2240 100644
--- a/cli/skills/agnes-table-registration.md
+++ b/cli/skills/agnes-table-registration.md
@@ -5,7 +5,7 @@ description: Use when adding tables to the Agnes catalog so analysts can query t
 
 # Registering tables in Agnes
 
-`agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `da fetch` exists in that registry. This skill is the protocol for getting tables into and out of it.
+`agnes catalog` lists tables from `system.duckdb::table_registry`. A table you can `agnes snapshot create` exists in that registry. This skill is the protocol for getting tables into and out of it.
 
 **Auth:** every command here requires admin role. The CLI sends the active PAT (`agnes auth import-token`); REST examples use `Authorization: Bearer $PAT` against the configured server.
 
@@ -21,7 +21,7 @@ user wants to add tables
 
 ## Before you register — verify the source exists
 
-Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `da fetch` / `agnes query` against it 404s or 500s with an opaque message. Always verify first.
+Registering a table that does NOT exist at the source is silent: the row lands in the registry, but every later `agnes snapshot create` / `agnes query` against it 404s or 500s with an opaque message. Always verify first.
 
 For BigQuery (`source-type=bigquery`):
 
@@ -107,7 +107,7 @@ curl -sS -X DELETE \
   "$AGNES_SERVER_URL/api/admin/registry/"
 ```
 
-Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `da fetch` also remain on the analyst's laptop until they `agnes snapshot drop` them.
+Returns `204 No Content` on success, `404` if the id doesn't exist. **The underlying source data is NOT touched** — only the catalog entry. Local snapshots created via `agnes snapshot create` also remain on the analyst's laptop until they `agnes snapshot drop` them.
 
 ## Heuristics
 
@@ -120,7 +120,7 @@ Returns `204 No Content` on success, `404` if the id doesn't exist. **The underl
 
 - The user wants to inspect a table once, doesn't intend to share it: register the row once with `query_mode='remote'` (admin-only, ~30s) and query it via `agnes query --remote "SELECT … FROM "`. Direct `bq."".""` syntax is now registry-gated — unregistered paths return 403 `bq_path_not_registered` (closes the pre-existing RBAC + cost-cap bypass).
 - The data lives in a third source not yet supported by a connector: implement the connector first (see `connectors.md` skill), then register.
-- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `da fetch --where`.
+- The dataset already has a registered "parent" view that exposes the rows you want: register-table is for distinct catalog entities, not for slicing existing ones — slice with `agnes snapshot create --where`.
 
 ## Confirmation flow
 
diff --git a/config/claude_md_template.txt b/config/claude_md_template.txt
index 8bdda8e..b3320df 100644
--- a/config/claude_md_template.txt
+++ b/config/claude_md_template.txt
@@ -35,7 +35,6 @@ This workspace is connected to {{ server.url }}.
 ## Data Sync
 
 - `agnes pull` — download current data from server
-- `agnes pull --docs-only` — just metadata and metrics (fast refresh)
 - `agnes push` — upload sessions and local notes to server
 - Data on the server refreshes every {{ sync_interval }}
 
diff --git a/tests/test_cli_snapshot_create.py b/tests/test_cli_snapshot_create.py
index 6136d5f..9aa4eef 100644
--- a/tests/test_cli_snapshot_create.py
+++ b/tests/test_cli_snapshot_create.py
@@ -28,9 +28,36 @@ def test_snapshot_create_help():
 
 
 def test_snapshot_create_no_duckdb_friendly_exit(tmp_path, monkeypatch):
+    """Real-fetch path (no --estimate) refuses without a local DuckDB."""
     monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
     runner = CliRunner()
-    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"])
+    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x"])
     assert result.exit_code == 1
     out = result.output + (result.stderr or "")
     assert "Run: agnes pull" in out
+
+
+def test_snapshot_create_estimate_skips_duckdb_guard(tmp_path, monkeypatch):
+    """--estimate is server-side dry-run only; doesn't need local DuckDB.
+
+    Analysts use it pre-bootstrap to scope a fetch before committing to
+    materialize, so the local-DB guard would block the use case it's most
+    useful for. Per Devin review finding ANALYSIS_0004.
+    """
+    monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path))
+    # Stub api_post so we don't actually hit the network — what we care about
+    # is that the guard doesn't fire BEFORE the API call.
+    from unittest.mock import MagicMock
+    fake_resp = MagicMock()
+    fake_resp.status_code = 200
+    fake_resp.json.return_value = {"estimated_scan_bytes": 0, "estimated_rows": 0,
+                                   "estimated_local_bytes": 0, "table_id": "any_table"}
+    monkeypatch.setattr("cli.commands.snapshot.api_post", lambda *a, **kw: fake_resp,
+                        raising=False)
+
+    runner = CliRunner()
+    result = runner.invoke(snapshot_app, ["create", "any_table", "--as", "x", "--estimate"])
+    # Should NOT exit 1 with "Run: agnes pull" — that hint is for the fetch path.
+    out = result.output + (result.stderr or "")
+    assert "Run: agnes pull" not in out, \
+        "--estimate must not be blocked by the local-DuckDB guard"
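
Review note: the guard change in `cli/commands/snapshot.py` can be read as a small decision function, which may make the two test cases above easier to follow. The sketch below is illustrative only and is not code from this PR. The helper name `guard_error` and its signature are hypothetical; the path layout (`user/duckdb/analytics.duckdb`) and the message text are taken from the diff.

```python
import tempfile
from pathlib import Path
from typing import Optional


def guard_error(local_dir: Path, estimate: bool) -> Optional[str]:
    """Return the user-facing error, or None when the command may proceed.

    Hypothetical helper mirroring the guard in the diff: `--estimate` is a
    server-side dry run, so it skips the local-DuckDB check entirely.
    """
    if estimate:
        return None
    local_db = local_dir / "user" / "duckdb" / "analytics.duckdb"
    if not local_db.exists():
        return "Local DuckDB not found. Run: agnes pull first."
    return None


# Usage: an empty workspace blocks a real fetch but not an estimate.
with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    estimate_result = guard_error(workspace, estimate=True)
    fetch_result = guard_error(workspace, estimate=False)
```

Keeping the decision in a pure function like this would let both branches be tested without stubbing `api_post`, though whether that refactor is worthwhile is a judgment call for the authors.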