agnes-the-ai-analyst/docs/testing/e2e_clean_analyst_bootstrap.md

12 KiB
Raw Blame History

E2E Verification: clean-analyst-bootstrap (PR #173)

End-to-end verification of the clean-analyst-bootstrap rewrite on a deployed VM. Designed for parallel sub-agent dispatch — Phase 0 prerequisites run sequentially, then 10 parallel slices in a single Claude Code message, then sequential hook test, then aggregation.

Estimated runtime: ~15-20 min total (5 min Phase 0 + 5-10 min Phase 1 parallel + 3 min Phase 2 + 1 min aggregation).

How to use: open Claude Code on the VM in any cwd; paste this entire document into the conversation. Claude Code will execute Phase 0, then dispatch the 10 slices in parallel via Agent tool calls in a single message, then guide you through Phase 2.

Prerequisites — fill these in before starting

  • Server URL: https://<your-agnes-host>
  • Active user: confirm via web login (Google OAuth or password works)
  • Test PAT: mint via web /setup?role=analyst → click Generate prompt → copy clipboard → extract the PAT. Save as $TEST_PAT.
  • Picked test tables:
    • LOCAL_TABLE: any query_mode='local' table from agnes catalog --json
    • REMOTE_TABLE: any query_mode='remote' BigQuery table (preferably small or partitioned)
    • REMOTE_WHERE: a simple WHERE clause that filters REMOTE_TABLE to a small subset (e.g., event_date = DATE '2026-01-01')
    • REMOTE_SELECT: 1-2 columns to select from REMOTE_TABLE

Phase 0 — Prerequisites (sequential, ~5 min)

# 1. Confirm server reachable + on the new build
curl -fsS "$SERVER_URL/api/health" | head -3
curl -fsS "$SERVER_URL/cli/version" 2>/dev/null || true   # should mention 0.32.0+ post-rename

# 2. Confirm /setup?role=analyst renders the role tiles
curl -fsS "$SERVER_URL/setup?role=analyst" | grep -E "role-tiles|Analyst workspace|const ROLE" | head -5
# Expect: at least 3 matches (CSS class, tile heading, JS const)

# 3. Mint analyst PAT (see Prerequisites above) and export it
export SERVER_URL="https://<your-agnes-host>"
export TEST_PAT="agnes_pat_..."

# 4. Bootstrap the BASE workspace that read-only slices share
mkdir -p /tmp/agnes-e2e-base
cd /tmp/agnes-e2e-base
# Paste the install prompt verbatim from /setup?role=analyst.
# After it finishes:
tree -aL 2 /tmp/agnes-e2e-base | head -30

Gate: if the base workspace doesn't have the expected shape (CLAUDE.md, AGNES_WORKSPACE.md, .claude/settings.json, user/duckdb/analytics.duckdb), STOP and report the failure mode — the rest of the plan assumes a working base workspace.


Phase 1 — Parallel slices (dispatch ALL 10 in ONE message, ~5-10 min)

For each slice, dispatch a general-purpose Agent. Send all 10 Agent tool calls in a single response message so they run concurrently. Pass SERVER_URL and TEST_PAT from Phase 0 into each prompt.

Slice 1 — Web UI + role tiles + paste-prompt content

Verify the /setup?role=analyst and /setup?role=admin pages on $SERVER_URL.

Required checks:
- /setup?role=analyst renders 2 role tiles (Analyst workspace + Admin CLI);
  Analyst tile has is-active class, Admin tile is inactive.
- Page contains `const ROLE = "analyst"` (JSON-escaped form, with quotes).
- The JS array `SETUP_INSTRUCTIONS_TEMPLATE` contains `agnes init` and
  `agnes catalog`, does NOT contain `claude plugin marketplace add`,
  `agnes auth import-token`, or `agnes diagnose`.
- /setup?role=admin: admin tile active, prompt contains `agnes auth import-token`.
- /install (no query) returns 302 to /setup.

Tools: curl + grep/python regex.
Report: PASS/FAIL per check + 5 lines of the rendered prompt for both roles.

Slice 2 — agnes init workspace inventory

In /tmp/agnes-e2e-init (NEW empty folder):
  agnes init --server-url $SERVER_URL --token $TEST_PAT --workspace .

Verify EXACTLY this file set is present:
- CLAUDE.md (non-empty, contains "agnes pull")
- AGNES_WORKSPACE.md (does NOT contain $TEST_PAT, no "{placeholder}" leaks,
  contains $SERVER_URL, contains the absolute workspace path,
  has 6 H2 sections matching ^## )
- .claude/settings.json (hooks.SessionStart with `agnes pull --quiet`,
  hooks.SessionEnd with `agnes push --quiet`, has `model` and `permissions.allow`)
- .claude/CLAUDE.local.md (stub content "# My Notes")
- user/duckdb/analytics.duckdb (file exists, non-zero size)

Verify NONE of these exist:
  data/parquet, data/duckdb, data/metadata, user/artifacts, .agnes

Conditional dirs (per lazy-mkdir contract):
- if server/parquet/ exists, must contain ≥1 .parquet
- if .claude/rules/ exists, must contain ≥1 km_*.md

Run: tree -aL 3 /tmp/agnes-e2e-init
Report: full file inventory + PASS/FAIL per assertion.

Slice 3 — Reader smoke matrix (read-only against base workspace)

cd /tmp/agnes-e2e-base
For each command, capture exit code + first stderr line.
Forbidden: any "Traceback" in stderr.

  agnes catalog
  agnes catalog --json
  agnes catalog --metrics
  agnes schema $LOCAL_TABLE
  agnes describe $LOCAL_TABLE -n 5
  agnes status
  agnes status --json
  agnes diagnose
  agnes diagnose system
  agnes disk-info
  agnes auth whoami
  agnes skills list
  agnes skills show agnes-data-querying

Bad-input no-crash:
  agnes schema __nonexistent__
  agnes describe __nonexistent__
  agnes explore __nonexistent__

PASS = all rc ∈ {0, 1}, no Traceback.
Report: command-by-command rc + traceback-Y/N + first stderr line.

Slice 4 — Local + remote query paths

cd /tmp/agnes-e2e-base

Local (parquet/DuckDB) path:
  agnes query "SELECT count(*) FROM $LOCAL_TABLE LIMIT 1"
  agnes query "SELECT * FROM $LOCAL_TABLE LIMIT 3"
  agnes query --json "SELECT count(*) FROM $LOCAL_TABLE"
  agnes explore $LOCAL_TABLE   # friendly even in non-TTY

Remote (BigQuery passthrough) path:
  agnes query --remote "SELECT count(*) FROM $REMOTE_TABLE LIMIT 1"
  agnes query --remote "SELECT count(*) FROM $REMOTE_TABLE" --limit 1

Report: per-query rc, first 5 lines of stdout, traceback-Y/N.
Flag any 502 / cost-guardrail / typed-error envelopes.

Slice 5 — Snapshot lifecycle

cd /tmp/agnes-e2e-base

  agnes snapshot list                                            # baseline
  agnes snapshot create $REMOTE_TABLE --select $REMOTE_SELECT --where '$REMOTE_WHERE' --as e2e_test_snap --estimate
  agnes snapshot create $REMOTE_TABLE --select $REMOTE_SELECT --where '$REMOTE_WHERE' --as e2e_test_snap
  agnes snapshot list                                            # should show e2e_test_snap
  agnes query "SELECT count(*) FROM e2e_test_snap"
  agnes snapshot refresh e2e_test_snap
  agnes snapshot drop e2e_test_snap
  agnes snapshot list                                            # back to baseline

PASS = clean lifecycle, snapshot file present after create + absent after drop.
Report: per-step rc + first stdout line.

Slice 6 — Force / protection scenarios

Setup:
  cp -r /tmp/agnes-e2e-base /tmp/agnes-e2e-force
  cd /tmp/agnes-e2e-force
  echo "# my private edit" > .claude/CLAUDE.local.md

Test 1: re-init without --force should refuse
  agnes init --server-url $SERVER_URL --token $TEST_PAT --workspace .
  Expect rc=1, stderr contains "already initialized" or "partial_state",
  no Traceback.

Test 2: --force regenerates CLAUDE.md but PRESERVES CLAUDE.local.md
  agnes init --server-url $SERVER_URL --token $TEST_PAT --workspace . --force
  Expect rc=0
  Then: cat .claude/CLAUDE.local.md  → must still contain "# my private edit"

Report: PASS/FAIL per test + grep result for the private edit marker.

Slice 7 — Pre-init reader smoke (no-traceback contract)

mkdir -p /tmp/agnes-e2e-pre && cd /tmp/agnes-e2e-pre

For each, capture rc + stderr:
  agnes query "SELECT 1"
  agnes snapshot create __nope__ --as x --estimate
  agnes explore foo
  agnes snapshot list
  agnes status
  agnes catalog
  agnes disk-info

Forbidden: any "Traceback" in stderr.
Expected: rc=1 with friendly hint mentioning "agnes init" or "agnes pull".

Report: per-command rc + traceback-Y/N + did-hint-mention-init-or-pull.

Slice 8 — Auth + token lifecycle

agnes auth whoami     # should print email + role

# Token CRUD
agnes auth token list
TID=$(agnes auth token create e2e-test --expires-in-days 1 | grep -oE 'tok_[a-zA-Z0-9_-]+' | head -1)
agnes auth token list   # new token visible
agnes auth token revoke "$TID"
agnes auth token list   # revoked_at populated for $TID

# Bad token → friendly 401
AGNES_TOKEN=fake-pat agnes catalog 2>&1 | tail -5
# Expect friendly hint, no Traceback

Report: PASS/FAIL per step + the captured TID.

Slice 9 — Admin metrics + catalog --metrics

# Read paths (any analyst)
agnes catalog --metrics
agnes catalog --metrics --show revenue/mrr   # adjust to a real metric path

# Write paths (admin only — skip if the test user isn't admin)
agnes admin metrics --help    # surface check (import/export/validate listed)

If admin:
  agnes admin metrics validate
  agnes admin metrics export /tmp/metrics-backup
  ls /tmp/metrics-backup/ | head

Report: per-command rc + first stdout line.

Slice 10 — AGNES_WORKSPACE.md content quality

cd /tmp/agnes-e2e-base
cat AGNES_WORKSPACE.md | head -100

Programmatic checks:
- Contains "Created:" header line with ISO timestamp
- Contains "Server:" line with $SERVER_URL
- Contains "Workspace:" line with absolute path
- $TEST_PAT does NOT appear anywhere in the file
- No literal "{created_at}", "{server_url}", "{workspace_path}" substrings
- Has exactly 6 H2 sections (^## )
- Cheat sheet section mentions: agnes catalog, agnes query, agnes pull,
  agnes snapshot create, agnes status
- Uninstall section mentions: uv tool uninstall, ~/.config/agnes, ~/.agnes
- For each path in the "Globally installed" table that's a real file path
  (~/.local/bin/agnes, ~/.config/agnes/{config.yaml,token.json}):
  test -e and report exists/missing.

Report: each assertion PASS/FAIL.

Phase 2 — Hook behavior (sequential, ~3 min)

Sub-agents can't open Claude Code sessions, so this part is manual.

cd /tmp/agnes-e2e-base
claude                    # opens Claude Code; SessionStart hook fires `agnes pull --quiet`

# Inside Claude Code, ask: "show me 5 rows of $LOCAL_TABLE"  → should work without errors

/exit                     # SessionEnd hook fires `agnes push --quiet`

ls /tmp/agnes-e2e-base/user/sessions/    # should be non-empty (transcript captured)

# Re-enter to verify SessionStart fires again
claude
/exit

Verify in server audit log: 2× agnes pull GETs (one per session start)

  • 2× agnes push POSTs (one per session end). Tail the server audit endpoint or DB table to confirm.

Phase 3 — Aggregation

Compile a single PASS/FAIL table:

Slice Status Notes
1 — Web UI role tiles
2 — agnes init inventory
3 — Reader smoke matrix
4 — Query paths
5 — Snapshot lifecycle
6 — Force / protection
7 — Pre-init no-traceback
8 — Auth + tokens
9 — Admin metrics
10 — AGNES_WORKSPACE.md
Hooks (Phase 2)

For any FAIL: preserve the failing folder/output, report the exact command

  • first traceback / first stderr line. Don't fix in place — flag for follow-up.

Cleanup after testing

rm -rf /tmp/agnes-e2e-base /tmp/agnes-e2e-init /tmp/agnes-e2e-force /tmp/agnes-e2e-pre /tmp/metrics-backup
agnes auth token revoke <e2e-test PAT id>

Slice priority

If time is constrained, run these load-bearing slices first:

  1. Slice 1 (Web UI role tiles) — proves the web entry point works.
  2. Slice 2 (agnes init inventory) — proves the bootstrap creates the exact expected file set with no dead dirs.
  3. Slice 7 (pre-init no-traceback) — proves the reader contract holds.
  4. Slice 10 (AGNES_WORKSPACE.md content) — proves the human-facing docs render correctly with no PAT leak.

Slices 3-6, 8-9 are breadth coverage — important but lower-priority.