fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154)
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template

Closes #153.

The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt) covered metrics, sync, corporate memory, and directory layout — but had ZERO mention of query_mode: "remote", da fetch, da query --remote, or --register-bq. Result: the AI analyst running in a freshly-bootstrapped workspace had no idea BigQuery-backed tables existed, no path to fetch unsynced data, and no fallback for tables not in the catalog.

Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md on 2026-05-01: section confirmed missing. Workspace-level (parent-dir) CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level file (which Claude reads as primary project context) had nothing.

## Changes

### config/claude_md_template.txt (+83)

Added a `## Remote Queries (BigQuery)` section covering:

- Discovery first — `da catalog --json | jq '...'` to see all tables with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
  - `da fetch` (preferred) — materialize a filtered subset locally, query the snapshot, drop when done.
  - `da query --remote` — one-shot server-side execution (cheap probes).
  - `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on --select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal, DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all, use ad-hoc `--register-bq` if the agnes server SA has BQ access, or ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.

### docs/setup/claude_md_template.txt (deleted)

Stale 359-line template that documented the deprecated SSH-heredoc remote_query.sh protocol. No code references it (verified via grep across .py / .sh / .yml / .md). Removing eliminates two failure modes:

1. A future refactor accidentally pulling it into a workspace and shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.

### CHANGELOG.md

`### Fixed` and `### Removed` entries under [Unreleased].

## Tested

- Manually walked the diff against `da skills show agnes-data-querying` output on a live VM (foundryai-development) — patterns + flags match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern is identical to existing template substitution path so render is not at risk.

## Out of scope

- The companion gap that data_description.md often only enumerates query_mode: "local" tables (no signal that other modes exist) — separate concern, fix likely belongs in the metadata generator on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as `query_mode: "remote"` in the registry — workflow improvement, not a code bug.

* chore(release): cut 0.28.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
parent d4ac84dd46
commit bd7b8c3233
4 changed files with 136 additions and 363 deletions
10
CHANGELOG.md
|
|
@@ -10,6 +10,16 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
## [Unreleased]
|
||||
|
||||
## [0.28.0] — 2026-05-01
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Analyst CLAUDE.md template now documents BigQuery remote-query capability.** `config/claude_md_template.txt` (used by `da analyst setup`) had **zero mention** of `query_mode: "remote"`, `da fetch`, `da query --remote`, or `--register-bq` — the AI analyst running in a freshly-bootstrapped workspace had no idea remote tables existed. Added a `## Remote Queries (BigQuery)` section covering: discovery via `da catalog` (now called out as canonical, with `data/metadata/schema.json` flagged as local-only); the three query patterns (`da fetch` preferred, `da query --remote` for one-shots, `da query --register-bq` for hybrid joins); permission boundary (BQ access via the agnes server's GCE service account, not personal creds — escalate permission errors to admin); cost awareness (every query bills the SA's project for bytes scanned, `--select`/`--where`/`--estimate` discipline); `da fetch` estimate-first rules; BigQuery SQL flavor reminder; snapshot freshness ritual (`da snapshot drop` + re-fetch when source data updates); concrete hybrid-query example with `--register-bq` joining local + ad-hoc BQ; the unknown-table case (ad-hoc `--register-bq` or ask admin to register); and a cross-reference to `da skills show agnes-data-querying` for deeper guidance. Also clarifies that **personal customizations belong in `.claude/CLAUDE.local.md`**, not CLAUDE.md (which is regenerated by `da analyst setup --force` and would lose edits). Closes #153.
|
||||
|
||||
### Removed
|
||||
|
||||
- **Legacy `docs/setup/claude_md_template.txt` deleted.** 359-line stale template that documented the deprecated SSH-heredoc remote-query protocol (`ssh data-analyst 'bash ~/server/scripts/remote_query.sh --stdin' < query.json`). The active template lives at `config/claude_md_template.txt`; the docs/ copy caused confusion over which template is canonical and risked being pulled into a workspace by a future refactor. No code references the deleted file (verified).
|
||||
|
||||
## [0.27.0] — 2026-04-30
|
||||
|
||||
### Removed
|
||||
|
|
|
|||
|
|
@@ -4,10 +4,11 @@ This workspace is connected to {server_url}.
|
|||
|
||||
## Rules
|
||||
- Before computing any business metric: run `da metrics show <category>/<name>`
|
||||
-- For current schema: read `data/metadata/schema.json`
-- Do not use DESCRIBE/SHOW COLUMNS — read metadata files instead
+- **For canonical table list with query modes: `da catalog`.** `data/metadata/schema.json` covers `query_mode: "local"` tables only — for remote/hybrid tables it's incomplete. Treat `da catalog` as source of truth.
+- Do not use DESCRIBE/SHOW COLUMNS — use `da schema <table>` instead
|
||||
- Save work output to `user/artifacts/`
|
||||
- Sync data regularly with `da sync`
|
||||
- **Personal customizations go in `.claude/CLAUDE.local.md`, NOT here.** This file is regenerated by `da analyst setup --force`; edits here will be lost. CLAUDE.local.md is preserved across regeneration and uploaded on `da sync --upload-only`.
|
||||
|
||||
## Metrics Workflow
|
||||
1. `da metrics list` — find the relevant metric
|
||||
|
|
@@ -21,6 +22,127 @@ This workspace is connected to {server_url}.
|
|||
- `da sync --upload-only` — upload sessions and local notes to server
|
||||
- Data on the server refreshes every {sync_interval}
|
||||
|
||||
## Remote Queries (BigQuery) — when data isn't on the laptop
|
||||
|
||||
Not every table is synced. Tables registered with `query_mode: "remote"` live in
|
||||
BigQuery, accessed server-side via DuckDB's BQ extension — no parquet on disk.
|
||||
Tables you don't see in `data/parquet/` may still be queryable.
|
||||
|
||||
### Discovery first
|
||||
|
||||
```
|
||||
da catalog --json | jq '.[] | {name, source_type, query_mode}' # see all tables + their modes
|
||||
da schema <table> # columns + types
|
||||
da describe <table> -n 5 # sample rows
|
||||
```
|
||||
|
||||
For local-mode tables, query directly with `da query "SELECT … FROM <table>"`.
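
The discovery step above can also be scripted. This is a minimal Python sketch, assuming `da catalog --json` emits a JSON array of objects that carry at least `name`, `source_type`, and `query_mode` (the fields the `jq` filter selects). The sample payload, its `source_type` values, and the helper name `tables_by_mode` are hypothetical and only illustrate the idea.

```python
import json

# Hypothetical sample shaped like the fields the jq filter above selects;
# real `da catalog --json` output may carry more fields or other values.
SAMPLE_CATALOG = """
[
  {"name": "orders", "source_type": "example_source", "query_mode": "local"},
  {"name": "sessions", "source_type": "bigquery", "query_mode": "remote"},
  {"name": "events", "source_type": "bigquery", "query_mode": "remote"}
]
"""


def tables_by_mode(catalog_json: str) -> dict[str, list[str]]:
    """Group table names by their query_mode, preserving catalog order."""
    grouped: dict[str, list[str]] = {}
    for entry in json.loads(catalog_json):
        grouped.setdefault(entry["query_mode"], []).append(entry["name"])
    return grouped


modes = tables_by_mode(SAMPLE_CATALOG)
print("remote tables:", modes.get("remote", []))
```

Grouping by mode up front makes it easy to decide per table whether a plain `da query` is enough or one of the remote patterns is needed.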
|
||||
|
||||
### Three patterns for `query_mode: "remote"` tables
|
||||
|
||||
| Pattern | Tool | Use when |
|
||||
|---------|------|----------|
|
||||
| **`da fetch`** (preferred) | materializes a filtered subset locally → query the snapshot | repeated questions on same slice |
|
||||
| **`da query --remote`** | one-shot, server-side execution against BigQuery | single aggregate / cheap probe |
|
||||
| **`da query --register-bq`** | hybrid joins between local snapshots and ad-hoc BQ subqueries | crossing local + remote |
|
||||
|
||||
### Permission model + cost — important
|
||||
|
||||
- BQ access goes through the **agnes server's GCE service account**, not your personal Google credentials. If a query fails with a permission error, the table is in a project the server SA cannot read — escalate to admin, do NOT try to authenticate yourself.
|
||||
- Every BQ query bills the SA's GCP project for **bytes scanned**. A naive `SELECT * FROM <large_table>` can cost real money. ALWAYS:
|
||||
- filter via `--where` on the partition column (typically a date)
|
||||
- list specific columns in `--select`: BigQuery storage is columnar, so columns you do not select are not scanned, which makes the query cheaper
|
||||
- run `--estimate` first when unsure of the table size or partitioning
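
To make the bytes-scanned billing concrete, here is a rough Python sketch that converts a scan size into an approximate cost. It assumes you have a bytes-scanned figure available (whether `--estimate` reports one in this form is not confirmed by this template) and takes the price per TiB as a parameter, since the actual rate depends on the SA project's pricing. The function name and the numbers are hypothetical.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def estimated_cost_usd(bytes_scanned: int, price_per_tib_usd: float) -> float:
    """Convert a bytes-scanned figure into an approximate cost in USD."""
    if bytes_scanned < 0 or price_per_tib_usd < 0:
        raise ValueError("inputs must be non-negative")
    return bytes_scanned / TIB * price_per_tib_usd


# Hypothetical numbers: a 200 GiB scan at an assumed price of 5 USD per TiB.
print(round(estimated_cost_usd(200 * 1024 ** 3, 5.0), 2))
```

The point of the arithmetic is proportionality: halving the scanned bytes through `--select` and `--where` halves the bill.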
|
||||
|
||||
### `da fetch` discipline
|
||||
|
||||
```
|
||||
# 1. ESTIMATE first — refuses to fetch without knowing the cost
|
||||
da fetch <table> --select col1,col2 --where "date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)" --estimate
|
||||
|
||||
# 2. If reasonable, fetch as a named snapshot
|
||||
da fetch <table> --select col1,col2 --where "..." --as my_recent
|
||||
|
||||
# 3. Query the local snapshot
|
||||
da query "SELECT col1, COUNT(*) FROM my_recent GROUP BY 1"
|
||||
|
||||
# 4. List + drop snapshots when done
|
||||
da snapshot list
|
||||
da snapshot drop my_recent
|
||||
```
|
||||
|
||||
Rules of thumb:
|
||||
- ALWAYS list specific columns in `--select`. Avoid implicit SELECT *.
|
||||
- ALWAYS include a `--where` for remote tables; otherwise add `--limit`.
|
||||
- ALWAYS run `--estimate` first when `da schema` shows the table has `partition_by` / `clustered_by` set, or when the result could plausibly exceed 1 GB of local bytes.
|
||||
- Reuse snapshots across questions in the same conversation — `da snapshot list`
|
||||
before fetching.
|
||||
|
||||
### Snapshot freshness — when to refresh
|
||||
|
||||
Snapshots are point-in-time copies. They go stale as the source data updates (most BQ tables refresh daily; check `sync_schedule` per `da catalog`). For each new conversation:
|
||||
|
||||
```
|
||||
da snapshot list # see existing snapshots + their ages
|
||||
da snapshot drop my_recent # drop stale ones
|
||||
da fetch <table> --select ... --where ... --as my_recent # re-fetch
|
||||
```
|
||||
|
||||
If the question is time-sensitive (e.g. "today's orders"), assume any snapshot taken before the table's most recent scheduled refresh (per `sync_schedule`) is stale and refresh it.
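
The freshness rule reduces to a timestamp comparison. The sketch below assumes you know when the snapshot was fetched and when the source last refreshed; how those two timestamps are obtained (for example from `da snapshot list` and `da catalog`) is left open because their output formats are not described here. The function name and the dates are hypothetical.

```python
from datetime import datetime, timezone


def is_stale(fetched_at: datetime, last_source_refresh: datetime) -> bool:
    """A snapshot is stale if the source refreshed after it was fetched."""
    return fetched_at < last_source_refresh


# Hypothetical timestamps: fetched yesterday morning, source refreshed overnight.
fetched = datetime(2026, 4, 30, 8, 0, tzinfo=timezone.utc)
refreshed = datetime(2026, 5, 1, 2, 0, tzinfo=timezone.utc)
print(is_stale(fetched, refreshed))
```

Using timezone-aware datetimes avoids comparing a local laptop time against a server time by accident.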
|
||||
|
||||
### Hybrid query example — local + remote in one query
|
||||
|
||||
`da query --register-bq` lets a single SQL statement join a local table with an ad-hoc BQ subquery. The BQ subquery runs first (server-side), its result is registered as a DuckDB view, and then the joined query runs locally.
|
||||
|
||||
```
|
||||
da query \
|
||||
--register-bq "traffic=SELECT date, country, SUM(views) AS views \
|
||||
FROM \`prj.web_analytics.sessions\` \
|
||||
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) \
|
||||
GROUP BY 1, 2" \
|
||||
--sql "SELECT o.date, o.country, o.revenue, t.views, o.revenue / NULLIF(t.views,0) AS rev_per_view \
|
||||
FROM orders o \
|
||||
JOIN traffic t ON o.date = t.date AND o.country = t.country \
|
||||
ORDER BY 1 DESC"
|
||||
```
|
||||
|
||||
The BQ subquery MUST contain `WHERE` and/or `GROUP BY` to keep the registered result manageable (target: under 500K rows, well under 100 MB). Multiple `--register-bq` flags can compose multiple BQ sources. For complex SQL, use `--stdin` mode (`echo '{"register_bq":{...},"sql":"..."}' | da query --stdin`).
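
For the `--stdin` mode, building the JSON in code sidesteps shell escaping of backticks and quotes. This is a minimal Python sketch that assembles the `register_bq` / `sql` object shown above and rejects BQ subqueries lacking `WHERE` or `GROUP BY`. The keyword check is a deliberately naive regex, the helper name `build_stdin_payload` is hypothetical, and any additional payload fields the CLI may accept are not covered.

```python
import json
import re

# Naive keyword check; it does not parse SQL and can be fooled by string literals.
_REDUCING_CLAUSE = re.compile(r"\b(WHERE|GROUP\s+BY)\b", re.IGNORECASE)


def build_stdin_payload(register_bq: dict[str, str], sql: str) -> str:
    """Serialize a hybrid query spec for `da query --stdin`."""
    for alias, subquery in register_bq.items():
        if not _REDUCING_CLAUSE.search(subquery):
            raise ValueError(f"BQ subquery {alias!r} needs WHERE and/or GROUP BY")
    return json.dumps({"register_bq": register_bq, "sql": sql})


payload = build_stdin_payload(
    {
        "traffic": "SELECT date, SUM(views) AS views "
        "FROM `prj.web_analytics.sessions` "
        "WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1"
    },
    "SELECT o.date, o.revenue, t.views FROM orders o JOIN traffic t ON o.date = t.date",
)
print(payload)
# The payload could then be piped to the CLI, for example:
# subprocess.run(["da", "query", "--stdin"], input=payload, text=True, check=True)
```

Because `json.dumps` handles all quoting, the SQL strings can contain backticks and double quotes without any manual escaping.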
|
||||
|
||||
### BigQuery SQL flavor for `--where`
|
||||
|
||||
Source-typed `bigquery` tables use BigQuery dialect, not DuckDB:
|
||||
|
||||
- Date literal: `DATE '2026-01-01'`
|
||||
- Timestamp literal: `TIMESTAMP '2026-01-01 00:00:00 UTC'`
|
||||
- Now: `CURRENT_DATE()`, `CURRENT_TIMESTAMP()`
|
||||
- Date arithmetic: `DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)`
|
||||
- Regex: `REGEXP_CONTAINS(col, r'pattern')` (raw string!)
|
||||
- Cast: `CAST(x AS INT64)` (NOT `INT`)
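
When `--where` strings are generated programmatically, small helpers keep them in BigQuery dialect. The sketch below renders the date literal and date-arithmetic forms from the list above. The helper names are hypothetical, and column names are interpolated as-is, so they are assumed to come from a trusted source such as `da schema`.

```python
from datetime import date


def bq_date_literal(d: date) -> str:
    """Render a Python date as a BigQuery DATE literal."""
    return f"DATE '{d.isoformat()}'"


def since(column: str, start: date) -> str:
    """Filter rows on or after a fixed date."""
    return f"{column} >= {bq_date_literal(start)}"


def last_n_days(column: str, days: int) -> str:
    """Filter rows from the trailing window ending today."""
    if days <= 0:
        raise ValueError("days must be positive")
    return f"{column} >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"


print(since("date", date(2026, 1, 1)))
print(last_n_days("date", 30))
```

Either string can be passed directly as the value of `--where`.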
|
||||
|
||||
### When the table you want isn't in `da catalog`
|
||||
|
||||
The table may exist in BigQuery but not be registered with Agnes yet. Two options:
|
||||
|
||||
1. **Ad-hoc one-shot** — register a BQ subquery as a view inline, no admin needed
|
||||
if the agnes server SA has BQ access:
|
||||
```
|
||||
da query --register-bq "live=SELECT * FROM \`project.dataset.table\` WHERE date >= '...' LIMIT 1000" \
|
||||
--sql "SELECT * FROM live"
|
||||
```
|
||||
2. **Ask admin to register** the table with `query_mode: "remote"` so it shows up
|
||||
in `da catalog` and supports `da fetch` / `da query --remote`. This is the
|
||||
right path for any table you'll query repeatedly.
|
||||
|
||||
### Deeper guidance
|
||||
|
||||
For the full protocol, including hybrid-query examples, snapshot hygiene, and
|
||||
when NOT to use `da fetch`, run:
|
||||
|
||||
```
|
||||
da skills show agnes-data-querying
|
||||
```
|
||||
|
||||
## Corporate Memory
|
||||
|
||||
Rules injected by `da sync` from the server's corporate knowledge base live in `.claude/rules/km_*.md`. They are automatically loaded by Claude Code on every session start.
|
||||
|
|
@@ -38,4 +160,4 @@ Run `da sync` to refresh. Rules are pruned automatically when items are revoked.
|
|||
- `user/` — your workspace (persistent across syncs)
|
||||
- `user/artifacts/` — analysis outputs, reports, charts
|
||||
- `user/sessions/` — Claude Code session logs
|
||||
-- `.claude/CLAUDE.local.md` — your personal notes (never overwritten, uploaded on sync)
+- `.claude/CLAUDE.local.md` — your personal notes + workspace customizations. **Never overwritten by `da analyst setup --force`.** Uploaded to the server on `da sync --upload-only`. Put any local-only Claude instructions, project-specific reminders, or temporary notes here — NOT in CLAUDE.md (this file is regenerated from a template).
|
||||
|
|
|
|||
|
|
@@ -1,359 +0,0 @@
|
|||
# CLAUDE.md
|
||||
|
||||
Project context file for **AI Data Analyst** - local analytics environment with access to your organization's internal data.
|
||||
|
||||
## Quick Status
|
||||
|
||||
| Property | Value |
|
||||
|----------|-------|
|
||||
| **Project Type** | AI Data Analyst |
|
||||
| **Database** | DuckDB at `user/duckdb/analytics.duckdb` |
|
||||
| **Data Source** | {ssh_alias} server ({server_host}) |
|
||||
| **Data Format** | Parquet files in `server/parquet/` |
|
||||
| **Analyst** | {username} |
|
||||
|
||||
---
|
||||
|
||||
## CRITICAL: Always Start Here
|
||||
|
||||
### 1. Sync Data When Starting
|
||||
|
||||
**MANDATORY: Automatically run sync in these situations:**
|
||||
- This is a new session (first interaction today)
|
||||
- The session is from a previous day or older
|
||||
- Data may be stale (updated multiple times daily on server)
|
||||
- The user explicitly requests fresh data
|
||||
|
||||
```bash
|
||||
bash server/scripts/sync_data.sh
|
||||
```
|
||||
|
||||
This updates data, scripts, documentation, and CLAUDE.md.
|
||||
|
||||
### 2. Read Schema Documentation Before Writing SQL
|
||||
|
||||
**MANDATORY: Before writing ANY SQL query, you MUST read the relevant documentation files:**
|
||||
|
||||
#### For table structure (columns, types, descriptions):
|
||||
|
||||
```bash
|
||||
# ALWAYS read this FIRST before querying tables
|
||||
cat server/docs/schema.yml
|
||||
```
|
||||
|
||||
- **NEVER use DESCRIBE, SHOW COLUMNS, or similar commands**
|
||||
- **NEVER guess column names**
|
||||
- schema.yml contains: all column names, types, descriptions, primary keys
|
||||
|
||||
#### For table relationships (joins, foreign keys):
|
||||
|
||||
```bash
|
||||
# Read this for understanding relationships between tables
|
||||
cat server/docs/data_description.md
|
||||
```
|
||||
|
||||
- Contains primary/foreign keys, sync strategies, and table descriptions
|
||||
- Essential for writing correct JOIN queries
|
||||
|
||||
#### For additional dataset schemas (if available):
|
||||
|
||||
```bash
|
||||
# Check for additional dataset schemas
|
||||
ls server/docs/datasets/ 2>/dev/null
|
||||
```
|
||||
|
||||
### 3. Read Metrics Definitions (if available)
|
||||
|
||||
**Before calculating ANY business metric, check for metric definitions:**
|
||||
|
||||
```bash
|
||||
# Check if metrics index exists
|
||||
cat server/docs/metrics/metrics.yml 2>/dev/null
|
||||
|
||||
# Or list available metric files
|
||||
ls server/docs/metrics/ 2>/dev/null
|
||||
```
|
||||
|
||||
If metric definitions exist, always read the specific metric file before calculating.
|
||||
Do not calculate metrics from memory - the formulas contain critical details.
|
||||
|
||||
---
|
||||
|
||||
## Directory Structure
|
||||
|
||||
```
|
||||
project_root/
|
||||
├── server/ # READ-ONLY - synced from server
|
||||
│ ├── docs/ # Documentation
|
||||
│ │ ├── data_description.md # Table relationships and descriptions
|
||||
│ │ ├── schema.yml # Table schemas and column definitions
|
||||
│ │ ├── metrics/ # Metric definitions (if available)
|
||||
│ │ └── datasets/ # Additional dataset docs (if available)
|
||||
│ ├── scripts/ # Helper scripts (sync_data.sh, setup_views.sh)
|
||||
│ ├── examples/ # Example scripts (if available)
|
||||
│ └── parquet/ # Synced parquet data files
|
||||
│
|
||||
├── user/ # YOUR WORKSPACE - never overwritten
|
||||
│ ├── duckdb/ # DuckDB database (analytics.duckdb)
|
||||
│ ├── artifacts/ # Analysis outputs, charts, exports
|
||||
│ └── scripts/ # Your custom scripts
|
||||
│
|
||||
├── .claude/ # Claude Code config
|
||||
├── .venv/ # Python virtual environment
|
||||
├── CLAUDE.md # This file (auto-updated from server)
|
||||
└── CLAUDE.local.md # Your personal notes (never overwritten)
|
||||
```
|
||||
|
||||
**Never modify files in `server/` - they are overwritten on every sync.**
|
||||
|
||||
---
|
||||
|
||||
## Essential Commands
|
||||
|
||||
```bash
|
||||
# Data freshness and sync
|
||||
bash server/scripts/sync_data.sh # Sync latest data from server
|
||||
|
||||
# DuckDB management
|
||||
bash server/scripts/setup_views.sh # Recreate DuckDB views
|
||||
|
||||
# Python environment
|
||||
source .venv/bin/activate # Activate venv (macOS/Linux)
|
||||
.venv/Scripts/activate # Activate venv (Windows)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Start
|
||||
|
||||
### List all tables
|
||||
|
||||
```python
|
||||
import duckdb
|
||||
con = duckdb.connect('user/duckdb/analytics.duckdb')
|
||||
tables = con.execute("SHOW TABLES;").fetchall()
|
||||
for table in tables:
|
||||
print(table[0])
|
||||
con.close()
|
||||
```
|
||||
|
||||
### Query data
|
||||
|
||||
```bash
|
||||
# Read schema first, then query
|
||||
cat server/docs/schema.yml
|
||||
```
|
||||
|
||||
```python
|
||||
import duckdb
|
||||
con = duckdb.connect('user/duckdb/analytics.duckdb')
|
||||
# Write your query based on schema.yml column definitions
|
||||
result = con.execute("SELECT * FROM your_table LIMIT 10").fetchdf()
|
||||
print(result)
|
||||
con.close()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Startup Checklist
|
||||
|
||||
When starting a new session:
|
||||
|
||||
1. **Sync latest data**
|
||||
```bash
|
||||
bash server/scripts/sync_data.sh
|
||||
```
|
||||
|
||||
2. **Verify database exists**
|
||||
```bash
|
||||
ls -lh user/duckdb/analytics.duckdb
|
||||
```
|
||||
|
||||
You're ready to analyze!
|
||||
|
||||
---
|
||||
|
||||
## Important Reminders
|
||||
|
||||
- Always read `server/docs/schema.yml` before writing SQL queries
|
||||
- Always read `server/docs/data_description.md` for table relationships and joins
|
||||
- Check `server/docs/metrics/` for metric definitions before calculating business metrics
|
||||
- Use DuckDB views, not direct parquet file reads
|
||||
- Never modify files in `server/` - they're read-only
|
||||
|
||||
---
|
||||
|
||||
## Remote Queries (BigQuery)
|
||||
|
||||
Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
|
||||
These tables have `query_mode: "remote"` in `server/docs/data_description.md`.
|
||||
|
||||
**IMPORTANT: When remote tables exist, proactively offer hybrid analyses that combine
|
||||
local and remote data.** For example, if the user asks for a business overview, suggest
|
||||
joining local order data with remote traffic data to show a complete picture (conversion
|
||||
funnels, revenue per visitor, etc.). Don't wait for the user to ask -- hybrid insights
|
||||
are more valuable than single-source analysis.
|
||||
|
||||
### How to recognize remote tables
|
||||
|
||||
Before writing any query, read `server/docs/data_description.md`. Each table has:
|
||||
- `query_mode: "local"` -- available as a local DuckDB view (query normally)
|
||||
- `query_mode: "remote"` -- NOT in local DuckDB, must use remote query protocol below
|
||||
- `query_mode: "hybrid"` -- local view exists AND can query BQ for live data
|
||||
|
||||
### Remote table metadata in data_description.md
|
||||
|
||||
Remote tables include metadata to help you write safe queries:
|
||||
|
||||
- **`volume`** -- rows_per_day, unique entities per day (tells you table size)
|
||||
- **`columns`** -- column names, types, value distributions
|
||||
- **`dimension_profile`** -- cardinality per dimension with value distributions
|
||||
- **`query_result_estimates`** -- expected row counts after GROUP BY combinations
|
||||
- **`join_keys`** -- how to join with other tables
|
||||
|
||||
**ALWAYS read these sections before writing a remote query.** Use `query_result_estimates`
|
||||
to predict how many rows your query will return. The server has limited RAM -- keep BQ
|
||||
sub-query results under 500K rows.
|
||||
|
||||
### Two-phase query protocol
|
||||
|
||||
Remote queries run **on the server** via SSH (server has DuckDB + Parquet + BigQuery access).
|
||||
You write two SQL statements:
|
||||
|
||||
1. **BQ sub-query** (`--register-bq "alias=SQL"`) -- runs on BigQuery, result registered in DuckDB as a view.
|
||||
This MUST contain WHERE and/or GROUP BY to reduce the result set. Never SELECT * from a remote table.
|
||||
2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
|
||||
Can JOIN local tables with registered BQ results.
|
||||
|
||||
### Command format (JSON file via stdin)
|
||||
|
||||
**IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
|
||||
backticks, quotes, and parentheses in SQL. Use the Write tool to create a JSON query
|
||||
spec file, then pipe it to SSH via stdin redirect:
|
||||
|
||||
**Step 1:** Use the Write tool to create a JSON file (e.g., `user/scripts/rq_query.json`):
|
||||
```json
|
||||
{
|
||||
"register_bq": {
|
||||
"ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
|
||||
},
|
||||
"sql": "SELECT ... FROM ALIAS JOIN local_table ...",
|
||||
"format": "table"
|
||||
}
|
||||
```
|
||||
|
||||
**Step 2:** Run the query via SSH with stdin redirect:
|
||||
```bash
|
||||
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
|
||||
```
|
||||
|
||||
**NEVER use `cat <<HEREDOC | ssh ...`** -- the `cat` command is blocked by permissions.
|
||||
Always write the JSON to a file first using the Write tool, then use `< file` redirect.
|
||||
|
||||
**JSON fields:**
|
||||
- `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
|
||||
- `"register_bq"` (optional) -- Object mapping alias names to BigQuery SQL queries
|
||||
- `"format"` (optional) -- `"table"`, `"csv"`, `"json"`, or `"parquet"` (default: `"table"`)
|
||||
- `"output"` (optional) -- File path for parquet/csv/json output
|
||||
- `"max_rows"` (optional) -- Override max result rows
|
||||
|
||||
### Example 1: Remote-only query (aggregated data)
|
||||
|
||||
Write to `user/scripts/rq_query.json`:
|
||||
```json
|
||||
{
|
||||
"register_bq": {
|
||||
"agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
|
||||
},
|
||||
"sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
|
||||
"format": "table"
|
||||
}
|
||||
```
|
||||
|
||||
Then run:
|
||||
```bash
|
||||
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
|
||||
```
|
||||
|
||||
### Example 2: JOIN local + remote
|
||||
|
||||
Write to `user/scripts/rq_query.json`:
|
||||
```json
|
||||
{
|
||||
"register_bq": {
|
||||
"remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
|
||||
},
|
||||
"sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
|
||||
"format": "table"
|
||||
}
|
||||
```
|
||||
|
||||
Then run:
|
||||
```bash
|
||||
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
|
||||
```
|
||||
|
||||
### Example 3: Download result as Parquet for local analysis
|
||||
|
||||
Write to `user/scripts/rq_query.json`:
|
||||
```json
|
||||
{
|
||||
"register_bq": {
|
||||
"remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
|
||||
},
|
||||
"sql": "SELECT ... FROM local_table JOIN remote_data ...",
|
||||
"format": "parquet",
|
||||
"output": "/tmp/remote_query/analysis.parquet"
|
||||
}
|
||||
```
|
||||
|
||||
Then run:
|
||||
```bash
|
||||
# 1. Run query on server
|
||||
ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
|
||||
|
||||
# 2. Download to local machine
|
||||
scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
|
||||
|
||||
# 3. Register in local DuckDB for further analysis
|
||||
python3 -c "
|
||||
import duckdb
|
||||
conn = duckdb.connect('user/duckdb/analytics.duckdb')
|
||||
conn.execute(\"CREATE OR REPLACE VIEW analysis AS SELECT * FROM read_parquet('user/parquet/analysis.parquet')\")
|
||||
print('View created:', conn.execute('SELECT COUNT(*) FROM analysis').fetchone()[0], 'rows')
|
||||
conn.close()
|
||||
"
|
||||
```
|
||||
|
||||
### How to estimate result sizes
|
||||
|
||||
Before writing a BQ sub-query, check `dimension_profile` and `query_result_estimates`
|
||||
in `server/docs/data_description.md`.
|
||||
|
||||
**Rule of thumb:** rows = (estimate per day from query_result_estimates) * (number of days in WHERE clause).
|
||||
If that exceeds 100K rows, add more aggregation or tighter date filters.
|
||||
|
||||
### Safety rules
|
||||
|
||||
1. **NEVER** run `SELECT * FROM remote_table` without WHERE + GROUP BY
|
||||
2. **ALWAYS** check `dimension_profile` before writing BQ sub-queries
|
||||
3. **ALWAYS** include date range in WHERE clause
|
||||
4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
|
||||
5. If the query might take > 60 seconds, use nohup pattern:
|
||||
```bash
|
||||
# Write query spec to user/scripts/rq_query.json first, then:
|
||||
ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &' < user/scripts/rq_query.json
|
||||
ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress
|
||||
scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Reporting Issues
|
||||
|
||||
Report issues to your platform team or the project's issue tracker.
|
||||
|
||||
Include:
|
||||
- Error messages or unexpected behavior
|
||||
- Steps to reproduce
|
||||
- Output of `bash server/scripts/sync_data.sh`
|
||||
|
|
@@ -1,6 +1,6 @@
|
|||
[project]
|
||||
name = "agnes-the-ai-analyst"
|
||||
-version = "0.27.0"
+version = "0.28.0"
|
||||
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
||||
requires-python = ">=3.11,<3.14"
|
||||
license = "MIT"
|
||||
|
|
|
|||