Fix remote query UX: file-based stdin, ssh permissions, deprecation
Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
   claude_settings.json had `cat` in deny list, causing 2-3 failed
   attempts per query. Replaced with file-based approach: Write tool
   creates JSON file, then `ssh ... < file` avoids the cat deny.
2. ssh/scp commands were not in the allow list, requiring manual
   approval for every remote query. Added both to allow list.
3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
   parquet export. Replaced with .arrow().read_all().

Also added instruction for proactive hybrid analysis when remote tables
are available (agent was only using local data until asked).
parent 8c6c162417
commit 84d14da611
3 changed files with 45 additions and 22 deletions
@@ -188,6 +188,12 @@ You're ready to analyze!
 Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
 These tables have `query_mode: "remote"` in `server/docs/data_description.md`.
 
+**IMPORTANT: When remote tables exist, proactively offer hybrid analyses that combine
+local and remote data.** For example, if the user asks for a business overview, suggest
+joining local order data with remote traffic data to show a complete picture (conversion
+funnels, revenue per visitor, etc.). Don't wait for the user to ask -- hybrid insights
+are more valuable than single-source analysis.
+
 ### How to recognize remote tables
 
 Before writing any query, read `server/docs/data_description.md`. Each table has:
@@ -219,13 +225,14 @@ You write two SQL statements:
 2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
 Can JOIN local tables with registered BQ results.
 
-### Command format (JSON via stdin -- ALWAYS use this)
+### Command format (JSON file via stdin)
 
 **IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
-backticks, quotes, and parentheses in SQL. Write a heredoc with the JSON query spec:
+backticks, quotes, and parentheses in SQL. Use the Write tool to create a JSON query
+spec file, then pipe it to SSH via stdin redirect:
 
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+**Step 1:** Use the Write tool to create a JSON file (e.g., `user/scripts/rq_query.json`):
+```json
 {
 "register_bq": {
 "ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -233,11 +240,15 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT ... FROM ALIAS JOIN local_table ...",
 "format": "table"
 }
-QUERY
 ```
 
-The `<<'QUERY'` heredoc passes SQL **literally** -- no escaping needed for backticks,
-single quotes, parentheses, or any other special characters.
+**Step 2:** Run the query via SSH with stdin redirect:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
+```
+
+**NEVER use `cat <<HEREDOC | ssh ...`** -- the `cat` command is blocked by permissions.
+Always write the JSON to a file first using the Write tool, then use `< file` redirect.
 
 **JSON fields:**
 - `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
@@ -248,8 +259,8 @@ single quotes, parentheses, or any other special characters.
 
 ### Example 1: Remote-only query (aggregated data)
 
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
@@ -257,13 +268,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
 "format": "table"
 }
-QUERY
+```
+
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```
 
 ### Example 2: JOIN local + remote
 
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
@@ -271,14 +286,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
 "format": "table"
 }
-QUERY
+```
+
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```
 
 ### Example 3: Download result as Parquet for local analysis
 
-```bash
-# 1. Run query, save as Parquet on server
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -287,7 +305,12 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "format": "parquet",
 "output": "/tmp/remote_query/analysis.parquet"
 }
-QUERY
+```
+
+Then run:
+```bash
+# 1. Run query on server
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 
 # 2. Download to local machine
 scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
@@ -318,10 +341,8 @@ If that exceeds 100K rows, add more aggregation or tighter date filters.
 4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
 5. If the query might take > 60 seconds, use nohup pattern:
 ```bash
-# Write query to temp file, then run via nohup
-cat <<'QUERY' | ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &'
-{"register_bq": {"data": "SELECT ..."}, "sql": "SELECT ...", "format": "parquet", "output": "/tmp/remote_query/result.parquet"}
-QUERY
+# Write query spec to user/scripts/rq_query.json first, then:
+ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &' < user/scripts/rq_query.json
 ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress
 scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
 ```
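For readers who want to script the two-step flow from the template change (write the JSON spec to a file, then feed it to `ssh` on stdin), here is a minimal Python sketch. It is an illustration under assumptions, not part of the commit: the helper names `write_query_spec` and `run_remote_query` are hypothetical, the spec path is only the example path the template mentions, and the only validation is the `"sql"` field that the template marks as required.

```python
import json
import subprocess
from pathlib import Path

# Example path taken from the template; any writable location would do.
SPEC_PATH = Path("user/scripts/rq_query.json")
# Remote command exactly as it appears in the template.
REMOTE_CMD = "bash ~/server/scripts/remote_query.sh --stdin"


def write_query_spec(spec: dict, path: Path = SPEC_PATH) -> Path:
    """Step 1: serialize the query spec to a JSON file (hypothetical helper)."""
    if "sql" not in spec:
        raise ValueError('"sql" is the field the template marks as required')
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps handles backticks and quotes in SQL; no shell escaping is involved.
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


def run_remote_query(ssh_alias: str, path: Path = SPEC_PATH) -> subprocess.CompletedProcess:
    """Step 2: same effect as `ssh ALIAS '<remote cmd>' < path` (hypothetical helper)."""
    with path.open("rb") as spec_file:
        # The file is handed to ssh as stdin, so neither cat nor a heredoc is needed.
        return subprocess.run(
            ["ssh", ssh_alias, REMOTE_CMD],
            stdin=spec_file,
            capture_output=True,
            text=True,
            check=False,
        )
```

Because the SQL travels inside a file rather than on the command line, the local shell never sees the backticks or quotes, which is the same property the removed heredoc provided.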
@@ -41,6 +41,8 @@
 "Bash(stat:*)",
 "Bash(bash server/scripts/*)",
 "Bash(python server/scripts/*)",
+"Bash(ssh:*)",
+"Bash(scp:*)",
 "WebFetch(domain:github.com)",
 "WebSearch"
 ],
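The interaction the commit message describes (a denied `cat` blocking the heredoc pipeline, and `ssh`/`scp` needing approval until they were allow-listed) can be illustrated with a toy model. This is a deliberately simplified sketch and not the permission engine's actual matcher: it assumes rules of the shape `Bash(<prefix>:*)` match on the leading command word, that deny wins over allow, and that anything unmatched falls back to a manual prompt. The deny entry `Bash(cat:*)` is a hypothetical spelling; the diff does not show the deny list.

```python
def rule_prefix(rule: str) -> str:
    """Extract 'ssh' from a rule shaped like 'Bash(ssh:*)' (toy parser)."""
    if not (rule.startswith("Bash(") and rule.endswith(":*)")):
        raise ValueError(f"unsupported rule shape in this sketch: {rule}")
    return rule[len("Bash("):-len(":*)")]


def matches(prefix: str, command: str) -> bool:
    """Assumed semantics: the command starts with the prefix as a whole word."""
    command = command.strip()
    return command == prefix or command.startswith(prefix + " ")


def decide(command: str, allow: list, deny: list) -> str:
    """Return 'deny', 'allow', or 'ask' under the assumptions stated above."""
    if any(matches(rule_prefix(rule), command) for rule in deny):
        return "deny"
    if any(matches(rule_prefix(rule), command) for rule in allow):
        return "allow"
    return "ask"


allow_rules = ["Bash(ssh:*)", "Bash(scp:*)"]
deny_rules = ["Bash(cat:*)"]  # hypothetical spelling of the deny entry

old_style = "cat <<'QUERY' | ssh host 'bash ~/server/scripts/remote_query.sh --stdin'"
new_style = "ssh host 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json"

print(decide(old_style, allow_rules, deny_rules))
print(decide(new_style, allow_rules, deny_rules))
print(decide(new_style, [], deny_rules))
```

Under these assumptions the heredoc form is rejected because it begins with `cat`, the redirect form passes once `ssh` is allow-listed, and the same redirect form would have prompted for approval before this change.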
@@ -329,7 +329,7 @@ def _format_output(
 # Re-execute without limit wrapper for clean Arrow export
 arrow_result = conn.execute(
 f"SELECT * FROM ({sql}) AS _rq LIMIT {max_rows}"
-).fetch_arrow_table()
+).arrow().read_all()
 
 if not output_path:
 output_path = str(Path(_load_remote_query_config()["output_dir"]) / "result.parquet")
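The replacement call in the hunk above implies that `.arrow()` returns a reader-like object whose `read_all()` materializes the full table. If the script ever has to run against a DuckDB release where `.arrow()` hands back a table directly (an assumption about older versions, not something this commit states), a small duck-typed wrapper could cover both cases. The helper name `to_arrow_table` is hypothetical.

```python
from typing import Any


def to_arrow_table(result: Any) -> Any:
    """Materialize an Arrow table from a DuckDB-style query result.

    Assumes `result.arrow()` exists. If the returned object exposes
    `read_all()` (reader-style), it is drained into a table; otherwise the
    object is assumed to already be a table and is returned unchanged.
    """
    arrow_obj = result.arrow()
    read_all = getattr(arrow_obj, "read_all", None)
    if callable(read_all):
        return read_all()
    return arrow_obj
```

In the hunk above this would be used as `arrow_result = to_arrow_table(conn.execute(...))`; the commit's direct `.arrow().read_all()` is simpler when only one DuckDB version is targeted.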