Fix remote query UX: file-based stdin, ssh permissions, deprecation

Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...`, but
   claude_settings.json has `cat` in its deny list, causing 2-3 failed
   attempts per query. Replaced with a file-based approach: the Write
   tool creates a JSON file, then `ssh ... < file` avoids the denied
   `cat`.

2. ssh/scp commands were not in the allow list, requiring manual
   approval for every remote query. Added both to the allow list.

3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
   parquet export. Replaced with .arrow().read_all().

Also added an instruction to proactively offer hybrid analysis when
remote tables are available (the agent was only using local data until
asked).
Petr 2026-03-21 18:41:43 +01:00
parent 8c6c162417
commit 84d14da611
3 changed files with 45 additions and 22 deletions


@@ -188,6 +188,12 @@ You're ready to analyze!
 Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
 These tables have `query_mode: "remote"` in `server/docs/data_description.md`.
+**IMPORTANT: When remote tables exist, proactively offer hybrid analyses that combine
+local and remote data.** For example, if the user asks for a business overview, suggest
+joining local order data with remote traffic data to show a complete picture (conversion
+funnels, revenue per visitor, etc.). Don't wait for the user to ask -- hybrid insights
+are more valuable than single-source analysis.
 ### How to recognize remote tables
 Before writing any query, read `server/docs/data_description.md`. Each table has:
@@ -219,13 +225,14 @@ You write two SQL statements:
 2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
 Can JOIN local tables with registered BQ results.
-### Command format (JSON via stdin -- ALWAYS use this)
+### Command format (JSON file via stdin)
 **IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
-backticks, quotes, and parentheses in SQL. Write a heredoc with the JSON query spec:
+backticks, quotes, and parentheses in SQL. Use the Write tool to create a JSON query
+spec file, then pipe it to SSH via stdin redirect:
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+**Step 1:** Use the Write tool to create a JSON file (e.g., `user/scripts/rq_query.json`):
+```json
 {
 "register_bq": {
 "ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -233,11 +240,15 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT ... FROM ALIAS JOIN local_table ...",
 "format": "table"
 }
-QUERY
 ```
-The `<<'QUERY'` heredoc passes SQL **literally** -- no escaping needed for backticks,
-single quotes, parentheses, or any other special characters.
+**Step 2:** Run the query via SSH with stdin redirect:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
+```
+**NEVER use `cat <<HEREDOC | ssh ...`** -- the `cat` command is blocked by permissions.
+Always write the JSON to a file first using the Write tool, then use `< file` redirect.
 **JSON fields:**
 - `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
@@ -248,8 +259,8 @@ single quotes, parentheses, or any other special characters.
 ### Example 1: Remote-only query (aggregated data)
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
@@ -257,13 +268,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
 "format": "table"
 }
-QUERY
+```
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```
 ### Example 2: JOIN local + remote
-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
@@ -271,14 +286,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
 "format": "table"
 }
-QUERY
+```
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```
 ### Example 3: Download result as Parquet for local analysis
-```bash
-# 1. Run query, save as Parquet on server
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
 "register_bq": {
 "remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -287,7 +305,12 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
 "format": "parquet",
 "output": "/tmp/remote_query/analysis.parquet"
 }
-QUERY
+```
+Then run:
+```bash
+# 1. Run query on server
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 # 2. Download to local machine
 scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
@@ -318,10 +341,8 @@ If that exceeds 100K rows, add more aggregation or tighter date filters.
 4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
 5. If the query might take > 60 seconds, use nohup pattern:
 ```bash
-# Write query to temp file, then run via nohup
-cat <<'QUERY' | ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &'
-{"register_bq": {"data": "SELECT ..."}, "sql": "SELECT ...", "format": "parquet", "output": "/tmp/remote_query/result.parquet"}
-QUERY
+# Write query spec to user/scripts/rq_query.json first, then:
+ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &' < user/scripts/rq_query.json
 ssh {ssh_alias} 'tail -5 /tmp/rq.log' # check progress
 scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
 ```
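
For reference, the file-based flow the template now prescribes (write the JSON spec to a file, then feed it to `ssh` on stdin) can also be reproduced from a script. The following is a minimal Python sketch and not part of this commit: the helper names `write_query_spec`, `build_ssh_command`, and `run_remote_query` are hypothetical, and any alias passed in stands in for `{ssh_alias}`.

```python
import json
import subprocess
from pathlib import Path


def write_query_spec(path, register_bq, sql, fmt="table"):
    """Serialize the query spec to a JSON file and return its Path."""
    spec = {"register_bq": register_bq, "sql": sql, "format": fmt}
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


def build_ssh_command(ssh_alias):
    """Argument vector for the remote call; the remote command is one string."""
    return ["ssh", ssh_alias, "bash ~/server/scripts/remote_query.sh --stdin"]


def run_remote_query(ssh_alias, spec_path):
    """Equivalent of `ssh ALIAS '...' < spec_path`: the file becomes stdin."""
    with open(spec_path, "rb") as spec_file:
        return subprocess.run(
            build_ssh_command(ssh_alias),
            stdin=spec_file,
            capture_output=True,
            text=True,
            check=False,
        )
```

Because the spec travels as file contents rather than as shell text, backticks and quotes in the SQL need no escaping, and no local `cat` is invoked.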


@@ -41,6 +41,8 @@
 "Bash(stat:*)",
 "Bash(bash server/scripts/*)",
 "Bash(python server/scripts/*)",
+"Bash(ssh:*)",
+"Bash(scp:*)",
 "WebFetch(domain:github.com)",
 "WebSearch"
 ],
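
The effect of the two new allow entries can be illustrated with a deliberately simplified model of the permission check. This Python sketch is an illustration under stated assumptions, not the real matcher: it only looks at the first word of a command, it assumes the existing deny entry for `cat` is spelled `Bash(cat:*)` (the commit message only says `cat` is in the deny list), and it assumes deny takes precedence over allow.

```python
from typing import Sequence


def rule_for(command: str) -> str:
    """Map a shell command to a `Bash(<tool>:*)` rule using its first word."""
    tool = command.split()[0]
    return f"Bash({tool}:*)"


def classify(command: str, allow: Sequence[str], deny: Sequence[str]) -> str:
    """Return 'deny', 'allow', or 'ask' (manual approval) for a command."""
    rule = rule_for(command)
    if rule in deny:  # assumed: deny wins over allow
        return "deny"
    if rule in allow:
        return "allow"
    return "ask"


ALLOW = ["Bash(ssh:*)", "Bash(scp:*)"]  # entries added by this commit
DENY = ["Bash(cat:*)"]                  # assumed spelling of the deny entry

old_style = "cat <<'QUERY' | ssh analytics 'bash ~/server/scripts/remote_query.sh --stdin'"
new_style = "ssh analytics 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json"
```

Under this model the heredoc form is rejected because it starts with `cat`, the redirect form runs without a prompt, and without the new allow entries the redirect form would fall back to manual approval.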


@@ -329,7 +329,7 @@ def _format_output(
 # Re-execute without limit wrapper for clean Arrow export
 arrow_result = conn.execute(
     f"SELECT * FROM ({sql}) AS _rq LIMIT {max_rows}"
-).fetch_arrow_table()
+).arrow().read_all()
 if not output_path:
     output_path = str(Path(_load_remote_query_config()["output_dir"]) / "result.parquet")
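
The one-line change above swaps a deprecated call for `.arrow().read_all()`, which presumes that `.arrow()` returns a reader-like object exposing `read_all()`. Whether that holds depends on the installed DuckDB version, which the commit does not state. A hedged sketch of a version-tolerant wrapper follows; `to_arrow_table` is a hypothetical helper, not part of `_format_output`, and it relies only on duck typing, so it does not import DuckDB itself.

```python
def to_arrow_table(result):
    """Return a fully materialized Arrow table from a query result.

    `result` is anything with an `.arrow()` method, e.g. the object
    returned by `conn.execute(...)`.
    """
    arrow_obj = result.arrow()
    # Reader-style return value: drain it into a single table.
    if hasattr(arrow_obj, "read_all"):
        return arrow_obj.read_all()
    # Table-style return value: already materialized, pass it through.
    return arrow_obj
```

With such a helper the call site would read `arrow_result = to_arrow_table(conn.execute(...))`, at the cost of one extra attribute check per export.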