Fix remote query UX: file-based stdin, ssh permissions, deprecation

Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but claude_settings.json had `cat` in deny list, causing 2-3 failed attempts per query. Replaced with file-based approach: Write tool creates JSON file, then `ssh ... < file` avoids the cat deny.
2. ssh/scp commands were not in the allow list, requiring manual approval for every remote query. Added both to allow list.
3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every parquet export. Replaced with .arrow().read_all().

Also added instruction for proactive hybrid analysis when remote tables are available (agent was only using local data until asked).
2026-03-21 18:41:43 +01:00 · 84d14da611
commit 84d14da611
parent 8c6c162417
3 changed files with 45 additions and 22 deletions
--- a/docs/setup/claude_md_template.txt
+++ b/docs/setup/claude_md_template.txt
@@ -188,6 +188,12 @@ You're ready to analyze!
 Some tables are too large for local Parquet sync and are queried remotely via BigQuery.
 These tables have `query_mode: "remote"` in `server/docs/data_description.md`.

+**IMPORTANT: When remote tables exist, proactively offer hybrid analyses that combine
+local and remote data.** For example, if the user asks for a business overview, suggest
+joining local order data with remote traffic data to show a complete picture (conversion
+funnels, revenue per visitor, etc.). Don't wait for the user to ask -- hybrid insights
+are more valuable than single-source analysis.
+
 ### How to recognize remote tables

 Before writing any query, read `server/docs/data_description.md`. Each table has:
@@ -219,13 +225,14 @@ You write two SQL statements:
 2. **DuckDB SQL** (`--sql "SQL"`) -- runs in DuckDB after all views (local + BQ) are ready.
   Can JOIN local tables with registered BQ results.

-### Command format (JSON via stdin -- ALWAYS use this)
+### Command format (JSON file via stdin)

 **IMPORTANT:** Always use the `--stdin` JSON mode to avoid shell escaping issues with
-backticks, quotes, and parentheses in SQL. Write a heredoc with the JSON query spec:
+backticks, quotes, and parentheses in SQL. Use the Write tool to create a JSON query
+spec file, then pipe it to SSH via stdin redirect:

-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+**Step 1:** Use the Write tool to create a JSON file (e.g., `user/scripts/rq_query.json`):
+```json
 {
  "register_bq": {
    "ALIAS": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -233,11 +240,15 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
  "sql": "SELECT ... FROM ALIAS JOIN local_table ...",
  "format": "table"
 }
-QUERY
 ```

-The `<<'QUERY'` heredoc passes SQL **literally** -- no escaping needed for backticks,
-single quotes, parentheses, or any other special characters.
+**Step 2:** Run the query via SSH with stdin redirect:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
+```
+
+**NEVER use `cat <<HEREDOC | ssh ...`** -- the `cat` command is blocked by permissions.
+Always write the JSON to a file first using the Write tool, then use `< file` redirect.

 **JSON fields:**
 - `"sql"` (required) -- DuckDB SQL query (can reference local views + registered BQ aliases)
@@ -248,8 +259,8 @@ single quotes, parentheses, or any other special characters.

 ### Example 1: Remote-only query (aggregated data)

-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
  "register_bq": {
    "agg_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) GROUP BY 1,2"
@@ -257,13 +268,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
  "sql": "SELECT * FROM agg_data ORDER BY date_col, dim_col",
  "format": "table"
 }
-QUERY
+```
+
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```

 ### Example 2: JOIN local + remote

-```bash
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
  "register_bq": {
    "remote_data": "SELECT date_col, dim_col, SUM(metric) as total FROM `project.dataset.table` WHERE date_col >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY 1,2"
@@ -271,14 +286,17 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
  "sql": "SELECT l.*, r.total FROM local_table l JOIN remote_data r ON l.date_col = r.date_col AND l.dim_col = r.dim_col ORDER BY 1,2",
  "format": "table"
 }
-QUERY
+```
+
+Then run:
+```bash
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json
 ```

 ### Example 3: Download result as Parquet for local analysis

-```bash
-# 1. Run query, save as Parquet on server
-cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
+Write to `user/scripts/rq_query.json`:
+```json
 {
  "register_bq": {
    "remote_data": "SELECT ... FROM `project.dataset.table` WHERE ... GROUP BY ..."
@@ -287,7 +305,12 @@ cat <<'QUERY' | ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin'
  "format": "parquet",
  "output": "/tmp/remote_query/analysis.parquet"
 }
-QUERY
+```
+
+Then run:
+```bash
+# 1. Run query on server
+ssh {ssh_alias} 'bash ~/server/scripts/remote_query.sh --stdin' < user/scripts/rq_query.json

 # 2. Download to local machine
 scp {ssh_alias}:/tmp/remote_query/analysis.parquet ./user/parquet/
@@ -318,10 +341,8 @@ If that exceeds 100K rows, add more aggregation or tighter date filters.
 4. **Limits**: 500K rows max per BQ sub-query, 100K rows max in final result
 5. If the query might take > 60 seconds, use nohup pattern:
   ```bash
-   # Write query to temp file, then run via nohup
-   cat <<'QUERY' | ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &'
-   {"register_bq": {"data": "SELECT ..."}, "sql": "SELECT ...", "format": "parquet", "output": "/tmp/remote_query/result.parquet"}
-   QUERY
+   # Write query spec to user/scripts/rq_query.json first, then:
+   ssh {ssh_alias} 'cat > /tmp/rq_spec.json && nohup bash ~/server/scripts/remote_query.sh --stdin < /tmp/rq_spec.json > /tmp/rq.log 2>&1 &' < user/scripts/rq_query.json
   ssh {ssh_alias} 'tail -5 /tmp/rq.log'  # check progress
   scp {ssh_alias}:/tmp/remote_query/result.parquet ./user/parquet/
   ```
--- a/docs/setup/claude_settings.json
+++ b/docs/setup/claude_settings.json
@@ -41,6 +41,8 @@
        "Bash(stat:*)",
        "Bash(bash server/scripts/*)",
        "Bash(python server/scripts/*)",
+        "Bash(ssh:*)",
+        "Bash(scp:*)",
        "WebFetch(domain:github.com)",
        "WebSearch"
      ],
--- a/src/remote_query.py
+++ b/src/remote_query.py
@@ -329,7 +329,7 @@ def _format_output(
        # Re-execute without limit wrapper for clean Arrow export
        arrow_result = conn.execute(
            f"SELECT * FROM ({sql}) AS _rq LIMIT {max_rows}"
-        ).fetch_arrow_table()
+        ).arrow().read_all()

        if not output_path:
            output_path = str(Path(_load_remote_query_config()["output_dir"]) / "result.parquet")
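
For reference, the file-based stdin flow introduced in the template (write the JSON spec to a file, then redirect it into `ssh`) could be scripted roughly as follows. This is only a sketch and not part of the commit: `write_query_spec`, `build_ssh_command`, and the alias `my-server` are hypothetical names chosen for illustration, and the actual SSH call is left commented out because it needs a reachable server.

```python
import json
import tempfile
from pathlib import Path


def write_query_spec(path: Path, register_bq: dict, sql: str, fmt: str = "table") -> Path:
    """Step 1: serialize the query spec to a JSON file (hypothetical helper)."""
    spec = {"register_bq": register_bq, "sql": sql, "format": fmt}
    path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps handles backticks, quotes, and parentheses; no shell is involved.
    path.write_text(json.dumps(spec, indent=2), encoding="utf-8")
    return path


def build_ssh_command(ssh_alias: str) -> list:
    """Step 2: argument vector for ssh; the caller attaches the spec file as stdin."""
    return ["ssh", ssh_alias, "bash ~/server/scripts/remote_query.sh --stdin"]


# Write the spec to a temporary directory so the sketch has no side effects on the repo.
spec_path = write_query_spec(
    Path(tempfile.mkdtemp()) / "rq_query.json",
    register_bq={
        "agg_data": "SELECT dim_col, SUM(metric) AS total FROM `project.dataset.table` GROUP BY 1"
    },
    sql="SELECT * FROM agg_data WHERE dim_col = 'a' ORDER BY total",
)
command = build_ssh_command("my-server")  # "my-server" is a placeholder alias

# Running it requires SSH access, so it is shown here but not executed:
# import subprocess
# with spec_path.open("rb") as spec_file:
#     subprocess.run(command, stdin=spec_file, check=True)

# Reading the file back shows the SQL survives unchanged.
round_tripped = json.loads(spec_path.read_text(encoding="utf-8"))
```

Because the SQL only ever passes through a JSON serializer and a file redirect, none of its special characters are interpreted by the local shell, which is the property the template relies on.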