From 017cf076740d547968c5cf5e6518cc75559fb8c7 Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr
Date: Sat, 11 Apr 2026 10:52:39 +0200
Subject: [PATCH] docs: add design spec for remote query (extension re-attach + two-phase BQ)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../specs/2026-04-11-remote-query-design.md | 180 ++++++++++++++++++
 1 file changed, 180 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-04-11-remote-query-design.md

diff --git a/docs/superpowers/specs/2026-04-11-remote-query-design.md b/docs/superpowers/specs/2026-04-11-remote-query-design.md
new file mode 100644
index 0000000..29af622
--- /dev/null
+++ b/docs/superpowers/specs/2026-04-11-remote-query-design.md
@@ -0,0 +1,180 @@
+# Remote Query — Design Spec
+
+**Date:** 2026-04-11
+**Status:** Approved
+**Scope:** Fix extension re-attach + two-phase remote query engine
+
+## Context
+
+BigQuery remote views created by the orchestrator don't work at query time because `get_analytics_db_readonly()` opens a fresh connection without re-loading the BigQuery extension. Additionally, the platform lacks the ability to run hybrid queries that JOIN local Parquet data with on-demand BigQuery subquery results.
+
+The `padak/tmp_oss` v1 repo has `src/remote_query.py` with a two-phase protocol. The existing `scripts/duckdb_manager.py` in this repo already has `register_bq_table()` and `_create_bq_client()` helper functions. The `table_registry` already supports `query_mode` values: `local`, `remote`, `hybrid`.
+
+**Primary user:** Claude Code agent running `da query` locally, or API consumers via `POST /api/query/hybrid`.
+
+---
+
+## Part 1: Fix Extension Re-attach
+
+### Problem
+
+`get_analytics_db_readonly()` in `src/db.py` opens analytics.duckdb in read-only mode and ATTACHes extract.duckdb files, but does NOT re-load extensions referenced in `_remote_attach` tables. BigQuery remote views fail with "Catalog Error: bq not found".
+
+### Solution
+
+After ATTACHing extract.duckdb files in `get_analytics_db_readonly()`, scan each for a `_remote_attach` table. For each record:
+
+1. `LOAD {extension}` — loads pre-installed extension from disk (no INSTALL needed in read-only mode; orchestrator pre-installs during rebuild)
+2. `ATTACH '{url}' AS {alias} (TYPE {extension}, READ_ONLY)` — re-attaches the remote source
+
+If LOAD fails (extension not installed), log a warning and continue — local views still work.
+
+### Changes
+
+**File:** `src/db.py` — `get_analytics_db_readonly()` function
+
+Add ~20 lines after the existing extract.duckdb ATTACH loop. Read `_remote_attach` table from each attached extract DB, collect unique (alias, extension, url, token_env) tuples, and re-attach.
+
+Pattern follows `src/orchestrator.py:_attach_remote_extensions()` but simplified for read-only context (no INSTALL, just LOAD + ATTACH).
+
+---
+
+## Part 2: Two-Phase Remote Query Engine
+
+### Architecture
+
+New module `src/remote_query.py` with a `RemoteQueryEngine` class:
+
+```python
+class RemoteQueryEngine:
+    def __init__(self, conn: duckdb.DuckDBPyConnection):
+        """Takes an existing DuckDB connection (analytics.duckdb with local views)."""
+
+    def register_bq(self, alias: str, bq_sql: str) -> dict:
+        """Execute BQ subquery, register result as in-memory DuckDB view.
+        Returns {alias, rows, columns, memory_mb}.
+        Raises RemoteQueryError on safety limit violation."""
+
+    def execute(self, sql: str) -> dict:
+        """Execute final DuckDB query against local + registered BQ views.
+        Returns {columns: [...], rows: [...], row_count: int, truncated: bool}."""
+```
+
+### Two-Phase Flow
+
+1. **Phase 1 — BQ Registration:** For each `register_bq(alias, bq_sql)` call:
+   - COUNT(*) pre-check via Python BQ client → reject if >max_bq_rows
+   - Memory estimate: ~50 bytes/cell × rows × cols → reject if >max_memory_mb
+   - Execute BQ query → `job.to_arrow()` → `conn.register(alias, arrow_table)`
+   - Uses `scripts/duckdb_manager.py:_create_bq_client()` for client creation and `register_bq_table()` logic (reuse, not reimplement)
+
+2. **Phase 2 — DuckDB Query:** Execute final SQL against all views (local Parquet + registered BQ Arrow tables). Apply max_result_rows limit.
+
+### Safety Limits
+
+Configurable in `config/instance.yaml` under `remote_query:`:
+
+```yaml
+remote_query:
+  max_bq_rows: 500000       # max rows from a single BQ subquery
+  max_memory_mb: 2048       # max estimated memory for BQ result
+  max_result_rows: 100000   # max rows in final result
+  timeout_seconds: 300      # BQ query timeout
+```
+
+Defaults are hardcoded in `RemoteQueryEngine` and overridden by instance config.
+
+### Error Handling
+
+Custom `RemoteQueryError` exception with structured error:
+
+```python
+class RemoteQueryError(Exception):
+    def __init__(self, message: str, error_type: str, details: dict = None):
+        """error_type is one of: "row_limit", "memory_limit", "bq_error", "query_error", "timeout"."""
+```
+
+### CLI: `da query` Extension
+
+Extend existing `cli/commands/query.py`:
+
+```
+da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
+  --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
+```
+
+- Multiple `--register-bq` flags allowed (one per BQ alias)
+- Format: `"alias=BQ_SQL"` (split on first `=`)
+- `--stdin` mode: reads JSON from stdin for complex SQL:
+  ```json
+  {"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}
+  ```
+- Output formats: `table` (default), `csv`, `json`
+
+### API: `POST /api/query/hybrid`
+
+```
+POST /api/query/hybrid
+Authorization: Bearer
+
+{
+  "register_bq": {
+    "traffic": "SELECT date, SUM(views) AS views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
+  },
+  "sql": "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date",
+  "format": "json"
+}
+```
+
+**Response:**
+```json
+{
+  "columns": ["order_id", "date", "views"],
+  "rows": [...],
+  "row_count": 1234,
+  "truncated": false,
+  "bq_stats": {
+    "traffic": {"rows": 365, "columns": 2, "memory_mb": 0.03}
+  }
+}
+```
+
+**Auth:** `require_admin` — BQ queries cost money, only admins can trigger them.
+
+**Validation:** `register_bq` SQL strings are validated as SELECT-only (no INSERT/UPDATE/DELETE/DROP).
+
+---
+
+## Implementation Summary
+
+### New Files
+
+| File | Purpose |
+|---|---|
+| `src/remote_query.py` | `RemoteQueryEngine` class + `RemoteQueryError` |
+| `app/api/query_hybrid.py` | `POST /api/query/hybrid` endpoint |
+| `tests/test_remote_query.py` | Engine unit tests (mocked BQ client) |
+
+### Modified Files
+
+| File | Changes |
+|---|---|
+| `src/db.py` | `get_analytics_db_readonly()` — add extension re-attach from `_remote_attach` |
+| `cli/commands/query.py` | Add `--register-bq` and `--stdin` flags |
+| `app/main.py` | Register hybrid query router |
+| `CLAUDE.md` | Document hybrid query usage |
+
+### Implementation Order
+
+1. Fix extension re-attach in `src/db.py` (unblocks remote views)
+2. `RemoteQueryEngine` in `src/remote_query.py` (core logic)
+3. CLI extension `--register-bq`
+4. API endpoint `POST /api/query/hybrid`
+5. CLAUDE.md update + integration tests
+
+### Test Coverage
+
+- `tests/test_remote_query.py` — engine tests with mocked BQ client (safety limits, registration, error handling)
+- `tests/test_db.py` — extension re-attach test (mock _remote_attach table)
+- `tests/test_api.py` — hybrid query endpoint (auth, validation)
+- `tests/test_cli.py` — `--register-bq` flag parsing