docs: add design spec for remote query (extension re-attach + two-phase BQ)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
fbad3f5538
commit
017cf07674
1 changed files with 180 additions and 0 deletions
180
docs/superpowers/specs/2026-04-11-remote-query-design.md
Normal file
180
docs/superpowers/specs/2026-04-11-remote-query-design.md
Normal file
|
|
@ -0,0 +1,180 @@
|
||||||
|
# Remote Query — Design Spec
|
||||||
|
|
||||||
|
**Date:** 2026-04-11
|
||||||
|
**Status:** Approved
|
||||||
|
**Scope:** Fix extension re-attach + two-phase remote query engine
|
||||||
|
|
||||||
|
## Context
|
||||||
|
|
||||||
|
BigQuery remote views created by the orchestrator don't work at query time because `get_analytics_db_readonly()` opens a fresh connection without re-loading the BigQuery extension. Additionally, the platform lacks the ability to run hybrid queries that JOIN local Parquet data with on-demand BigQuery subquery results.
|
||||||
|
|
||||||
|
The `padak/tmp_oss` v1 repo has `src/remote_query.py` with a two-phase protocol. The existing `scripts/duckdb_manager.py` in this repo already has `register_bq_table()` and `_create_bq_client()` helper functions. The `table_registry` already supports `query_mode` values: `local`, `remote`, `hybrid`.
|
||||||
|
|
||||||
|
**Primary user:** Claude Code agent running `da query` locally, or API consumers via `POST /api/query/hybrid`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Part 1: Fix Extension Re-attach
|
||||||
|
|
||||||
|
### Problem
|
||||||
|
|
||||||
|
`get_analytics_db_readonly()` in `src/db.py` opens analytics.duckdb in read-only mode and ATTACHes extract.duckdb files, but does NOT re-load extensions referenced in `_remote_attach` tables. BigQuery remote views fail with "Catalog Error: bq not found".
|
||||||
|
|
||||||
|
### Solution
|
||||||
|
|
||||||
|
After ATTACHing extract.duckdb files in `get_analytics_db_readonly()`, scan each for a `_remote_attach` table. For each record:
|
||||||
|
|
||||||
|
1. `LOAD {extension}` — loads pre-installed extension from disk (no INSTALL needed in read-only mode; orchestrator pre-installs during rebuild)
|
||||||
|
2. `ATTACH '{url}' AS {alias} (TYPE {extension}, READ_ONLY)` — re-attaches the remote source
|
||||||
|
|
||||||
|
If LOAD fails (extension not installed), log a warning and continue — local views still work.
|
||||||
|
|
||||||
|
### Changes
|
||||||
|
|
||||||
|
**File:** `src/db.py` — `get_analytics_db_readonly()` function
|
||||||
|
|
||||||
|
Add ~20 lines after the existing extract.duckdb ATTACH loop. Read `_remote_attach` table from each attached extract DB, collect unique (alias, extension, url, token_env) tuples, and re-attach.
|
||||||
|
|
||||||
|
Pattern follows `src/orchestrator.py:_attach_remote_extensions()` but simplified for read-only context (no INSTALL, just LOAD + ATTACH).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Part 2: Two-Phase Remote Query Engine
|
||||||
|
|
||||||
|
### Architecture
|
||||||
|
|
||||||
|
New module `src/remote_query.py` with a `RemoteQueryEngine` class:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class RemoteQueryEngine:
|
||||||
|
def __init__(self, conn: duckdb.DuckDBPyConnection):
|
||||||
|
"""Takes an existing DuckDB connection (analytics.duckdb with local views)."""
|
||||||
|
|
||||||
|
def register_bq(self, alias: str, bq_sql: str) -> dict:
|
||||||
|
"""Execute BQ subquery, register result as in-memory DuckDB view.
|
||||||
|
Returns {alias, rows, columns, memory_mb}.
|
||||||
|
Raises RemoteQueryError on safety limit violation."""
|
||||||
|
|
||||||
|
def execute(self, sql: str) -> dict:
|
||||||
|
"""Execute final DuckDB query against local + registered BQ views.
|
||||||
|
Returns {columns: [...], rows: [...], row_count: int, truncated: bool}."""
|
||||||
|
```
|
||||||
|
|
||||||
|
### Two-Phase Flow
|
||||||
|
|
||||||
|
1. **Phase 1 — BQ Registration:** For each `register_bq(alias, bq_sql)` call:
|
||||||
|
- COUNT(*) pre-check via Python BQ client → reject if >max_bq_rows
|
||||||
|
- Memory estimate: ~50 bytes/cell × rows × cols → reject if >max_memory_mb
|
||||||
|
- Execute BQ query → `job.to_arrow()` → `conn.register(alias, arrow_table)`
|
||||||
|
- Uses `scripts/duckdb_manager.py:_create_bq_client()` for client creation and `register_bq_table()` logic (reuse, not reimplement)
|
||||||
|
|
||||||
|
2. **Phase 2 — DuckDB Query:** Execute final SQL against all views (local Parquet + registered BQ Arrow tables). Apply max_result_rows limit.
|
||||||
|
|
||||||
|
### Safety Limits
|
||||||
|
|
||||||
|
Configurable in `config/instance.yaml` under `remote_query:`:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
remote_query:
|
||||||
|
max_bq_rows: 500000 # max rows from a single BQ subquery
|
||||||
|
max_memory_mb: 2048 # max estimated memory for BQ result
|
||||||
|
max_result_rows: 100000 # max rows in final result
|
||||||
|
timeout_seconds: 300 # BQ query timeout
|
||||||
|
```
|
||||||
|
|
||||||
|
Defaults are hardcoded in `RemoteQueryEngine` and overridden by instance config.
|
||||||
|
|
||||||
|
### Error Handling
|
||||||
|
|
||||||
|
Custom `RemoteQueryError` exception with structured error:
|
||||||
|
|
||||||
|
```python
|
||||||
|
class RemoteQueryError(Exception):
|
||||||
|
def __init__(self, message: str, error_type: str, details: dict = None):
|
||||||
|
# error_type: "row_limit", "memory_limit", "bq_error", "query_error", "timeout"
|
||||||
|
```
|
||||||
|
|
||||||
|
### CLI: `da query` Extension
|
||||||
|
|
||||||
|
Extend existing `cli/commands/query.py`:
|
||||||
|
|
||||||
|
```
|
||||||
|
da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
|
||||||
|
--register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
|
||||||
|
```
|
||||||
|
|
||||||
|
- Multiple `--register-bq` flags allowed (one per BQ alias)
|
||||||
|
- Format: `"alias=BQ_SQL"` (split on first `=`)
|
||||||
|
- `--stdin` mode: reads JSON from stdin for complex SQL:
|
||||||
|
```json
|
||||||
|
{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}
|
||||||
|
```
|
||||||
|
- Output formats: `table` (default), `csv`, `json`
|
||||||
|
|
||||||
|
### API: `POST /api/query/hybrid`
|
||||||
|
|
||||||
|
```
|
||||||
|
POST /api/query/hybrid
|
||||||
|
Authorization: Bearer <admin_token>
|
||||||
|
|
||||||
|
{
|
||||||
|
"register_bq": {
|
||||||
|
"traffic": "SELECT date, SUM(views) FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
|
||||||
|
},
|
||||||
|
"sql": "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date",
|
||||||
|
"format": "json"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Response:**
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"columns": ["order_id", "date", "views"],
|
||||||
|
"rows": [...],
|
||||||
|
"row_count": 1234,
|
||||||
|
"truncated": false,
|
||||||
|
"bq_stats": {
|
||||||
|
"traffic": {"rows": 365, "columns": 2, "memory_mb": 0.03}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**Auth:** `require_admin` — BQ queries cost money, only admins can trigger them.
|
||||||
|
|
||||||
|
**Validation:** `register_bq` SQL strings are validated as SELECT-only (no INSERT/UPDATE/DELETE/DROP).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation Summary
|
||||||
|
|
||||||
|
### New Files
|
||||||
|
|
||||||
|
| File | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `src/remote_query.py` | `RemoteQueryEngine` class + `RemoteQueryError` |
|
||||||
|
| `app/api/query_hybrid.py` | `POST /api/query/hybrid` endpoint |
|
||||||
|
| `tests/test_remote_query.py` | Engine unit tests (mocked BQ client) |
|
||||||
|
|
||||||
|
### Modified Files
|
||||||
|
|
||||||
|
| File | Changes |
|
||||||
|
|---|---|
|
||||||
|
| `src/db.py` | `get_analytics_db_readonly()` — add extension re-attach from `_remote_attach` |
|
||||||
|
| `cli/commands/query.py` | Add `--register-bq` and `--stdin` flags |
|
||||||
|
| `app/main.py` | Register hybrid query router |
|
||||||
|
| `CLAUDE.md` | Document hybrid query usage |
|
||||||
|
|
||||||
|
### Implementation Order
|
||||||
|
|
||||||
|
1. Fix extension re-attach in `src/db.py` (unblocks remote views)
|
||||||
|
2. `RemoteQueryEngine` in `src/remote_query.py` (core logic)
|
||||||
|
3. CLI extension `--register-bq`
|
||||||
|
4. API endpoint `POST /api/query/hybrid`
|
||||||
|
5. CLAUDE.md update + integration tests
|
||||||
|
|
||||||
|
### Test Coverage
|
||||||
|
|
||||||
|
- `tests/test_remote_query.py` — engine tests with mocked BQ client (safety limits, registration, error handling)
|
||||||
|
- `tests/test_db.py` — extension re-attach test (mock _remote_attach table)
|
||||||
|
- `tests/test_api.py` — hybrid query endpoint (auth, validation)
|
||||||
|
- `tests/test_cli.py` — `--register-bq` flag parsing
|
||||||
Loading…
Reference in a new issue