ZdenekSrotyr eb68e6292d docs: fix remote query spec after code review

- Address read-only LOAD uncertainty with verification step + workaround
- Clarify register_bq wraps BQ logic (not delegates to register_bq_table)
- Use existing max_bq_registration_rows config key name
- Apply SQL blocklist to both register_bq and final sql
- Define connection lifecycle (caller owns, try/finally)
- Fix CLI argument handling (optional positional + --sql flag)
- Document concurrency safety (Unix inode semantics)
- Handle missing google-cloud-bigquery gracefully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-11 10:58:25 +02:00

9.2 KiB

Raw Blame History

Remote Query — Design Spec

Date: 2026-04-11 Status: Approved Scope: Fix extension re-attach + two-phase remote query engine

Context

BigQuery remote views created by the orchestrator don't work at query time because get_analytics_db_readonly() opens a fresh connection without re-loading the BigQuery extension. Additionally, the platform lacks the ability to run hybrid queries that JOIN local Parquet data with on-demand BigQuery subquery results.

The padak/tmp_oss v1 repo has src/remote_query.py with a two-phase protocol. The existing scripts/duckdb_manager.py in this repo already has register_bq_table() and _create_bq_client() helper functions. The table_registry already supports query_mode values: local, remote, hybrid.

Primary user: Claude Code agent running da query locally, or API consumers via POST /api/query/hybrid.

Part 1: Fix Extension Re-attach

Problem

get_analytics_db_readonly() in src/db.py opens analytics.duckdb in read-only mode and ATTACHes extract.duckdb files, but does NOT re-load extensions referenced in _remote_attach tables. BigQuery remote views fail with "Catalog Error: bq not found".

Solution

After ATTACHing extract.duckdb files in get_analytics_db_readonly(), scan each for a _remote_attach table. For each record, re-load the extension and re-attach the remote source.

Important: DuckDB read-only LOAD behavior. The read_only=True flag on duckdb.connect() blocks writes to the DB file, but LOAD writes to the extension cache in ~/.duckdb/extensions/ (separate from the DB file). This should work, but MUST be empirically verified as the first implementation step. If LOAD fails in read-only mode, the workaround is to open the analytics DB WITHOUT read_only=True but still use read-only SQL patterns (no INSERT/UPDATE/DELETE), or to call LOAD on a separate in-memory connection first (DuckDB extension cache is process-wide).

Steps for each _remote_attach record:

LOAD {extension} — loads pre-installed extension from disk
Read token from os.environ[token_env] if token_env is non-empty
ATTACH '{url}' AS {alias} (TYPE {extension}, READ_ONLY) — with TOKEN if needed

If LOAD or ATTACH fails, log a warning and continue — local views still work.

Changes

File: src/db.py — get_analytics_db_readonly() function

Add ~25 lines after the existing extract.duckdb ATTACH loop. Read _remote_attach table from each attached extract DB, collect unique (alias, extension, url, token_env) tuples, and re-attach.

Pattern follows src/orchestrator.py:_attach_remote_extensions() but simplified (no INSTALL — orchestrator pre-installs during rebuild).

Concurrency note: If the orchestrator runs _atomic_swap_db() while a read-only connection is open, the existing connection holds a file descriptor to the old inode (Unix semantics). This is safe — the old data remains accessible until the connection is closed.

Part 2: Two-Phase Remote Query Engine

Architecture

New module src/remote_query.py with a RemoteQueryEngine class:

class RemoteQueryEngine:
    def __init__(self, conn: duckdb.DuckDBPyConnection):
        """Takes an existing DuckDB connection (analytics.duckdb with local views)."""

    def register_bq(self, alias: str, bq_sql: str) -> dict:
        """Execute BQ subquery, register result as in-memory DuckDB view.
        Returns {alias, rows, columns, memory_mb}.
        Raises RemoteQueryError on safety limit violation."""

    def execute(self, sql: str) -> dict:
        """Execute final DuckDB query against local + registered BQ views.
        Returns {columns: [...], rows: [...], row_count: int, truncated: bool}."""

Two-Phase Flow

Phase 1 — BQ Registration: For each register_bq(alias, bq_sql) call:
- COUNT(*) pre-check via Python BQ client → reject if >max_bq_registration_rows
- Memory estimate: ~50 bytes/cell × rows × cols → reject if >max_memory_mb. Note: this is approximate. After query completes, use arrow_table.nbytes for accurate reporting in bq_stats.
- Execute BQ query → job.to_arrow() → conn.register(alias, arrow_table)
- Uses scripts/duckdb_manager.py:_create_bq_client() for BQ client creation (reuse)
- Does NOT delegate to register_bq_table() directly — RemoteQueryEngine.register_bq() wraps BQ query execution with its own pre-check logic (COUNT, memory estimate), then calls conn.register(alias, arrow_table). The existing register_bq_table() has no pre-check capability and would need signature changes to add one. Wrapping is cleaner than modifying shared code.
- Gracefully handle missing google-cloud-bigquery package: catch ImportError and raise RemoteQueryError(error_type="bq_error", message="google-cloud-bigquery not installed")
Phase 2 — DuckDB Query: Execute final SQL against all views (local Parquet + registered BQ Arrow tables). Apply max_result_rows limit.

Safety Limits

Configurable in config/instance.yaml under remote_query::

remote_query:
  max_bq_registration_rows: 500000   # max rows from a single BQ subquery (matches existing instance.yaml.example key)
  max_memory_mb: 2048                # max estimated memory for BQ result
  max_result_rows: 100000            # max rows in final result
  timeout_seconds: 300               # BQ query timeout

Note: max_bq_registration_rows matches the key already documented in config/instance.yaml.example.

Defaults are hardcoded in RemoteQueryEngine and overridden by instance config.

Error Handling

Custom RemoteQueryError exception with structured error:

class RemoteQueryError(Exception):
    def __init__(self, message: str, error_type: str, details: dict = None):
        # error_type: "row_limit", "memory_limit", "bq_error", "query_error", "timeout"

CLI: `da query` Extension

Extend existing cli/commands/query.py:

da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
         --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"

Multiple --register-bq flags allowed (one per BQ alias)
Format: "alias=BQ_SQL" (split on first =)

--stdin mode: reads JSON from stdin for complex SQL:

{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}

Output formats: table (default), csv, json

CLI argument handling: The existing query_command has sql as a required positional argument. When --register-bq is used, sql should be provided via --sql flag instead (named option, not positional). When --stdin is used, both sql and register_bq come from stdin JSON. Make sql an optional positional (typer.Argument(None)) and validate that exactly one of (positional sql, --sql flag, --stdin) is provided.

API: `POST /api/query/hybrid`

POST /api/query/hybrid
Authorization: Bearer <admin_token>

{
  "register_bq": {
    "traffic": "SELECT date, SUM(views) FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
  },
  "sql": "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date",
  "format": "json"
}

Response:

{
  "columns": ["order_id", "date", "views"],
  "rows": [...],
  "row_count": 1234,
  "truncated": false,
  "bq_stats": {
    "traffic": {"rows": 365, "columns": 2, "memory_mb": 0.03}
  }
}

Auth: require_admin — BQ queries cost money, only admins can trigger them.

Validation — both register_bq SQL and final sql:

Apply the same SQL blocklist from app/api/query.py (blocks LOAD, ATTACH, INSTALL, read_parquet with paths, path traversal patterns, etc.)
register_bq SQL additionally validated as SELECT-only (no INSERT/UPDATE/DELETE/DROP)
Reuse the existing _validate_sql() helper from app/api/query.py (extract to shared utility if needed)

Connection lifecycle: The API endpoint owns the connection. Pattern:

analytics = get_analytics_db_readonly()
try:
    engine = RemoteQueryEngine(analytics)
    # ... register_bq + execute
finally:
    analytics.close()

Implementation Summary

New Files

File	Purpose
`src/remote_query.py`	`RemoteQueryEngine` class + `RemoteQueryError`
`app/api/query_hybrid.py`	`POST /api/query/hybrid` endpoint
`tests/test_remote_query.py`	Engine unit tests (mocked BQ client)

Modified Files

File	Changes
`src/db.py`	`get_analytics_db_readonly()` — add extension re-attach from `_remote_attach`
`cli/commands/query.py`	Add `--register-bq` and `--stdin` flags
`app/main.py`	Register hybrid query router
`CLAUDE.md`	Document hybrid query usage

Implementation Order

Fix extension re-attach in src/db.py (unblocks remote views)
RemoteQueryEngine in src/remote_query.py (core logic)
CLI extension --register-bq
API endpoint POST /api/query/hybrid
CLAUDE.md update + integration tests

Test Coverage

tests/test_remote_query.py — engine tests with mocked BQ client (safety limits, registration, error handling)
tests/test_db.py — extension re-attach test (mock _remote_attach table)
tests/test_api.py — hybrid query endpoint (auth, validation)
tests/test_cli.py — --register-bq flag parsing

9.2 KiB Raw Blame History Unescape Escape