923 lines
31 KiB
Markdown
923 lines
31 KiB
Markdown
# Remote Query Implementation Plan
|
|
|
|
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
|
|
|
**Goal:** Fix BigQuery extension re-attach so remote views work, then add a two-phase query engine that JOINs local Parquet data with on-demand BigQuery subquery results.
|
|
|
|
**Architecture:** Part 1 patches `get_analytics_db_readonly()` to re-load extensions from `_remote_attach` tables. Part 2 adds `RemoteQueryEngine` that wraps BQ client with safety limits (COUNT pre-check, memory estimation), registers Arrow results in DuckDB, then executes the final SQL. Exposed via `da query --register-bq` CLI and `POST /api/query/hybrid` API.
|
|
|
|
**Tech Stack:** DuckDB, google-cloud-bigquery, PyArrow, FastAPI, Typer
|
|
|
|
**Spec:** `docs/superpowers/specs/2026-04-11-remote-query-design.md`
|
|
|
|
---
|
|
|
|
### Task 1: Fix Extension Re-attach in `get_analytics_db_readonly()`
|
|
|
|
**Files:**
|
|
- Modify: `src/db.py:253-282` (get_analytics_db_readonly)
|
|
- Test: `tests/test_db.py`
|
|
|
|
- [ ] **Step 1: Write failing test**
|
|
|
|
Add to `tests/test_db.py`:
|
|
|
|
```python
|
|
class TestExtensionReattach:
|
|
def test_reads_remote_attach_table(self, tmp_path, monkeypatch):
|
|
"""Verify get_analytics_db_readonly() attempts to load extensions from _remote_attach."""
|
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
|
import duckdb
|
|
|
|
# Create analytics DB
|
|
analytics_dir = tmp_path / "analytics"
|
|
analytics_dir.mkdir()
|
|
conn = duckdb.connect(str(analytics_dir / "server.duckdb"))
|
|
conn.close()
|
|
|
|
# Create an extract.duckdb with a _remote_attach table
|
|
ext_dir = tmp_path / "extracts" / "testbq"
|
|
ext_dir.mkdir(parents=True)
|
|
ext_conn = duckdb.connect(str(ext_dir / "extract.duckdb"))
|
|
ext_conn.execute("""
|
|
CREATE TABLE _remote_attach (
|
|
alias VARCHAR, extension VARCHAR, url VARCHAR, token_env VARCHAR
|
|
)
|
|
""")
|
|
ext_conn.execute(
|
|
"INSERT INTO _remote_attach VALUES ('bq', 'bigquery', 'project=test', '')"
|
|
)
|
|
ext_conn.close()
|
|
|
|
from src.db import get_analytics_db_readonly
|
|
# This won't actually load bigquery (not installed in test env),
|
|
# but should not crash — just log a warning
|
|
analytics = get_analytics_db_readonly()
|
|
try:
|
|
# Connection should be usable even if extension load failed
|
|
result = analytics.execute("SELECT 1").fetchone()
|
|
assert result[0] == 1
|
|
finally:
|
|
analytics.close()
|
|
|
|
def test_skips_missing_remote_attach(self, tmp_path, monkeypatch):
|
|
"""Extract without _remote_attach should not cause errors."""
|
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
|
import duckdb
|
|
|
|
analytics_dir = tmp_path / "analytics"
|
|
analytics_dir.mkdir()
|
|
conn = duckdb.connect(str(analytics_dir / "server.duckdb"))
|
|
conn.close()
|
|
|
|
ext_dir = tmp_path / "extracts" / "plain"
|
|
ext_dir.mkdir(parents=True)
|
|
ext_conn = duckdb.connect(str(ext_dir / "extract.duckdb"))
|
|
ext_conn.execute("CREATE TABLE _meta (name VARCHAR)")
|
|
ext_conn.close()
|
|
|
|
from src.db import get_analytics_db_readonly
|
|
analytics = get_analytics_db_readonly()
|
|
try:
|
|
result = analytics.execute("SELECT 1").fetchone()
|
|
assert result[0] == 1
|
|
finally:
|
|
analytics.close()
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify it fails (or passes — these are resilience tests)**
|
|
|
|
Run: `pytest tests/test_db.py::TestExtensionReattach -v`
|
|
Expected: Both tests likely PASS already (graceful failures). That's fine — the real value is ensuring the re-attach code doesn't break anything.
|
|
|
|
- [ ] **Step 3: Implement extension re-attach**
|
|
|
|
In `src/db.py`, modify `get_analytics_db_readonly()`. After the existing ATTACH loop (line ~279), before the `return conn` (line ~282), add:
|
|
|
|
```python
|
|
# Re-attach remote extensions (BigQuery, Keboola, etc.)
|
|
if extracts_dir.exists():
|
|
_reattach_remote_extensions(conn, extracts_dir)
|
|
```
|
|
|
|
Add this helper function before `get_analytics_db_readonly()`:
|
|
|
|
```python
|
|
def _reattach_remote_extensions(
|
|
conn: duckdb.DuckDBPyConnection, extracts_dir: Path
|
|
) -> None:
|
|
"""Re-load extensions from _remote_attach tables in extract.duckdb files."""
|
|
already_attached = set()
|
|
try:
|
|
already_attached = {
|
|
r[0] for r in conn.execute(
|
|
"SELECT database_name FROM duckdb_databases()"
|
|
).fetchall()
|
|
}
|
|
except Exception:
|
|
pass
|
|
|
|
for ext_dir in sorted(extracts_dir.iterdir()):
|
|
if not ext_dir.is_dir() or not _SAFE_IDENTIFIER.match(ext_dir.name):
|
|
continue
|
|
# Check if this extract has a _remote_attach table
|
|
try:
|
|
has_table = conn.execute(
|
|
f"SELECT table_name FROM information_schema.tables "
|
|
f"WHERE table_schema='{ext_dir.name}' AND table_name='_remote_attach'"
|
|
).fetchall()
|
|
if not has_table:
|
|
continue
|
|
except Exception:
|
|
continue
|
|
|
|
try:
|
|
rows = conn.execute(
|
|
f"SELECT alias, extension, url, token_env FROM {ext_dir.name}._remote_attach"
|
|
).fetchall()
|
|
except Exception:
|
|
continue
|
|
|
|
for alias, extension, url, token_env in rows:
|
|
if alias in already_attached:
|
|
continue
|
|
if not _SAFE_IDENTIFIER.match(alias) or not _SAFE_IDENTIFIER.match(extension):
|
|
continue
|
|
|
|
token = os.environ.get(token_env, "") if token_env else ""
|
|
|
|
try:
|
|
conn.execute(f"LOAD {extension};")
|
|
if token:
|
|
escaped_token = token.replace("'", "''")
|
|
conn.execute(
|
|
f"ATTACH '{url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')"
|
|
)
|
|
else:
|
|
conn.execute(
|
|
f"ATTACH '{url}' AS {alias} (TYPE {extension}, READ_ONLY)"
|
|
)
|
|
already_attached.add(alias)
|
|
logger.info("Re-attached remote source %s via %s", alias, extension)
|
|
except Exception as e:
|
|
logger.debug("Could not re-attach %s: %s", alias, e)
|
|
```
|
|
|
|
- [ ] **Step 4: Run tests**
|
|
|
|
Run: `pytest tests/test_db.py -v`
|
|
Expected: ALL PASS
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add src/db.py tests/test_db.py
|
|
git commit -m "fix: re-attach remote extensions in get_analytics_db_readonly()"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 2: RemoteQueryEngine Core
|
|
|
|
**Files:**
|
|
- Create: `src/remote_query.py`
|
|
- Test: `tests/test_remote_query.py`
|
|
|
|
- [ ] **Step 1: Write failing tests**
|
|
|
|
Create `tests/test_remote_query.py`:
|
|
|
|
```python
|
|
"""Tests for RemoteQueryEngine."""
|
|
|
|
import json
|
|
import os
|
|
from pathlib import Path
|
|
from unittest.mock import patch, MagicMock
|
|
|
|
import duckdb
|
|
import pytest
|
|
|
|
|
|
@pytest.fixture
|
|
def analytics_conn(tmp_path):
|
|
"""DuckDB connection with a sample local view."""
|
|
conn = duckdb.connect()
|
|
conn.execute("CREATE TABLE orders (id INT, date DATE, amount DECIMAL(10,2))")
|
|
conn.execute("INSERT INTO orders VALUES (1, '2026-01-01', 100.0), (2, '2026-01-15', 200.0)")
|
|
yield conn
|
|
conn.close()
|
|
|
|
|
|
def _mock_bq_arrow_table():
|
|
"""Create a mock Arrow table for BQ results."""
|
|
import pyarrow as pa
|
|
return pa.table({
|
|
"date": ["2026-01-01", "2026-01-15"],
|
|
"pageviews": [1000, 2000],
|
|
})
|
|
|
|
|
|
class TestRemoteQueryEngineRegister:
|
|
def test_register_bq_success(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine
|
|
|
|
mock_arrow = _mock_bq_arrow_table()
|
|
mock_job = MagicMock()
|
|
mock_job.to_arrow.return_value = mock_arrow
|
|
mock_client = MagicMock()
|
|
mock_client.query.return_value = mock_job
|
|
# COUNT pre-check
|
|
mock_count_job = MagicMock()
|
|
mock_count_result = MagicMock()
|
|
mock_count_result.fetchone.return_value = (2,)
|
|
mock_count_job.result.return_value = mock_count_result
|
|
mock_client.query.side_effect = [mock_count_job, mock_job]
|
|
|
|
engine = RemoteQueryEngine(analytics_conn, _bq_client_factory=lambda: mock_client)
|
|
stats = engine.register_bq("traffic", "SELECT date, pageviews FROM dataset.web")
|
|
|
|
assert stats["alias"] == "traffic"
|
|
assert stats["rows"] == 2
|
|
# Verify the view is usable
|
|
result = analytics_conn.execute("SELECT * FROM traffic").fetchall()
|
|
assert len(result) == 2
|
|
|
|
def test_register_bq_row_limit_exceeded(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine, RemoteQueryError
|
|
|
|
mock_client = MagicMock()
|
|
mock_count_job = MagicMock()
|
|
mock_count_result = MagicMock()
|
|
mock_count_result.fetchone.return_value = (999999,)
|
|
mock_count_job.result.return_value = mock_count_result
|
|
mock_client.query.return_value = mock_count_job
|
|
|
|
engine = RemoteQueryEngine(
|
|
analytics_conn,
|
|
_bq_client_factory=lambda: mock_client,
|
|
max_bq_registration_rows=1000,
|
|
)
|
|
with pytest.raises(RemoteQueryError, match="row_limit"):
|
|
engine.register_bq("big", "SELECT * FROM huge_table")
|
|
|
|
def test_register_bq_missing_package(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine, RemoteQueryError
|
|
|
|
engine = RemoteQueryEngine(
|
|
analytics_conn,
|
|
_bq_client_factory=None, # Will try real import
|
|
)
|
|
with patch.dict("sys.modules", {"google.cloud.bigquery": None}):
|
|
with pytest.raises(RemoteQueryError, match="bq_error"):
|
|
engine.register_bq("x", "SELECT 1")
|
|
|
|
|
|
class TestRemoteQueryEngineExecute:
|
|
def test_execute_local_only(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine
|
|
engine = RemoteQueryEngine(analytics_conn)
|
|
result = engine.execute("SELECT id, amount FROM orders ORDER BY id")
|
|
assert result["columns"] == ["id", "amount"]
|
|
assert len(result["rows"]) == 2
|
|
assert result["row_count"] == 2
|
|
assert result["truncated"] is False
|
|
|
|
def test_execute_with_registered_bq(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine
|
|
import pyarrow as pa
|
|
|
|
# Manually register an Arrow table (simulating BQ result)
|
|
traffic = pa.table({"date": ["2026-01-01", "2026-01-15"], "views": [100, 200]})
|
|
analytics_conn.register("traffic", traffic)
|
|
|
|
engine = RemoteQueryEngine(analytics_conn)
|
|
result = engine.execute(
|
|
"SELECT o.id, t.views FROM orders o JOIN traffic t ON CAST(o.date AS VARCHAR) = t.date ORDER BY o.id"
|
|
)
|
|
assert len(result["rows"]) == 2
|
|
assert result["columns"] == ["id", "views"]
|
|
|
|
def test_execute_respects_max_result_rows(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine
|
|
engine = RemoteQueryEngine(analytics_conn, max_result_rows=1)
|
|
result = engine.execute("SELECT * FROM orders")
|
|
assert len(result["rows"]) == 1
|
|
assert result["truncated"] is True
|
|
|
|
def test_execute_invalid_sql(self, analytics_conn):
|
|
from src.remote_query import RemoteQueryEngine, RemoteQueryError
|
|
engine = RemoteQueryEngine(analytics_conn)
|
|
with pytest.raises(RemoteQueryError, match="query_error"):
|
|
engine.execute("DROP TABLE orders")
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify they fail**
|
|
|
|
Run: `pytest tests/test_remote_query.py -v`
|
|
Expected: FAIL — `ModuleNotFoundError: No module named 'src.remote_query'`
|
|
|
|
- [ ] **Step 3: Implement RemoteQueryEngine**
|
|
|
|
Create `src/remote_query.py`:
|
|
|
|
```python
|
|
"""Two-phase remote query engine.
|
|
|
|
Phase 1: Execute BigQuery subqueries, register results as in-memory Arrow tables.
|
|
Phase 2: Execute DuckDB query joining local Parquet views with BQ Arrow tables.
|
|
"""
|
|
|
|
import logging
|
|
import os
|
|
from typing import Any, Callable, Dict, List, Optional
|
|
|
|
import duckdb
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
# SQL blocklist — reused from app/api/query.py
|
|
_BLOCKED_KEYWORDS = [
|
|
"drop ", "delete ", "insert ", "update ", "alter ", "create ",
|
|
"copy ", "attach ", "detach ", "load ", "install ",
|
|
"export ", "import ", "pragma ", "call ",
|
|
"read_csv", "read_json", "read_parquet", "read_text",
|
|
"write_csv", "write_parquet", "read_blob", "read_ndjson",
|
|
"parquet_scan", "parquet_metadata", "parquet_schema",
|
|
"json_scan", "csv_scan",
|
|
"query_table", "iceberg_scan", "delta_scan",
|
|
"glob(", "list_files",
|
|
"'/", '"/', 'http://', 'https://', 's3://', 'gcs://',
|
|
"information_schema", "duckdb_tables", "duckdb_columns",
|
|
"duckdb_databases", "duckdb_settings", "duckdb_functions",
|
|
"duckdb_views", "duckdb_indexes", "duckdb_schemas",
|
|
"pragma_table_info", "pragma_storage_info",
|
|
"'../", '"../',
|
|
";",
|
|
]
|
|
|
|
|
|
class RemoteQueryError(Exception):
|
|
"""Structured error for remote query failures."""
|
|
|
|
def __init__(self, message: str, error_type: str, details: Optional[dict] = None):
|
|
super().__init__(message)
|
|
self.error_type = error_type
|
|
self.details = details or {}
|
|
|
|
|
|
class RemoteQueryEngine:
|
|
"""Two-phase query engine: BQ subqueries + DuckDB final query."""
|
|
|
|
def __init__(
|
|
self,
|
|
conn: duckdb.DuckDBPyConnection,
|
|
*,
|
|
_bq_client_factory: Optional[Callable] = None,
|
|
max_bq_registration_rows: int = 500_000,
|
|
max_memory_mb: float = 2048.0,
|
|
max_result_rows: int = 100_000,
|
|
timeout_seconds: int = 300,
|
|
):
|
|
self.conn = conn
|
|
self._bq_client_factory = _bq_client_factory
|
|
self.max_bq_registration_rows = max_bq_registration_rows
|
|
self.max_memory_mb = max_memory_mb
|
|
self.max_result_rows = max_result_rows
|
|
self.timeout_seconds = timeout_seconds
|
|
self._bq_stats: Dict[str, dict] = {}
|
|
|
|
def register_bq(self, alias: str, bq_sql: str) -> dict:
|
|
"""Execute BQ subquery, register result as in-memory DuckDB view.
|
|
|
|
Returns dict with {alias, rows, columns, memory_mb}.
|
|
Raises RemoteQueryError on failure.
|
|
"""
|
|
_validate_sql(bq_sql)
|
|
|
|
client = self._get_bq_client()
|
|
|
|
# Phase 1a: COUNT(*) pre-check
|
|
count_sql = f"SELECT COUNT(*) FROM ({bq_sql})"
|
|
try:
|
|
count_job = client.query(count_sql)
|
|
row_count = count_job.result().fetchone()[0]
|
|
except Exception as e:
|
|
raise RemoteQueryError(
|
|
f"BQ COUNT pre-check failed for '{alias}': {e}",
|
|
error_type="bq_error",
|
|
details={"alias": alias},
|
|
)
|
|
|
|
if row_count > self.max_bq_registration_rows:
|
|
raise RemoteQueryError(
|
|
f"BQ query '{alias}' returns {row_count:,} rows "
|
|
f"(limit: {self.max_bq_registration_rows:,})",
|
|
error_type="row_limit",
|
|
details={"alias": alias, "rows": row_count, "limit": self.max_bq_registration_rows},
|
|
)
|
|
|
|
# Phase 1b: Execute and register
|
|
try:
|
|
job = client.query(bq_sql)
|
|
try:
|
|
arrow_table = job.to_arrow()
|
|
except Exception:
|
|
arrow_table = job.to_arrow(create_bqstorage_client=False)
|
|
except Exception as e:
|
|
raise RemoteQueryError(
|
|
f"BQ query failed for '{alias}': {e}",
|
|
error_type="bq_error",
|
|
details={"alias": alias},
|
|
)
|
|
|
|
# Memory check (actual, not estimated)
|
|
memory_mb = arrow_table.nbytes / (1024 * 1024)
|
|
if memory_mb > self.max_memory_mb:
|
|
raise RemoteQueryError(
|
|
f"BQ result '{alias}' uses {memory_mb:.1f} MB "
|
|
f"(limit: {self.max_memory_mb:.0f} MB)",
|
|
error_type="memory_limit",
|
|
details={"alias": alias, "memory_mb": memory_mb, "limit": self.max_memory_mb},
|
|
)
|
|
|
|
self.conn.register(alias, arrow_table)
|
|
stats = {
|
|
"alias": alias,
|
|
"rows": arrow_table.num_rows,
|
|
"columns": arrow_table.num_columns,
|
|
"memory_mb": round(memory_mb, 3),
|
|
}
|
|
self._bq_stats[alias] = stats
|
|
logger.info("Registered BQ view '%s': %d rows, %.1f MB", alias, arrow_table.num_rows, memory_mb)
|
|
return stats
|
|
|
|
def execute(self, sql: str) -> dict:
|
|
"""Execute final DuckDB query. Returns {columns, rows, row_count, truncated, bq_stats}."""
|
|
_validate_sql(sql)
|
|
|
|
try:
|
|
result = self.conn.execute(sql).fetchmany(self.max_result_rows + 1)
|
|
columns = [desc[0] for desc in self.conn.description] if self.conn.description else []
|
|
except Exception as e:
|
|
raise RemoteQueryError(
|
|
f"Query execution failed: {e}",
|
|
error_type="query_error",
|
|
)
|
|
|
|
truncated = len(result) > self.max_result_rows
|
|
rows = result[:self.max_result_rows]
|
|
|
|
# Serialize non-standard types
|
|
serializable_rows = []
|
|
for row in rows:
|
|
serializable_rows.append([
|
|
str(v) if v is not None and not isinstance(v, (int, float, bool, str)) else v
|
|
for v in row
|
|
])
|
|
|
|
return {
|
|
"columns": columns,
|
|
"rows": serializable_rows,
|
|
"row_count": len(serializable_rows),
|
|
"truncated": truncated,
|
|
"bq_stats": dict(self._bq_stats),
|
|
}
|
|
|
|
def _get_bq_client(self):
|
|
"""Get BigQuery client, using factory or default."""
|
|
if self._bq_client_factory:
|
|
return self._bq_client_factory()
|
|
try:
|
|
from scripts.duckdb_manager import _create_bq_client
|
|
project = os.environ.get("BIGQUERY_PROJECT")
|
|
if not project:
|
|
raise RemoteQueryError(
|
|
"BIGQUERY_PROJECT env var not set",
|
|
error_type="bq_error",
|
|
)
|
|
return _create_bq_client(project)
|
|
except ImportError:
|
|
raise RemoteQueryError(
|
|
"google-cloud-bigquery is not installed. "
|
|
"Install with: pip install google-cloud-bigquery",
|
|
error_type="bq_error",
|
|
)
|
|
|
|
|
|
def _validate_sql(sql: str) -> None:
|
|
"""Validate SQL against blocklist. Raises RemoteQueryError."""
|
|
sql_lower = sql.strip().lower()
|
|
for keyword in _BLOCKED_KEYWORDS:
|
|
if keyword in sql_lower:
|
|
raise RemoteQueryError(
|
|
f"Blocked SQL keyword: {keyword.strip()}",
|
|
error_type="query_error",
|
|
)
|
|
if not sql_lower.startswith("select ") and not sql_lower.startswith("with "):
|
|
raise RemoteQueryError(
|
|
"Query must start with SELECT or WITH",
|
|
error_type="query_error",
|
|
)
|
|
|
|
|
|
def load_config() -> dict:
|
|
"""Load remote_query config from instance.yaml."""
|
|
try:
|
|
from app.instance_config import get_value
|
|
return get_value("remote_query") or {}
|
|
except Exception:
|
|
return {}
|
|
```
|
|
|
|
- [ ] **Step 4: Run tests**
|
|
|
|
Run: `pytest tests/test_remote_query.py -v`
|
|
Expected: ALL PASS
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add src/remote_query.py tests/test_remote_query.py
|
|
git commit -m "feat: add RemoteQueryEngine with BQ registration and safety limits"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 3: CLI `da query --register-bq`
|
|
|
|
**Files:**
|
|
- Modify: `cli/commands/query.py`
|
|
- Test: `tests/test_cli.py`
|
|
|
|
- [ ] **Step 1: Write failing test**
|
|
|
|
Add to `tests/test_cli.py`:
|
|
|
|
```python
|
|
class TestQueryHybrid:
|
|
def test_register_bq_flag_help(self):
|
|
result = runner.invoke(app, ["query", "--help"])
|
|
assert result.exit_code == 0
|
|
assert "register-bq" in result.output
|
|
```
|
|
|
|
- [ ] **Step 2: Run test to verify it fails**
|
|
|
|
Run: `pytest tests/test_cli.py::TestQueryHybrid -v`
|
|
Expected: FAIL — `register-bq` not in help output
|
|
|
|
- [ ] **Step 3: Implement CLI changes**
|
|
|
|
Replace `cli/commands/query.py` with:
|
|
|
|
```python
|
|
"""Query commands — da query."""
|
|
|
|
import json
|
|
import os
|
|
import sys
|
|
from pathlib import Path
|
|
from typing import List, Optional
|
|
|
|
import typer
|
|
|
|
|
|
def query_command(
|
|
sql: Optional[str] = typer.Argument(None, help="SQL query to execute"),
|
|
sql_opt: Optional[str] = typer.Option(None, "--sql", help="SQL query (alternative to positional)"),
|
|
remote: bool = typer.Option(False, "--remote", help="Execute on server instead of locally"),
|
|
register_bq: Optional[List[str]] = typer.Option(None, "--register-bq", help="Register BQ subquery: alias=SQL"),
|
|
stdin: bool = typer.Option(False, "--stdin", help="Read query spec from stdin (JSON)"),
|
|
fmt: str = typer.Option("table", "--format", "-f", help="Output format: table, json, csv"),
|
|
limit: int = typer.Option(1000, "--limit", help="Max rows to return"),
|
|
):
|
|
"""Execute SQL query against DuckDB. Supports hybrid BQ+local queries."""
|
|
# Resolve SQL from positional, --sql, or --stdin
|
|
if stdin:
|
|
spec = json.loads(sys.stdin.read())
|
|
final_sql = spec.get("sql", "")
|
|
register_bq = [f"{k}={v}" for k, v in spec.get("register_bq", {}).items()]
|
|
else:
|
|
final_sql = sql or sql_opt
|
|
if not final_sql:
|
|
typer.echo("Error: provide SQL as argument, --sql, or --stdin", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
if register_bq:
|
|
_query_hybrid(final_sql, register_bq, fmt, limit)
|
|
elif remote:
|
|
_query_remote(final_sql, fmt, limit)
|
|
else:
|
|
_query_local(final_sql, fmt, limit)
|
|
|
|
|
|
def _query_hybrid(sql: str, register_bq_specs: List[str], fmt: str, limit: int):
|
|
"""Run two-phase hybrid query: BQ subqueries + local DuckDB."""
|
|
import duckdb
|
|
from src.remote_query import RemoteQueryEngine, RemoteQueryError, load_config
|
|
|
|
local_dir = Path(os.environ.get("DA_LOCAL_DIR", "."))
|
|
db_path = local_dir / "user" / "duckdb" / "analytics.duckdb"
|
|
if not db_path.exists():
|
|
typer.echo("Local DuckDB not found. Run: da sync", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
config = load_config()
|
|
conn = duckdb.connect(str(db_path), read_only=True)
|
|
try:
|
|
engine = RemoteQueryEngine(
|
|
conn,
|
|
max_bq_registration_rows=config.get("max_bq_registration_rows", 500_000),
|
|
max_memory_mb=config.get("max_memory_mb", 2048),
|
|
max_result_rows=limit,
|
|
timeout_seconds=config.get("timeout_seconds", 300),
|
|
)
|
|
|
|
# Phase 1: Register BQ subqueries
|
|
for spec in register_bq_specs:
|
|
eq_idx = spec.index("=")
|
|
alias = spec[:eq_idx].strip()
|
|
bq_sql = spec[eq_idx + 1:].strip()
|
|
try:
|
|
stats = engine.register_bq(alias, bq_sql)
|
|
typer.echo(f" BQ '{alias}': {stats['rows']} rows, {stats['memory_mb']} MB", err=True)
|
|
except RemoteQueryError as e:
|
|
typer.echo(f"Error registering '{alias}': {e}", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
# Phase 2: Execute final query
|
|
try:
|
|
result = engine.execute(sql)
|
|
except RemoteQueryError as e:
|
|
typer.echo(f"Query error: {e}", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
_output(result["columns"], result["rows"], fmt)
|
|
if result["truncated"]:
|
|
typer.echo(f"(truncated at {limit} rows)", err=True)
|
|
finally:
|
|
conn.close()
|
|
|
|
|
|
def _query_local(sql: str, fmt: str, limit: int):
|
|
"""Run query against local DuckDB."""
|
|
import duckdb
|
|
|
|
local_dir = Path(os.environ.get("DA_LOCAL_DIR", "."))
|
|
db_path = local_dir / "user" / "duckdb" / "analytics.duckdb"
|
|
if not db_path.exists():
|
|
typer.echo("Local DuckDB not found. Run: da sync", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
conn = duckdb.connect(str(db_path), read_only=True)
|
|
try:
|
|
result = conn.execute(sql).fetchmany(limit)
|
|
columns = [desc[0] for desc in conn.description] if conn.description else []
|
|
_output(columns, result, fmt)
|
|
except Exception as e:
|
|
typer.echo(f"Query error: {e}", err=True)
|
|
raise typer.Exit(1)
|
|
finally:
|
|
conn.close()
|
|
|
|
|
|
def _query_remote(sql: str, fmt: str, limit: int):
|
|
"""Run query against server DuckDB via API."""
|
|
from cli.client import api_post
|
|
|
|
resp = api_post("/api/query", json={"sql": sql, "limit": limit})
|
|
if resp.status_code != 200:
|
|
typer.echo(f"Query failed: {resp.json().get('detail', resp.text)}", err=True)
|
|
raise typer.Exit(1)
|
|
|
|
data = resp.json()
|
|
_output(data["columns"], data["rows"], fmt)
|
|
if data.get("truncated"):
|
|
typer.echo(f"(truncated at {limit} rows)", err=True)
|
|
|
|
|
|
def _output(columns: list, rows: list, fmt: str):
|
|
if fmt == "json":
|
|
output = [dict(zip(columns, row)) for row in rows]
|
|
typer.echo(json.dumps(output, indent=2, default=str))
|
|
elif fmt == "csv":
|
|
typer.echo(",".join(columns))
|
|
for row in rows:
|
|
typer.echo(",".join(str(v) if v is not None else "" for v in row))
|
|
else:
|
|
from rich.console import Console
|
|
from rich.table import Table
|
|
console = Console()
|
|
table = Table()
|
|
for col in columns:
|
|
table.add_column(col)
|
|
for row in rows:
|
|
table.add_row(*(str(v) if v is not None else "" for v in row))
|
|
console.print(table)
|
|
```
|
|
|
|
- [ ] **Step 4: Run tests**
|
|
|
|
Run: `pytest tests/test_cli.py -v`
|
|
Expected: ALL PASS
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add cli/commands/query.py tests/test_cli.py
|
|
git commit -m "feat: add --register-bq and --stdin to da query for hybrid BQ+local queries"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 4: API Endpoint `POST /api/query/hybrid`
|
|
|
|
**Files:**
|
|
- Create: `app/api/query_hybrid.py`
|
|
- Modify: `app/main.py` (register router)
|
|
- Test: `tests/test_api.py`
|
|
|
|
- [ ] **Step 1: Write failing tests**
|
|
|
|
Add to `tests/test_api.py`:
|
|
|
|
```python
|
|
class TestHybridQueryAPI:
|
|
def test_hybrid_query_requires_admin(self, seeded_client):
|
|
client, _, analyst_token = seeded_client
|
|
resp = client.post(
|
|
"/api/query/hybrid",
|
|
json={"sql": "SELECT 1", "register_bq": {}},
|
|
headers={"Authorization": f"Bearer {analyst_token}"},
|
|
)
|
|
assert resp.status_code == 403
|
|
|
|
def test_hybrid_query_local_only(self, seeded_client):
|
|
"""Hybrid endpoint works without BQ registrations (just local query)."""
|
|
client, admin_token, _ = seeded_client
|
|
resp = client.post(
|
|
"/api/query/hybrid",
|
|
json={"sql": "SELECT 1 AS val", "register_bq": {}},
|
|
headers={"Authorization": f"Bearer {admin_token}"},
|
|
)
|
|
assert resp.status_code == 200
|
|
data = resp.json()
|
|
assert data["columns"] == ["val"]
|
|
assert data["rows"] == [[1]]
|
|
|
|
def test_hybrid_query_blocked_sql(self, seeded_client):
|
|
client, admin_token, _ = seeded_client
|
|
resp = client.post(
|
|
"/api/query/hybrid",
|
|
json={"sql": "DROP TABLE users", "register_bq": {}},
|
|
headers={"Authorization": f"Bearer {admin_token}"},
|
|
)
|
|
assert resp.status_code == 400
|
|
|
|
def test_hybrid_query_blocked_bq_sql(self, seeded_client):
|
|
client, admin_token, _ = seeded_client
|
|
resp = client.post(
|
|
"/api/query/hybrid",
|
|
json={
|
|
"sql": "SELECT 1",
|
|
"register_bq": {"x": "DROP TABLE something"},
|
|
},
|
|
headers={"Authorization": f"Bearer {admin_token}"},
|
|
)
|
|
assert resp.status_code == 400
|
|
```
|
|
|
|
- [ ] **Step 2: Run tests to verify they fail**
|
|
|
|
Run: `pytest tests/test_api.py::TestHybridQueryAPI -v`
|
|
Expected: FAIL — 404 on `/api/query/hybrid`
|
|
|
|
- [ ] **Step 3: Implement API endpoint**
|
|
|
|
Create `app/api/query_hybrid.py`:
|
|
|
|
```python
|
|
"""Hybrid query endpoint — two-phase BQ + DuckDB queries."""
|
|
|
|
from typing import Dict, Optional
|
|
|
|
from fastapi import APIRouter, Depends, HTTPException
|
|
from pydantic import BaseModel
|
|
import duckdb
|
|
|
|
from app.auth.dependencies import require_admin, _get_db
|
|
from src.db import get_analytics_db_readonly
|
|
from src.remote_query import RemoteQueryEngine, RemoteQueryError, load_config
|
|
|
|
router = APIRouter(prefix="/api/query", tags=["query"])
|
|
|
|
|
|
class HybridQueryRequest(BaseModel):
|
|
sql: str
|
|
register_bq: Dict[str, str] = {}
|
|
format: str = "json"
|
|
|
|
|
|
@router.post("/hybrid")
|
|
async def hybrid_query(
|
|
request: HybridQueryRequest,
|
|
user: dict = Depends(require_admin),
|
|
):
|
|
"""Execute a two-phase hybrid query: BQ subqueries + DuckDB final query."""
|
|
config = load_config()
|
|
analytics = get_analytics_db_readonly()
|
|
try:
|
|
engine = RemoteQueryEngine(
|
|
analytics,
|
|
max_bq_registration_rows=config.get("max_bq_registration_rows", 500_000),
|
|
max_memory_mb=config.get("max_memory_mb", 2048),
|
|
max_result_rows=config.get("max_result_rows", 100_000),
|
|
timeout_seconds=config.get("timeout_seconds", 300),
|
|
)
|
|
|
|
# Phase 1: Register BQ subqueries
|
|
for alias, bq_sql in request.register_bq.items():
|
|
try:
|
|
engine.register_bq(alias, bq_sql)
|
|
except RemoteQueryError as e:
|
|
raise HTTPException(
|
|
status_code=400,
|
|
detail=f"BQ registration '{alias}' failed: {e.error_type}: {str(e)}",
|
|
)
|
|
|
|
# Phase 2: Execute final query
|
|
try:
|
|
result = engine.execute(request.sql)
|
|
except RemoteQueryError as e:
|
|
raise HTTPException(
|
|
status_code=400,
|
|
detail=f"Query failed: {e.error_type}: {str(e)}",
|
|
)
|
|
|
|
return result
|
|
finally:
|
|
analytics.close()
|
|
```
|
|
|
|
Register in `app/main.py`:
|
|
|
|
```python
|
|
from app.api.query_hybrid import router as query_hybrid_router
|
|
# ...
|
|
app.include_router(query_hybrid_router) # before web_router
|
|
```
|
|
|
|
- [ ] **Step 4: Run tests**
|
|
|
|
Run: `pytest tests/test_api.py::TestHybridQueryAPI -v`
|
|
Expected: ALL PASS
|
|
|
|
- [ ] **Step 5: Commit**
|
|
|
|
```bash
|
|
git add app/api/query_hybrid.py app/main.py tests/test_api.py
|
|
git commit -m "feat: add POST /api/query/hybrid endpoint for two-phase BQ+DuckDB queries"
|
|
```
|
|
|
|
---
|
|
|
|
### Task 5: CLAUDE.md + Integration Test
|
|
|
|
**Files:**
|
|
- Modify: `CLAUDE.md`
|
|
- Test: run full suite
|
|
|
|
- [ ] **Step 1: Add hybrid query docs to CLAUDE.md**
|
|
|
|
After the "## Business Metrics" section, add:
|
|
|
|
```markdown
|
|
## Hybrid Queries (BigQuery + Local)
|
|
|
|
For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:
|
|
|
|
```bash
|
|
da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
|
|
--register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
|
|
```
|
|
|
|
The `--register-bq` flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple `--register-bq` flags can be used for multiple BQ sources.
|
|
|
|
For complex SQL, use stdin mode:
|
|
```bash
|
|
echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | da query --stdin
|
|
```
|
|
```
|
|
|
|
- [ ] **Step 2: Run full test suite**
|
|
|
|
Run: `pytest tests/ -v --timeout=60`
|
|
Expected: ALL PASS
|
|
|
|
- [ ] **Step 3: Commit**
|
|
|
|
```bash
|
|
git add CLAUDE.md
|
|
git commit -m "docs: add hybrid query usage instructions to CLAUDE.md"
|
|
```
|