perf: Tier 1 event-loop unblocking — async def → def on BQ-bound handlers
Five hottest BQ-touching endpoints were `async def` but invoked synchronous DuckDB / BQ-extension calls inside the body. Under uvicorn's single event loop that meant a single heavy `agnes query --remote` (waiting up to ~200 s for BQ's jobs.query) froze EVERY other request — /api/health, dashboard, auth, even another query — for the full BQ wait. Operators saw "VM idle, app frozen" during PR #188's testing. Convert to plain `def` so FastAPI auto-offloads the body to the anyio thread pool. Event loop stays free for non-BQ requests. - app/api/query.py:execute_query - app/api/v2_scan.py:scan_estimate_endpoint, scan_endpoint - app/api/v2_sample.py:sample - app/api/v2_schema.py:schema Audit: 0 `await` statements in any converted handler (verified file-by- file), so the rename is safe. Tests in tests/test_v2_*.py called the handlers via `asyncio.run(...)` which now fails on a non-coroutine return; swapped for direct calls (asyncio.run( -> ( ) — keeps paren balance). Plus AGNES_THREADPOOL_SIZE env var (default 200, was anyio's stock 40) in app/main.py:lifespan. Set via anyio.to_thread.current_default_thread_limiter().total_tokens. 200 is comfortable headroom for <50 concurrent analysts; bump for more. 480/480 impacted tests pass (the 2 remaining errors are a pre-existing fixture setup issue in test_reader_smoke_matrix.py unrelated to this change).
This commit is contained in:
parent
7a72ea9c37
commit
e5fb913cec
10 changed files with 62 additions and 21 deletions
|
|
@ -10,6 +10,10 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
## [Unreleased]
|
||||
|
||||
### Changed
|
||||
- **Tier 1 event-loop unblocking** — the five hottest BQ-touching endpoints (`POST /api/query`, `POST /api/v2/scan`, `POST /api/v2/scan/estimate`, `GET /api/v2/sample/{id}`, `GET /api/v2/schema/{id}`) were declared `async def` but invoked synchronous DuckDB / BQ-extension calls inside the body. Under uvicorn's single event loop that meant a single heavy `agnes query --remote` (waiting up to ~200 s for BQ's `jobs.query` to return) **froze every other request** — `/api/health`, the dashboard, auth, even another query — for the full duration of the BQ wait. Operators saw "VM idle, app frozen" symptoms during this work. Converted all five to plain `def` so FastAPI auto-offloads the blocking body to the anyio thread pool; the event loop stays free for non-BQ requests. Verified via 0-await audit (no `await` statements in the converted handlers, so the rename is safe). Tests: `tests/test_v2_*.py` were rewritten to call the handlers directly instead of `asyncio.run(...)` (which now fails on a non-coroutine return). Pairs with the thread-pool capacity bump below.
|
||||
- **`AGNES_THREADPOOL_SIZE` env var** (default 200, was anyio's stock 40) controls the FastAPI / Starlette thread pool capacity used by every plain-`def` route handler. Set in `app/main.py:lifespan` via `anyio.to_thread.current_default_thread_limiter().total_tokens`. 200 leaves comfortable headroom over the BQ extension's connection budget while keeping the per-process thread cost bounded — for the workload of <50 concurrent analysts this is well over what's needed; bump for higher concurrency.
|
||||
|
||||
### Added
|
||||
- **`data_source.bigquery.query_timeout_ms` config knob** (default 600 000 ms = 10 min). The DuckDB BigQuery extension's built-in default of 90 s was too tight for analyst-scale queries against view-backed BQ datasets — `agnes query --remote` would HTTP 400 with `Binder Error: Query execution exceeded the timeout. Job ID: …` whenever the underlying BQ job took longer than 90 s, even though the BQ job itself was healthy. The new knob is applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session — the orchestrator's `_remote_attach` ATTACH path (`src/orchestrator.py`), the analytics-DB read-only reattach path (`src/db.py:_reattach_remote_extensions` — the primary `agnes query --remote` request path), the `BqAccess` session factory (`connectors/bigquery/access.py`), and the standalone extractor (`connectors/bigquery/extractor.py`). Sentinel `0` (or non-numeric / unparseable values) leaves the extension default in place so operators on legacy extension versions that don't recognise the setting aren't broken. Configurable via `/admin/server-config` UI. Note: BigQuery's `jobs.query` RPC caps the wait at ~200 s per call regardless of this setting; the extension polls on top so the effective ceiling is the value here but each poll is ~200 s. DuckDB emits an informational warning when the value is set above the BQ RPC cap — operators can safely ignore it.
|
||||
- **Per-user parallel parquet downloads in `agnes pull`** — the download loop in `cli/lib/pull.py` now uses a `ThreadPoolExecutor` with concurrency capped by the new `AGNES_PULL_PARALLELISM` env var (default 4, set 1 to restore pre-PR serial behavior). On a registry of N tables the wall-clock time drops from `Σ stream_download_seconds(table_i)` to roughly `max × ceil(N/4)`. Works hand-in-hand with the Caddy `file_server` change below: without it parallel client-side downloads would still queue on the single uvicorn worker; with it each request is its own caddy goroutine + sendfile, so 4-way parallelism actually delivers throughput. Per-table error semantics preserved — a failure on one table no longer aborts the rest of the batch.
|
||||
|
|
|
|||
|
|
@ -71,12 +71,23 @@ class QueryResponse(BaseModel):
|
|||
|
||||
|
||||
@router.post("", response_model=QueryResponse)
|
||||
async def execute_query(
|
||||
def execute_query(
|
||||
request: QueryRequest,
|
||||
user: dict = Depends(get_current_user),
|
||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
):
|
||||
"""Execute SQL against the server analytics DuckDB."""
|
||||
"""Execute SQL against the server analytics DuckDB.
|
||||
|
||||
Plain ``def`` (not ``async def``) so FastAPI auto-offloads the call
|
||||
to the anyio thread pool. The body invokes ``analytics.execute(sql)``
|
||||
synchronously, which blocks for the full BQ jobs.query wait when a
|
||||
referenced view resolves through the BQ extension. Under ``async def``
|
||||
that block holds the single uvicorn event loop, freezing every other
|
||||
request (UI, /api/health, auth) until the query returns. Plain ``def``
|
||||
runs each invocation on its own thread, so heavy queries no longer
|
||||
starve unrelated endpoints. See PR #188's CHANGELOG entry for the
|
||||
Tier 1 event-loop unblocking rollout.
|
||||
"""
|
||||
sql_lower = request.sql.strip().lower()
|
||||
|
||||
# Block everything except SELECT
|
||||
|
|
|
|||
|
|
@ -104,13 +104,15 @@ def build_sample(
|
|||
|
||||
|
||||
@router.get("/sample/{table_id}")
|
||||
async def sample(
|
||||
def sample(
|
||||
table_id: str,
|
||||
n: int = Query(default=5, ge=1, le=_MAX_N),
|
||||
user: dict = Depends(get_current_user),
|
||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
bq: BqAccess = Depends(get_bq_access),
|
||||
):
|
||||
# Plain ``def`` — opens a `bq.duckdb_session()` and runs sync queries
|
||||
# through the BQ extension. See PR #188 Tier 1 entry.
|
||||
try:
|
||||
return build_sample(conn, user, table_id, n=n, bq=bq)
|
||||
except FileNotFoundError:
|
||||
|
|
|
|||
|
|
@ -218,12 +218,17 @@ def _avg_bytes_for_type(t: str) -> int:
|
|||
|
||||
|
||||
@router.post("/scan/estimate")
|
||||
async def scan_estimate_endpoint(
|
||||
def scan_estimate_endpoint(
|
||||
raw: dict,
|
||||
user: dict = Depends(get_current_user),
|
||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
bq: BqAccess = Depends(get_bq_access),
|
||||
):
|
||||
# Plain ``def`` so FastAPI auto-offloads to the anyio thread pool — the
|
||||
# estimate path calls into google-cloud-bigquery's `client.query(...,
|
||||
# dry_run=True)` which blocks until BQ returns the dry-run cost. Under
|
||||
# ``async def`` that wait holds the event loop. See PR #188's Tier 1
|
||||
# entry for the wider rollout.
|
||||
try:
|
||||
return estimate(conn, user, raw, bq=bq)
|
||||
except WhereValidationError as e:
|
||||
|
|
@ -374,7 +379,7 @@ def run_scan(
|
|||
|
||||
|
||||
@router.post("/scan")
|
||||
async def scan_endpoint(
|
||||
def scan_endpoint(
|
||||
raw: dict,
|
||||
user: dict = Depends(get_current_user),
|
||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
|
|
|
|||
|
|
@ -209,12 +209,14 @@ def build_schema(
|
|||
|
||||
|
||||
@router.get("/schema/{table_id}")
|
||||
async def schema(
|
||||
def schema(
|
||||
table_id: str,
|
||||
user: dict = Depends(get_current_user),
|
||||
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||
bq: BqAccess = Depends(get_bq_access),
|
||||
):
|
||||
# Plain ``def`` — opens a `bq.duckdb_session()` and runs sync metadata
|
||||
# queries through the BQ extension. See PR #188 Tier 1 entry.
|
||||
try:
|
||||
return build_schema(conn, user, table_id, bq=bq)
|
||||
except NotFound:
|
||||
|
|
|
|||
17
app/main.py
17
app/main.py
|
|
@ -141,6 +141,23 @@ async def lifespan(app):
|
|||
log_effective_policy()
|
||||
except Exception:
|
||||
pass # never block startup on a logging convenience
|
||||
|
||||
# Bump anyio's default thread pool size from 40 → AGNES_THREADPOOL_SIZE
|
||||
# (default 200). FastAPI auto-runs every plain `def` route handler in
|
||||
# this pool — the Tier 1 endpoints converted in PR #188 (`/api/query`,
|
||||
# `/api/v2/scan`, `/api/v2/sample`, `/api/v2/schema`) all block on
|
||||
# synchronous DuckDB / BQ-extension calls inside the handler body and
|
||||
# would otherwise serialise once 40 are in flight. 200 keeps the per-
|
||||
# process working set well under the BQ extension's connection cap
|
||||
# while leaving headroom for concurrent UI / health probes.
|
||||
try:
|
||||
import anyio.to_thread
|
||||
size = int(os.environ.get("AGNES_THREADPOOL_SIZE", "200"))
|
||||
anyio.to_thread.current_default_thread_limiter().total_tokens = size
|
||||
logger.info("anyio thread pool capacity set to %d", size)
|
||||
except Exception as e:
|
||||
logger.warning("failed to bump anyio thread pool capacity: %s", e)
|
||||
|
||||
yield
|
||||
from src.db import close_system_db
|
||||
close_system_db()
|
||||
|
|
|
|||
|
|
@ -152,7 +152,7 @@ class TestBqAccessErrors:
|
|||
# Endpoint is async — drive it directly. dependency_overrides only
|
||||
# fires through TestClient/HTTP, so pass `bq=bq` explicitly.
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_sample.sample(
|
||||
(v2_sample.sample(
|
||||
table_id="bq_view", n=5, user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -182,7 +182,7 @@ class TestBqAccessErrors:
|
|||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_sample.sample(
|
||||
(v2_sample.sample(
|
||||
table_id="bq_view", n=5, user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -211,7 +211,7 @@ class TestBqAccessErrors:
|
|||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_sample.sample(
|
||||
(v2_sample.sample(
|
||||
table_id="bq_view", n=5, user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -248,7 +248,7 @@ class TestBqAccessErrors:
|
|||
try:
|
||||
_seed(conn)
|
||||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
asyncio.run(v2_sample.sample(
|
||||
(v2_sample.sample(
|
||||
table_id="bq_view", n=5, user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
|
|||
|
|
@ -298,7 +298,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
@ -337,7 +337,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
@ -372,7 +372,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
|
|||
|
|
@ -117,7 +117,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_estimate_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
@ -156,7 +156,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_estimate_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
@ -191,7 +191,7 @@ class TestBqAccessErrors:
|
|||
lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
|
||||
):
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(
|
||||
(
|
||||
v2_scan.scan_estimate_endpoint(raw=req, user=user, conn=conn, bq=bq)
|
||||
)
|
||||
finally:
|
||||
|
|
|
|||
|
|
@ -161,7 +161,7 @@ class TestBqAccessErrors:
|
|||
# Endpoint is async — drive it directly. dependency_overrides only
|
||||
# fires through TestClient/HTTP, so pass `bq=bq` explicitly.
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_schema.schema(
|
||||
(v2_schema.schema(
|
||||
table_id="bq_view", user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -191,7 +191,7 @@ class TestBqAccessErrors:
|
|||
_seed_bq_table(conn)
|
||||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_schema.schema(
|
||||
(v2_schema.schema(
|
||||
table_id="bq_view", user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -217,7 +217,7 @@ class TestBqAccessErrors:
|
|||
_seed_bq_table(conn)
|
||||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
with pytest.raises(HTTPException) as exc_info:
|
||||
asyncio.run(v2_schema.schema(
|
||||
(v2_schema.schema(
|
||||
table_id="bq_view", user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
@ -259,7 +259,7 @@ class TestBqAccessErrors:
|
|||
try:
|
||||
_seed_bq_table(conn)
|
||||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
data = asyncio.run(v2_schema.schema(
|
||||
data = (v2_schema.schema(
|
||||
table_id="bq_view", user=user, conn=conn, bq=_bq(),
|
||||
))
|
||||
finally:
|
||||
|
|
@ -322,7 +322,7 @@ class TestBqAccessErrors:
|
|||
try:
|
||||
_seed_bq_table(conn)
|
||||
user = {"id": "admin1", "email": "a@x.com"}
|
||||
asyncio.run(v2_schema.schema(
|
||||
(v2_schema.schema(
|
||||
table_id="bq_view", user=user, conn=conn, bq=bq,
|
||||
))
|
||||
finally:
|
||||
|
|
|
|||
Loading…
Reference in a new issue