SQL using only a full backtick path (`<proj>.<dataset>.<table>`) as the
table reference had neither bare name_lookups nor direct bq.ds.tbl matches,
so _rewrite_user_sql_for_bigquery_query's Skip 1 returned the original SQL
unchanged. DuckDB then rejected the backtick syntax locally with
"syntax error at or near `"" before the query ever reached BigQuery.
Detect _BACKTICK_FULL_PATH matches in the rewriter and include them in the
Skip 1 guard so the SQL gets wrapped in bigquery_query(). No identifier
rewrite is needed — backtick paths are already BQ-native and
_rewrite_bq_table_refs_to_native preserves them verbatim via its
backtick-split pass.
Closes#363
`_rewrite_user_sql_for_bigquery_query` does its own bare-name detection
(mirroring the non-RBAC parts of `_bq_guardrail_inputs`). The backtick
masking from #201 was applied to `_bq_guardrail_inputs` and the
forbidden-table loop, but missed this third site — so a registered
local-mode table name appearing as the table segment of a
user-supplied full backtick path (e.g. ``\`prj.ds.orders\`` matching
registered local ``orders``) tripped the cross-source guard and
forced every backtick-path query into the 50-100× slower
ATTACH-catalog fallback.
Mask once at the top of the function, route both the BQ-name
detection (line ~830) and the cross-source check (line ~867) through
the masked copy. New regression test
`test_local_name_inside_backtick_path_does_not_trip_cross_source`
proves the wrapper now wraps when it should.
In cross-project BQ setups (where billing != data), the SA typically has
serviceusage.services.use on the billing project but not on the data
project. The rewriter passed bq.projects.data as the first arg to
bigquery_query(), which BQ uses as the execution + billing project →
403 USER_PROJECT_DENIED.
Match the convention used everywhere else in the codebase
(app/api/v2_scan.py, app/api/v2_sample.py, app/api/v2_schema.py,
connectors/bigquery/extractor.py): backtick paths inside the inner SQL
use the **data** project (resolves the actual table location), the
bigquery_query() first arg uses the **billing** project (decides who
pays + which project the job runs under). For single-project deploys
the two are identical so the fix is a no-op there.
Test pins the cross-project case: data-prj for backticks, billing-prj
for the bigquery_query() first arg.
R1 adversarial review surfaced 5 issues, all addressed:
#1 chunked download silently disabled in non-Caddy deployments (HEAD on
GET-only FastAPI route returns 405). _probe_range_support now falls back
to GET with Range: bytes=0-0 when HEAD fails — works against both
Caddy file_server (HEAD-friendly) and dev FastAPI direct (GET-only).
#2 parse-error fallback heuristic too broad — matched on Unrecognized
name / Function not found / No matching signature / Invalid cast,
which BQ surfaces for ordinary user-column typos. That triggered slow
ATTACH-catalog retry on every typo (2× latency tax). Narrowed to just
'Syntax error' / 'syntax error' which are the genuine DuckDB-vs-BQ
dialect mismatch markers.
#3 apply_bq_session_settings was only run on fresh-built pool entries,
not on reuse. An operator's /admin/server-config change to bq_query
_timeout_ms wouldn't propagate to long-lived pooled sessions until
restart. Fixed: re-apply on every pool acquire (idempotent + fail-soft).
#4 content-length sanity bound — a misconfigured proxy returning a
wildly inflated Content-Length would cause overlapping chunked Range
requests against the actual file → corrupt assembled output (caught
by manifest hash check, but only after wasted bandwidth). Cap at 100
GiB; above that, drop to single-stream.
#5 rewriter assumed every BQ row resolves under the single
bq.projects.data project. Bucket containing '.' suggests a project-
qualified bucket (multi-project deployment); rewriter would silently
target the wrong project. Conservative skip with regression test.
Address code-reviewer findings on the bigquery_query() rewrite path:
1. Outer LIMIT wrap — bigquery_query() materialises BQ result into DuckDB
before fetchmany sees it (vs ATTACH-catalog Storage Read API streaming).
A user 'SELECT *' against a billion-row remote table would buffer the
entire result before request.limit applied. Wrap rewritten SQL in an
outer 'LIMIT N+1' so the cap pushes into the BQ job itself.
2. Dollar-quoted inner SQL — naive replace("'", "''") doubling missed
DuckDB backslash-escape sequences (\\, \\n, \\t, …). A predicate
like 'WHERE name = ''O\\'Brien''' was unsafe under the doubling
path. DuckDB $bqq_inner$ … $bqq_inner$ form takes the inner SQL
verbatim with no escapes whatsoever. Falls back to legacy doubling
if user SQL improbably contains the literal tag.
3. Parse-error fallback — when the rewritten path fails with a BQ-side
parse / validation error (DuckDB-only syntax like ::INT cast that
survives identifier rewrite but BQ refuses), retry the user's
original SQL via the legacy ATTACH-catalog path so the request still
succeeds. Mirrors the existing dry-run fallback contract.
4. CHANGELOG — delete duplicate CLI bullets that landed under
already-released [0.38.1] (file corruption from merge — entries are
correctly under [0.39.0]).
User SQL hitting query_mode='remote' BigQuery rows was 50-100x slower
than the equivalent direct bigquery_query() call because DuckDB's master
view (CREATE VIEW … AS SELECT * FROM bigquery.<ds>.<tbl>) does not push
WHERE/SELECT/LIMIT into BQ in ATTACH-catalog mode. The BQ extension opens
a Storage Read API session over the entire upstream table; on >100M-row
sources this was 70-150s and frequently failed with 'Response too large
to return'.
Extract the existing dry-run rewriter's core (table-name → BQ-native
backtick path) into a shared helper. Add an execution-path rewriter
that wraps the whole user SQL in bigquery_query('<project>', '<inner>')
so the BQ planner sees the full query and engages partition pruning +
projection pushdown server-side.
Conservative fall-through: cross-source JOINs (BQ ↔ Keboola/Jira local),
queries already containing bigquery_query(, and unconfigured BQ project
all skip the rewrite and run the original SQL via ATTACH-catalog so
behavior degrades gracefully.