Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors.

Backend (BigQuery + Keboola, schema v20):
- schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19)
- connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...): BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state
- connectors/keboola/access.py: new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc
- connectors/keboola/extractor.py adds materialize_query: same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows
- app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query
- app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert, a bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload

Operator + CLI surface:
- da admin register-table --query / --query-mode materialized
- scripts/smoke-test-materialized-bq.sh: end-to-end smoke for operators

Tests (incl. spike + integration + regression):
- test_db_migration_v20, test_table_registry_source_query
- test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips
- test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds)
- test_sync_trigger_materialized, test_sync_trigger_keboola_materialized
- test_api_admin_materialized, test_cli_admin_materialized
- test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e

Cost: BQ uses bigquery_query() (jobs API, view-aware), which works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.
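
The dry-run cost guardrail described for the BigQuery path can be pictured roughly as follows. This is a minimal sketch and not the code in connectors/bigquery/extractor.py: the names `check_materialize_cost`, `MaterializeBudgetExceeded`, and the `dry_run` callable are hypothetical, and the strict "greater than" comparison is an assumption. Only the 10 GiB default comes from the description above.

```python
from typing import Callable

# 10 GiB, the default limit named in the description.
DEFAULT_MAX_BYTES = 10 * 1024**3


class MaterializeBudgetExceeded(RuntimeError):
    """Hypothetical error type; the real extractor may raise something else."""


def check_materialize_cost(
    sql: str,
    *,
    dry_run: Callable[[str], int],
    max_bytes: int = DEFAULT_MAX_BYTES,
) -> int:
    """Return the estimated bytes scanned, or raise if over the limit.

    `dry_run` stands in for whatever performs the BigQuery dry run and
    returns the estimated number of bytes the query would process.
    """
    estimated = dry_run(sql)
    if estimated > max_bytes:  # assumption: a query exactly at the limit passes
        raise MaterializeBudgetExceeded(
            f"query would scan {estimated} bytes, limit is {max_bytes}"
        )
    return estimated
```

Passing the dry run in as a callable keeps the check testable without BigQuery credentials; a caller would run it before the materializing query and skip the job when it raises.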
43 lines
1.4 KiB
Python
"""DuckDB session facade for the Keboola Storage API extension.

Parallel of `connectors/bigquery/access.py:BqAccess`. The materialized
Keboola SQL path needs a one-shot DuckDB connection with the Keboola
extension installed, loaded, and ATTACHed; this facade encapsulates
that wiring so `_run_materialized_pass` doesn't need to know the
extension name, the ATTACH alias, or how the token gets quoted into
the URL literal.
"""
from __future__ import annotations
from contextlib import contextmanager
from typing import Iterator

import duckdb


class KeboolaAccess:
    """Lazy DuckDB session manager for the Keboola Storage API extension.

    Single-use: call `.duckdb_session()` as a context manager once per
    materialized job.
    """

    def __init__(self, *, url: str, token: str) -> None:
        if not url or not token:
            raise ValueError("KeboolaAccess requires url and token")
        self._url = url
        self._token = token

    @contextmanager
    def duckdb_session(self) -> Iterator[duckdb.DuckDBPyConnection]:
        conn = duckdb.connect(":memory:")
        try:
            conn.execute("INSTALL keboola FROM community")
            conn.execute("LOAD keboola")
            escaped_token = self._token.replace("'", "''")
            conn.execute(
                f"ATTACH '{self._url}' AS kbc "
                f"(TYPE keboola, TOKEN '{escaped_token}')"
            )
            yield conn
        finally:
            conn.close()
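
The quoting step inside `duckdb_session` can be exercised without DuckDB or network access by factoring it into a pure function. The sketch below is illustrative only: `sql_literal` and `build_attach_sql` are hypothetical helpers that do not exist in the file, and unlike the code above this version also escapes the URL, on the assumption that a stray single quote in either value should not break the statement.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_attach_sql(url: str, token: str, alias: str = "kbc") -> str:
    """Build the ATTACH statement text that the facade sends to DuckDB.

    The alias is interpolated as-is, so it must be a trusted identifier.
    """
    return (
        f"ATTACH {sql_literal(url)} AS {alias} "
        f"(TYPE keboola, TOKEN {sql_literal(token)})"
    )


if __name__ == "__main__":
    # Placeholder values; a real call would use the configured URL and token.
    print(build_attach_sql("https://example.invalid", "tok'en"))
```

Keeping the string building separate from the connection makes the escaping rule unit-testable, while the connection lifecycle stays in the context manager.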