agnes-the-ai-analyst/tests/test_keboola_extension_query_passthrough.py
ZdenekSrotyr 85d3810535 feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst
Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors.

Backend (BigQuery + Keboola, schema v20):
  - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19)
  - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state
  - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc
  - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows
  - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query
  - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator enforces consistency between mode, source_query, and bucket; PUT preserves omitted fields; sync_strategy and profile_after_sync are marked deprecated (Field(deprecated=True)) because no extractor reads them, and profile_after_sync becomes inert (a bug from earlier work meant /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into the GET /server-config payload

Operator + CLI surface:
  - da admin register-table --query / --query-mode materialized
  - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators

Tests (incl. spike + integration + regression):
  - test_db_migration_v20, test_table_registry_source_query
  - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips
  - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds)
  - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized
  - test_api_admin_materialized, test_cli_admin_materialized
  - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e

Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.
2026-05-01 20:25:56 +02:00


"""Lock-in test for the DuckDB Keboola extension's query-passthrough
capability that the Keboola materialized path depends on.
Runs only when the KBC_TEST_URL, KBC_TEST_TOKEN, KBC_TEST_BUCKET, and
KBC_TEST_TABLE env vars are all set (CI without real Keboola credentials
skips). Local dev with a real Storage API token exercises the path.
"""
import os

import duckdb
import pytest

KBC_URL = os.environ.get("KBC_TEST_URL")
KBC_TOKEN = os.environ.get("KBC_TEST_TOKEN")
KBC_BUCKET = os.environ.get("KBC_TEST_BUCKET")
KBC_TABLE = os.environ.get("KBC_TEST_TABLE")

# Skip the whole module unless every integration setting is present.
pytestmark = pytest.mark.skipif(
    not all([KBC_URL, KBC_TOKEN, KBC_BUCKET, KBC_TABLE]),
    reason="Keboola integration creds not provided",
)


def test_extension_supports_attach_and_select(tmp_path):
    """Keboola extension must support: ATTACH 'keboola://...' AS kbc, then
    SELECT * FROM kbc.bucket.table. The Keboola materialized path uses this
    primitive at runtime (just like connectors/keboola/extractor.py:133)."""
    conn = duckdb.connect(str(tmp_path / "spike.duckdb"))
    conn.execute("INSTALL keboola FROM community")
    conn.execute("LOAD keboola")
    # Double any single quotes so the token is a valid SQL string literal.
    escaped_token = KBC_TOKEN.replace("'", "''")
    conn.execute(f"ATTACH '{KBC_URL}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
    rows = conn.execute(
        f'SELECT COUNT(*) FROM kbc."{KBC_BUCKET}"."{KBC_TABLE}"'
    ).fetchone()
    assert rows[0] >= 0  # any non-negative count is fine; we're testing the path works


def test_extension_supports_copy_to_parquet(tmp_path):
    """Keboola materialized writes the SELECT result via
    `COPY (...) TO '...' (FORMAT PARQUET)`. Lock that primitive."""
    conn = duckdb.connect(str(tmp_path / "spike.duckdb"))
    conn.execute("INSTALL keboola FROM community")
    conn.execute("LOAD keboola")
    escaped_token = KBC_TOKEN.replace("'", "''")
    conn.execute(f"ATTACH '{KBC_URL}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
    parquet_path = tmp_path / "out.parquet"
    # Escape the output path the same way before embedding it in the COPY statement.
    safe_lit = str(parquet_path).replace("'", "''")
    conn.execute(
        f'COPY (SELECT * FROM kbc."{KBC_BUCKET}"."{KBC_TABLE}" LIMIT 5) '
        f"TO '{safe_lit}' (FORMAT PARQUET)"
    )
    assert parquet_path.exists() and parquet_path.stat().st_size > 0


def test_extension_supports_filtered_query(tmp_path):
    """Most important capability: a non-trivial WHERE/projection survives.
    This is what 'Custom SQL' mode actually relies on."""
    conn = duckdb.connect(str(tmp_path / "spike.duckdb"))
    conn.execute("INSTALL keboola FROM community")
    conn.execute("LOAD keboola")
    escaped_token = KBC_TOKEN.replace("'", "''")
    conn.execute(f"ATTACH '{KBC_URL}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
    parquet_path = tmp_path / "filtered.parquet"
    safe_lit = str(parquet_path).replace("'", "''")
    # The test table's schema is not known here, so this uses a constant
    # projection with LIMIT rather than a column-specific WHERE. Whether the
    # extension pushes such clauses down or DuckDB evaluates them
    # client-side, either is acceptable for our materialized path.
    conn.execute(
        f'COPY (SELECT 1 AS marker FROM kbc."{KBC_BUCKET}"."{KBC_TABLE}" LIMIT 3) '
        f"TO '{safe_lit}' (FORMAT PARQUET)"
    )
    assert parquet_path.exists()
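
The three tests above repeat the same hand-written quoting for the token, the output path, and the bucket/table names. One way to centralize it is sketched below. The helper names (`sql_literal`, `sql_identifier`, `build_attach_sql`, `build_copy_sql`) are hypothetical and not part of the repository. The sketch only builds SQL strings in the same shape the tests already use and assumes nothing about the extension beyond that.

```python
def sql_literal(value: str) -> str:
    """Return value as a single-quoted SQL string literal, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def sql_identifier(name: str) -> str:
    """Return name as a double-quoted SQL identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_attach_sql(url: str, token: str, alias: str = "kbc") -> str:
    """Build the ATTACH statement in the shape the tests issue by hand."""
    # alias is assumed to be a plain trusted identifier (hypothetical default "kbc").
    return (
        f"ATTACH {sql_literal(url)} AS {alias} "
        f"(TYPE keboola, TOKEN {sql_literal(token)})"
    )


def build_copy_sql(select_sql: str, parquet_path: str) -> str:
    """Wrap a SELECT in COPY ... TO '<path>' (FORMAT PARQUET)."""
    return f"COPY ({select_sql}) TO {sql_literal(parquet_path)} (FORMAT PARQUET)"
```

With helpers like these, a test body could call `conn.execute(build_attach_sql(KBC_URL, KBC_TOKEN))` and build the table reference as `f"kbc.{sql_identifier(KBC_BUCKET)}.{sql_identifier(KBC_TABLE)}"`, which also covers bucket or table names containing a double quote.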