agnes-the-ai-analyst/tests/test_keboola_init_extract_skips.py
ZdenekSrotyr 85d3810535 feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst
Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors.

Backend (BigQuery + Keboola, schema v20):
  - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19)
  - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state
  - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc
  - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows
  - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query
  - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload
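The mode/source_query/bucket coherence check mentioned for `RegisterTableRequest` could look roughly like the sketch below. It is a minimal illustration, not the project's validator: the class `RegisterTableSketch` is hypothetical, and the exact rules (materialized requires `source_query` and forbids `bucket`; other modes forbid `source_query`) are assumptions inferred from the table configs used in the tests. It uses a plain dataclass so that it runs without pydantic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterTableSketch:
    """Hypothetical stand-in for RegisterTableRequest; the rules below are assumed."""

    name: str
    query_mode: str = "local"
    source_query: Optional[str] = None
    bucket: Optional[str] = None

    def __post_init__(self) -> None:
        if self.query_mode == "materialized":
            # Assumed rule: a materialized row is defined by its SELECT.
            if not self.source_query:
                raise ValueError("query_mode='materialized' requires source_query")
            # Assumed rule: a materialized row does not point at a bucket.
            if self.bucket:
                raise ValueError("query_mode='materialized' must not set bucket")
        elif self.source_query:
            # Assumed rule: source_query only makes sense in materialized mode.
            raise ValueError(
                "source_query is only valid with query_mode='materialized'"
            )
```

Rejecting incoherent combinations at registration time means the sync pass never has to guess which of `bucket` or `source_query` wins for a given row.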

Operator + CLI surface:
  - da admin register-table --query / --query-mode materialized
  - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators

Tests (incl. spike + integration + regression):
  - test_db_migration_v20, test_table_registry_source_query
  - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips
  - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds)
  - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized
  - test_api_admin_materialized, test_cli_admin_materialized
  - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e

Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.
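The dry-run cost guardrail described for the BigQuery `materialize_query` can be pictured as a simple byte-budget comparison. The sketch below is an assumption about its shape, not the actual implementation: `check_dry_run_cost` and `CostGuardrailError` are hypothetical names, and the estimated byte count is passed in rather than obtained from a real dry run. Only the 10 GiB default comes from the commit message.

```python
DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the default named in the commit message


class CostGuardrailError(RuntimeError):
    """Hypothetical exception raised when a query would scan too much data."""


def check_dry_run_cost(estimated_bytes: int, max_bytes: int = DEFAULT_MAX_BYTES) -> int:
    """Return estimated_bytes if it fits the budget, otherwise raise.

    estimated_bytes is assumed to come from a dry run of the admin's SELECT;
    whether the real guardrail uses > or >= is not stated in the source.
    """
    if estimated_bytes > max_bytes:
        raise CostGuardrailError(
            f"query would scan {estimated_bytes} bytes, limit is {max_bytes}"
        )
    return estimated_bytes
```

Failing before the query runs keeps an expensive SELECT from being billed at all, which is the point of estimating first.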
2026-05-01 20:25:56 +02:00

"""Verify the legacy Keboola download path skips materialized rows.

Materialized rows are handled by `_run_materialized_pass` in
`app/api/sync.py`, not by the legacy extractor. Mirror of the BQ
extractor's existing skip behavior at line 188.

The Keboola extractor entrypoint is `run()` (not `init_extract` like
the BQ extractor). Tests below observe the skip via caplog and the
stats dict (no parquet written, table not in tables_extracted/failed).
"""
from connectors.keboola import extractor as kbe


def test_run_skips_materialized_rows(tmp_path, caplog):
    """Given a registry with one materialized row, run() must NOT write a
    parquet for it and must NOT count it in tables_extracted/failed."""
    output_dir = tmp_path / "extracts" / "keboola"
    output_dir.mkdir(parents=True)

    table_configs = [
        {
            "id": "orders_recent",
            "name": "orders_recent",
            "source_query": "SELECT * FROM kbc.\"in.c-sales\".\"orders\" WHERE date > '2026-01-01'",
            "query_mode": "materialized",
        },
    ]

    with caplog.at_level("INFO"):
        # Use bogus URL/token — the extractor will fail to ATTACH the
        # extension (legacy client fallback also unavailable for the
        # materialized row, but materialized rows must be skipped before
        # any of that code runs).
        stats = kbe.run(
            str(output_dir), table_configs,
            keboola_url="https://invalid.example/",
            keboola_token="bogus",
        )

    # No parquet files for the materialized row.
    parquet = output_dir / "data" / "orders_recent.parquet"
    assert not parquet.exists(), "materialized row was written by legacy extractor"

    # The materialized row must NOT count as extracted or failed.
    assert stats["tables_extracted"] == 0
    assert stats["tables_failed"] == 0

    # Skip log line for ops visibility.
    assert "Skipping" in caplog.text and "materialized" in caplog.text


def test_run_processes_local_alongside_skipped_materialized(tmp_path, caplog):
    """Mixed registry: one local + one materialized. Materialized is
    skipped, local row attempts extraction (and fails because there's
    no real Keboola in this test, but that's a separate failure path —
    the materialized row must not appear in tables_failed)."""
    output_dir = tmp_path / "extracts" / "keboola"
    output_dir.mkdir(parents=True)

    table_configs = [
        {
            "id": "orders",
            "name": "orders",
            "bucket": "in.c-sales",
            "source_table": "orders",
            "query_mode": "local",
        },
        {
            "id": "orders_recent",
            "name": "orders_recent",
            "source_query": "SELECT 1",
            "query_mode": "materialized",
        },
    ]

    with caplog.at_level("INFO"):
        stats = kbe.run(
            str(output_dir), table_configs,
            keboola_url="https://invalid.example/",
            keboola_token="bogus",
        )

    # The materialized row must not be in failures (it was skipped, not failed).
    failed_names = {e["table"] for e in stats.get("errors", [])}
    assert "orders_recent" not in failed_names, (
        "materialized row appeared in failures — should have been skipped"
    )

    # Skip log line for ops visibility.
    assert "Skipping" in caplog.text and "materialized" in caplog.text
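The skip behavior these tests pin down can be sketched as a filter applied before any connection is attempted. This is an illustrative sketch only: `split_materialized` and the logger name are hypothetical, and the real `run()` may structure the check differently. The log wording is chosen so that it would satisfy the `"Skipping"` and `"materialized"` substring assertions above.

```python
import logging

logger = logging.getLogger("keboola_skip_sketch")  # hypothetical logger name


def split_materialized(table_configs):
    """Partition configs into (legacy, skipped) by query_mode.

    Rows with query_mode == "materialized" are logged and set aside, so the
    legacy download path never counts them as extracted or failed.
    """
    legacy, skipped = [], []
    for cfg in table_configs:
        if cfg.get("query_mode") == "materialized":
            logger.info(
                "Skipping %s: query_mode='materialized' is handled by the "
                "materialized pass",
                cfg.get("name"),
            )
            skipped.append(cfg)
        else:
            legacy.append(cfg)
    return legacy, skipped
```

Running the filter first is what lets the first test pass with a bogus URL and token: a registry containing only materialized rows leaves nothing for the legacy path to attach or download.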