* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three endpoints:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
580 lines
25 KiB
Python
580 lines
25 KiB
Python
"""Tests for Keboola extractor."""
|
|
|
|
import os
|
|
from pathlib import Path
|
|
from unittest.mock import patch, MagicMock
|
|
|
|
import duckdb
|
|
import pytest
|
|
|
|
from tests.helpers.contract import validate_extract_contract
|
|
|
|
|
|
@pytest.fixture
|
|
def output_dir(tmp_path):
|
|
d = tmp_path / "extracts" / "keboola"
|
|
d.mkdir(parents=True)
|
|
return str(d)
|
|
|
|
|
|
@pytest.fixture
|
|
def sample_configs():
|
|
return [
|
|
{
|
|
"id": "in.c-crm.orders",
|
|
"name": "orders",
|
|
"source_type": "keboola",
|
|
"bucket": "in.c-crm",
|
|
"source_table": "orders",
|
|
"query_mode": "local",
|
|
"description": "Order data",
|
|
},
|
|
{
|
|
"id": "in.c-crm.customers",
|
|
"name": "customers",
|
|
"source_type": "keboola",
|
|
"bucket": "in.c-crm",
|
|
"source_table": "customers",
|
|
"query_mode": "local",
|
|
"description": "Customer data",
|
|
},
|
|
]
|
|
|
|
|
|
def _mock_attach(conn, url, token):
|
|
"""Mock that says extension is available and ATTACHes a fake kbc catalog."""
|
|
# Create in-memory DB as kbc so views referencing kbc."bucket"."table" can be created
|
|
conn.execute("ATTACH ':memory:' AS kbc")
|
|
return True
|
|
|
|
|
|
def _write_parquet(pq_path, data_sql="SELECT 1 AS id, 'test' AS name"):
|
|
"""Helper to write a parquet file with given SQL."""
|
|
local_conn = duckdb.connect()
|
|
local_conn.execute(f"COPY ({data_sql}) TO '{pq_path}' (FORMAT PARQUET)")
|
|
local_conn.close()
|
|
|
|
|
|
class TestKeboolaExtractor:
|
|
def test_creates_extract_duckdb(self, output_dir, sample_configs):
|
|
"""Test that run() creates extract.duckdb with correct structure."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
def write_parquet(conn, tc, pq_path):
|
|
_write_parquet(pq_path)
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_parquet):
|
|
result = run(output_dir, sample_configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 2
|
|
assert result["tables_failed"] == 0
|
|
|
|
# Verify extract.duckdb exists and has correct structure
|
|
db_path = Path(output_dir) / "extract.duckdb"
|
|
assert db_path.exists()
|
|
|
|
conn = duckdb.connect(str(db_path))
|
|
try:
|
|
# Check _meta table
|
|
meta = conn.execute("SELECT * FROM _meta ORDER BY table_name").fetchall()
|
|
assert len(meta) == 2
|
|
names = {row[0] for row in meta}
|
|
assert names == {"orders", "customers"}
|
|
|
|
# Check all are 'local' query_mode
|
|
modes = {row[5] for row in meta}
|
|
assert modes == {"local"}
|
|
finally:
|
|
conn.close()
|
|
|
|
validate_extract_contract(str(db_path))
|
|
|
|
def test_remote_tables_not_downloaded(self, output_dir):
|
|
"""Test that tables with query_mode='remote' are registered but not downloaded."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{
|
|
"name": "big_table",
|
|
"bucket": "in.c-events",
|
|
"source_table": "big_table",
|
|
"query_mode": "remote",
|
|
"description": "Too large to sync",
|
|
}]
|
|
|
|
def mock_attach_with_schema(conn, url, token):
|
|
"""Mock kbc with the expected bucket schema so remote views can be created."""
|
|
conn.execute("ATTACH ':memory:' AS kbc")
|
|
conn.execute('CREATE SCHEMA kbc."in.c-events"')
|
|
conn.execute('CREATE TABLE kbc."in.c-events"."big_table" (id VARCHAR)')
|
|
return True
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=mock_attach_with_schema):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
|
|
conn = duckdb.connect(str(Path(output_dir) / "extract.duckdb"))
|
|
try:
|
|
meta = conn.execute("SELECT query_mode FROM _meta WHERE table_name='big_table'").fetchone()
|
|
assert meta[0] == "remote"
|
|
|
|
# _remote_attach table should exist with Keboola connection info
|
|
ra = conn.execute("SELECT alias, extension, url, token_env FROM _remote_attach").fetchone()
|
|
assert ra[0] == "kbc"
|
|
assert ra[1] == "keboola"
|
|
assert ra[2] == "https://example.com"
|
|
assert ra[3] == "KEBOOLA_STORAGE_TOKEN"
|
|
finally:
|
|
conn.close()
|
|
|
|
# No parquet file should exist
|
|
assert not (Path(output_dir) / "data" / "big_table.parquet").exists()
|
|
|
|
def test_handles_extraction_failure(self, output_dir, sample_configs, monkeypatch):
|
|
"""Test that a failed table doesn't stop other tables from extracting."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
call_count = 0
|
|
def side_effect(conn, tc, pq_path):
|
|
nonlocal call_count
|
|
call_count += 1
|
|
if call_count == 1:
|
|
raise Exception("Network error")
|
|
# Second call succeeds
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
# Mock the legacy fallback too — without it the real client
|
|
# attempts an HTTPS round-trip to the test URL and hangs ~minute.
|
|
# Force inline (PARALLELISM=1) so the mock survives — the parallel
|
|
# path would spawn a subprocess that doesn't see the patch.
|
|
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
|
|
|
|
def legacy_reraise(tc, pq_path, url, token):
|
|
raise Exception("Network error")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=side_effect), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=legacy_reraise):
|
|
result = run(output_dir, sample_configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 1
|
|
assert len(result["errors"]) == 1
|
|
|
|
def test_creates_data_directory(self, output_dir, sample_configs):
|
|
"""Test that data/ subdirectory is created."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
def write_pq(conn, tc, pq_path):
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_pq):
|
|
run(output_dir, sample_configs, "https://example.com", "test-token")
|
|
|
|
assert (Path(output_dir) / "data").is_dir()
|
|
assert (Path(output_dir) / "data" / "orders.parquet").exists()
|
|
|
|
def test_views_queryable(self, output_dir):
|
|
"""Test that views in extract.duckdb can be queried."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{"name": "test_table", "query_mode": "local", "description": "Test"}]
|
|
|
|
def write_pq(conn, tc, pq_path):
|
|
_write_parquet(pq_path, "SELECT 42 AS value, 'hello' AS msg")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_pq):
|
|
run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
conn = duckdb.connect(str(Path(output_dir) / "extract.duckdb"))
|
|
try:
|
|
result = conn.execute("SELECT value, msg FROM test_table").fetchone()
|
|
assert result[0] == 42
|
|
assert result[1] == "hello"
|
|
finally:
|
|
conn.close()
|
|
|
|
def test_meta_table_schema(self, output_dir):
|
|
"""Test that _meta table has all required columns."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{"name": "t", "query_mode": "local", "description": "desc"}]
|
|
|
|
def write_pq(conn, tc, pq_path):
|
|
_write_parquet(pq_path, "SELECT 1 AS x")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_pq):
|
|
run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
conn = duckdb.connect(str(Path(output_dir) / "extract.duckdb"))
|
|
try:
|
|
cols = conn.execute("SELECT column_name FROM information_schema.columns WHERE table_name='_meta' ORDER BY ordinal_position").fetchall()
|
|
col_names = [c[0] for c in cols]
|
|
assert col_names == ["table_name", "description", "rows", "size_bytes", "extracted_at", "query_mode"]
|
|
finally:
|
|
conn.close()
|
|
|
|
def test_legacy_fallback_when_extension_unavailable(self, output_dir):
|
|
"""Test that legacy client is used when extension attach fails."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{"name": "t", "id": "in.c-test.t", "query_mode": "local", "description": ""}]
|
|
|
|
def mock_legacy(tc, pq_path, url, token):
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
# Extension not available
|
|
with patch("connectors.keboola.extractor._try_attach_extension", return_value=False), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 0
|
|
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Connector failure mode tests
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
class TestKeboolaExtractorFailureModes:
|
|
"""Tests for Keboola extractor failure handling and resilience."""
|
|
|
|
def test_extractor_crash_does_not_corrupt_extract_duckdb(self, output_dir, sample_configs):
|
|
"""If the extractor crashes mid-extraction, the temp DB is not moved
|
|
into place, so the existing extract.duckdb (if any) is not corrupted.
|
|
The atomic write pattern (tmp + rename) protects against this."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
# First, create a valid extract.duckdb
|
|
def write_pq(conn, tc, pq_path):
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_pq):
|
|
run(output_dir, sample_configs[:1], "https://example.com", "test-token")
|
|
|
|
db_path = Path(output_dir) / "extract.duckdb"
|
|
assert db_path.exists()
|
|
# Verify it's valid
|
|
conn = duckdb.connect(str(db_path))
|
|
conn.execute("SELECT * FROM _meta").fetchall()
|
|
conn.close()
|
|
|
|
# Now simulate a crash during a second extraction — the extension
|
|
# attach raises an exception after the tmp file is created.
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=RuntimeError("crash")):
|
|
try:
|
|
run(output_dir, sample_configs, "https://example.com", "test-token")
|
|
except Exception:
|
|
pass # The extractor catches internally and returns stats
|
|
|
|
# The extract.duckdb should still exist and be valid (atomic swap
|
|
# means the old file is untouched if the new one didn't complete)
|
|
assert db_path.exists()
|
|
|
|
def test_partial_data_write_incomplete_parquet(self, output_dir):
|
|
"""When a parquet file write fails mid-stream, the extractor records
|
|
the table as failed in stats but continues with other tables."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [
|
|
{"name": "good_table", "query_mode": "local", "description": "OK"},
|
|
{"name": "bad_table", "query_mode": "local", "description": "Will fail"},
|
|
]
|
|
|
|
call_count = 0
|
|
def side_effect(conn, tc, pq_path):
|
|
nonlocal call_count
|
|
call_count += 1
|
|
if tc["name"] == "bad_table":
|
|
raise IOError("Disk full — partial write")
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), patch("connectors.keboola.extractor._extract_via_extension", side_effect=side_effect):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
# One table succeeded, one failed
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 1
|
|
assert len(result["errors"]) == 1
|
|
assert "bad_table" in result["errors"][0]["table"]
|
|
|
|
# The good table's parquet file exists
|
|
assert (Path(output_dir) / "data" / "good_table.parquet").exists()
|
|
# The bad table's parquet file should NOT exist (failed before write)
|
|
assert not (Path(output_dir) / "data" / "bad_table.parquet").exists()
|
|
|
|
def test_network_timeout_during_extraction(self, output_dir, monkeypatch):
|
|
"""Network timeout during extraction should return a meaningful error
|
|
in the stats, not crash the whole process."""
|
|
from connectors.keboola.extractor import run
|
|
import socket
|
|
|
|
configs = [
|
|
{"name": "timeout_table", "query_mode": "local", "description": "Will timeout"},
|
|
{"name": "ok_table", "query_mode": "local", "description": "OK"},
|
|
]
|
|
|
|
call_count = 0
|
|
def side_effect(conn, tc, pq_path):
|
|
nonlocal call_count
|
|
call_count += 1
|
|
if tc["name"] == "timeout_table":
|
|
raise socket.timeout("Connection timed out")
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
# When extension scan fails, the per-table flow now retries via
|
|
# _extract_via_legacy. Mock it to re-raise the same socket.timeout
|
|
# so we observe the final error surface; the contract under test is
|
|
# "extension failure doesn't crash, error makes it into stats, other
|
|
# tables continue", not which path produced the message.
|
|
# Force PARALLELISM=1 so the mock survives — the parallel path uses
|
|
# ProcessPoolExecutor which spawns subprocesses that don't see the
|
|
# mock.
|
|
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
|
|
|
|
def legacy_reraise(tc, pq_path, url, token):
|
|
raise socket.timeout("Connection timed out")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=side_effect), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=legacy_reraise):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 1
|
|
assert "timed out" in result["errors"][0]["error"].lower()
|
|
|
|
def test_extension_unavailable_fallback_to_client(self, output_dir):
|
|
"""When DuckDB Keboola extension fails to load, the extractor falls
|
|
back to the legacy HTTP client."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{"name": "t", "id": "in.c-test.t", "query_mode": "local",
|
|
"bucket": "in.c-test", "source_table": "t", "description": ""}]
|
|
|
|
def mock_legacy(tc, pq_path, url, token):
|
|
_write_parquet(pq_path, "SELECT 42 AS value")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", return_value=False), patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 0
|
|
|
|
# Verify the data is queryable
|
|
conn = duckdb.connect(str(Path(output_dir) / "extract.duckdb"))
|
|
val = conn.execute("SELECT value FROM t").fetchone()
|
|
assert val[0] == 42
|
|
conn.close()
|
|
|
|
def test_extension_per_table_failure_falls_back_to_legacy(self, output_dir):
|
|
"""When ATTACH succeeds but the per-table extension scan fails (e.g.
|
|
Keboola QueryService schema/role mismatch — keboola/duckdb-extension#17),
|
|
the extractor retries that table via the legacy Storage-API client."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [{"name": "t", "id": "in.c-test.t", "query_mode": "local",
|
|
"bucket": "in.c-test", "source_table": "t", "description": ""}]
|
|
|
|
def extension_scan_fails(conn, tc, pq_path):
|
|
raise RuntimeError(
|
|
"Keboola scan failed: Schema 'KBC_USE4_NNNN.\"in.c-test\"' "
|
|
"does not exist or not authorized."
|
|
)
|
|
|
|
def legacy_succeeds(tc, pq_path, url, token):
|
|
_write_parquet(pq_path, "SELECT 7 AS value")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=extension_scan_fails), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=legacy_succeeds):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 0
|
|
# Verify the legacy-produced data is queryable
|
|
conn = duckdb.connect(str(Path(output_dir) / "extract.duckdb"))
|
|
val = conn.execute("SELECT value FROM t").fetchone()
|
|
assert val[0] == 7
|
|
conn.close()
|
|
|
|
def test_all_tables_fail_returns_full_failure_stats(self, output_dir, monkeypatch):
|
|
"""When every table fails, the extractor returns all failures in stats
|
|
without crashing."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [
|
|
{"name": "t1", "query_mode": "local", "description": ""},
|
|
{"name": "t2", "query_mode": "local", "description": ""},
|
|
]
|
|
|
|
def always_fail(conn, tc, pq_path):
|
|
raise RuntimeError("Extraction failed")
|
|
|
|
# Mock legacy too — otherwise it would attempt a real HTTP call to
|
|
# the fake URL on each per-table fallback retry. Force inline mode
|
|
# (AGNES_KEBOOLA_PARALLELISM=1) so the mock survives — the parallel
|
|
# path uses ProcessPoolExecutor which spawns subprocesses that
|
|
# don't see the mock.
|
|
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
|
|
|
|
def legacy_also_fails(tc, pq_path, url, token):
|
|
raise RuntimeError("Extraction failed")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=always_fail), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=legacy_also_fails):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 0
|
|
assert result["tables_failed"] == 2
|
|
assert len(result["errors"]) == 2
|
|
|
|
def test_legacy_parallelism_one_runs_inline(self, output_dir, monkeypatch):
|
|
"""AGNES_KEBOOLA_PARALLELISM=1 keeps the legacy fallback inline —
|
|
no ProcessPoolExecutor, so unittest.mock patches survive. Useful
|
|
as a debugging escape hatch and the path used by tests below.
|
|
|
|
Why processes (not threads) for the parallel path: the legacy
|
|
client's `export_table` does `os.chdir(temp_dir)` to direct
|
|
kbcstorage's slice-file downloads into a per-call temp directory.
|
|
`os.chdir` is process-global, so two threads racing on it land
|
|
slice files in the wrong directory and the merge step fails with
|
|
`[Errno 2] No such file or directory`. Process workers each have
|
|
their own CWD and don't interfere."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [
|
|
{"name": f"u{i}", "query_mode": "local", "description": "",
|
|
"bucket": "in.c-test", "source_table": f"u{i}"}
|
|
for i in range(3)
|
|
]
|
|
|
|
call_count = 0
|
|
|
|
def mock_legacy(tc, pq_path, url, token):
|
|
nonlocal call_count
|
|
call_count += 1
|
|
_write_parquet(pq_path, "SELECT 1 AS x")
|
|
|
|
def extension_always_fails(conn, tc, pq_path):
|
|
raise RuntimeError("Schema not authorized")
|
|
|
|
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=extension_always_fails), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 3
|
|
assert result["tables_failed"] == 0
|
|
assert call_count == 3, "all 3 tables should have called the patched legacy fn"
|
|
|
|
def test_legacy_uses_process_pool_when_parallel(self, output_dir, monkeypatch):
|
|
"""Structural check: when len(legacy_queue) > 1 and
|
|
AGNES_KEBOOLA_PARALLELISM > 1, the parallel path imports
|
|
`concurrent.futures.ProcessPoolExecutor` to drain the queue.
|
|
|
|
The mock can't ride into subprocesses (mocks aren't picklable),
|
|
so we patch ProcessPoolExecutor itself and verify it's invoked
|
|
with the expected worker count."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [
|
|
{"name": f"t{i}", "query_mode": "local", "description": "",
|
|
"bucket": "in.c-test", "source_table": f"t{i}"}
|
|
for i in range(5)
|
|
]
|
|
|
|
def extension_always_fails(conn, tc, pq_path):
|
|
raise RuntimeError("Schema not authorized")
|
|
|
|
# Stand in for ProcessPoolExecutor — runs everything in-process
|
|
# so we can verify the call shape without dealing with pickling.
|
|
seen_max_workers = []
|
|
|
|
class _FakePool:
|
|
def __init__(self, max_workers):
|
|
seen_max_workers.append(max_workers)
|
|
|
|
def __enter__(self):
|
|
return self
|
|
|
|
def __exit__(self, *exc):
|
|
return False
|
|
|
|
def submit(self, fn, *args, **kwargs):
|
|
from concurrent.futures import Future
|
|
f: Future = Future()
|
|
try:
|
|
# Inline the legacy call so the parquet ends up on disk
|
|
# for the orchestrator's downstream stat + _meta logic.
|
|
f.set_result(fn(*args, **kwargs))
|
|
except Exception as e:
|
|
f.set_exception(e)
|
|
return f
|
|
|
|
def mock_legacy(tc, pq_path, url, token):
|
|
_write_parquet(pq_path, "SELECT 1 AS x")
|
|
|
|
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "4")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=extension_always_fails), \
|
|
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy), \
|
|
patch("concurrent.futures.ProcessPoolExecutor", _FakePool):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 5
|
|
assert result["tables_failed"] == 0
|
|
assert seen_max_workers == [4], (
|
|
f"Expected ProcessPoolExecutor(max_workers=4); got {seen_max_workers}"
|
|
)
|
|
|
|
def test_unsafe_identifier_skipped_not_crashed(self, output_dir):
|
|
"""Tables with unsafe identifiers are skipped with an error in stats,
|
|
not causing a crash."""
|
|
from connectors.keboola.extractor import run
|
|
|
|
configs = [
|
|
{"name": "bad-name", "query_mode": "local", "description": "hyphen not allowed"},
|
|
{"name": "good_name", "query_mode": "local", "description": "OK"},
|
|
]
|
|
|
|
def write_pq(conn, tc, pq_path):
|
|
_write_parquet(pq_path, "SELECT 1 AS id")
|
|
|
|
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
|
|
patch("connectors.keboola.extractor._extract_via_extension", side_effect=write_pq):
|
|
result = run(output_dir, configs, "https://example.com", "test-token")
|
|
|
|
assert result["tables_extracted"] == 1
|
|
assert result["tables_failed"] == 1
|
|
assert result["errors"][0]["error"] == "unsafe identifier"
|
|
|
|
def test_compute_exit_code_full_success(self):
|
|
from connectors.keboola.extractor import compute_exit_code
|
|
stats = {"tables_failed": 0, "errors": []}
|
|
assert compute_exit_code(stats, 5) == 0
|
|
|
|
def test_compute_exit_code_partial_failure(self):
|
|
from connectors.keboola.extractor import compute_exit_code
|
|
stats = {"tables_failed": 2, "errors": [{}, {}]}
|
|
assert compute_exit_code(stats, 5) == 2
|
|
|
|
def test_compute_exit_code_full_failure(self):
|
|
from connectors.keboola.extractor import compute_exit_code
|
|
stats = {"tables_failed": 5, "errors": [{}] * 5}
|
|
assert compute_exit_code(stats, 5) == 1
|
|
|
|
def test_compute_exit_code_no_tables(self):
|
|
from connectors.keboola.extractor import compute_exit_code
|
|
stats = {"tables_failed": 0, "errors": []}
|
|
assert compute_exit_code(stats, 0) == 0
|