* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag. With that flag set, the
DuckDB Keboola extension's per-table scan can't access bucket schemas,
so the legacy kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
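The singleton behaviour in the last item can be sketched with a non-blocking lock acquire. This is a minimal illustration under stated assumptions, not the code in app/api/sync.py: `trigger_sync` and the `(status, detail)` return tuple are hypothetical stand-ins, and only the 409 / sync_already_in_progress pairing comes from the text above.

```python
import threading

_sync_lock = threading.Lock()  # one lock per process


def trigger_sync(run):
    """Run `run()` unless a sync is already in flight.

    Returns (http_status, detail) as a stand-in for a real HTTP response.
    """
    # Non-blocking acquire: a second caller gets False immediately
    # instead of queueing behind the first one.
    if not _sync_lock.acquire(blocking=False):
        return 409, "sync_already_in_progress"
    try:
        run()
        return 200, "ok"
    finally:
        _sync_lock.release()
```

Rejecting the second caller instead of queueing it means no second extractor subprocess is forked while the first one holds the file lock.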
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
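The process-group cleanup described in the last item can be sketched as follows. This is a POSIX-only illustration, not the project's code: `run_with_group_timeout` and its parameters are hypothetical names, and the real extractor launch has more moving parts (inline script, SIGTERM handler).

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout, grace=10.0):
    """Run cmd in its own session; on timeout, signal the whole group."""
    # start_new_session=True makes the child a session and group leader,
    # so any workers it spawns share its process group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pgid = os.getpgid(proc.pid)
        os.killpg(pgid, signal.SIGTERM)  # polite request to every member
        try:
            return proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            os.killpg(pgid, signal.SIGKILL)  # stragglers are killed outright
            return proc.wait()
```

Signalling the group rather than only the child PID is what keeps pool workers from being reparented and left running.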
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three Storage API endpoints plus the signed-URL download:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
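As an illustration of "bounded exponential backoff on job polling", here is a minimal delay generator. The function name and the default values (initial, factor, cap) are assumptions for the sketch; the commit does not state the values the client actually uses.

```python
import itertools


def backoff_delays(initial=1.0, factor=2.0, cap=30.0):
    """Yield poll delays that grow geometrically and never exceed `cap`."""
    delay = initial
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# First six delays with a 10-second cap.
first_six = list(itertools.islice(backoff_delays(1.0, 2.0, 10.0), 6))
```

A polling loop would sleep for each yielded delay between GET /v2/storage/jobs/{id} calls until the job reaches a terminal status or an overall timeout expires.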
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
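The "atomic os.replace" step can be sketched like this: write to a temporary file next to the destination, then swap it into place. `atomic_write_bytes` is a hypothetical helper for illustration; the real extractor writes parquet produced by DuckDB rather than raw bytes.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(dest: Path, payload: bytes) -> None:
    """Write payload to a temp file beside dest, then swap it into place."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest, so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(payload)
        os.replace(tmp_name, dest)  # readers see the old or the new file, never a partial one
    except BaseException:
        # Remove the partial temp file; dest is left as it was.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```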
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Falls back to the
KEBOOLA_STACK_URL environment variable for the Keboola URL when
instance.yaml doesn't have stack_url configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
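The effect of the v26 migration statement can be tried out on a throwaway table. The sketch below uses an in-memory sqlite3 database purely as a stand-in (the project's registry store may differ), with a simplified table_registry schema and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_registry "
    "(id TEXT, source_type TEXT, query_mode TEXT, source_query TEXT)"
)
conn.executemany(
    "INSERT INTO table_registry VALUES (?, ?, ?, ?)",
    [
        ("orders", "keboola", "local", None),  # should be migrated
        ("events", "other", "local", None),    # non-Keboola row, untouched
        ("jobs", "keboola", "materialized", '{"file_type":"csv"}'),
    ],
)

# The migration statement quoted in the commit message.
conn.execute(
    "UPDATE table_registry SET query_mode='materialized' "
    "WHERE source_type='keboola' AND query_mode='local'"
)
rows = conn.execute(
    "SELECT id, query_mode, source_query FROM table_registry ORDER BY id"
).fetchall()
```

After the update, the migrated `orders` row keeps its NULL source_query, which the new path reads as a full-table export.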
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
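The gs:// rewrite in the last item can be sketched as below. This assumes the target is the Google Cloud Storage JSON API object download endpoint (`alt=media`); `gcs_download_url` and `bearer_headers` are hypothetical helper names, and the shape of the real gcsCredentials block is not shown here.

```python
from urllib.parse import quote, urlsplit


def gcs_download_url(gs_url: str) -> str:
    """Turn gs://bucket/object into a JSON API media-download URL."""
    parts = urlsplit(gs_url)
    object_name = parts.path.lstrip("/")
    if parts.scheme != "gs" or not parts.netloc or not object_name:
        raise ValueError(f"not a gs:// object URL: {gs_url!r}")
    # The object name must be percent-encoded as a single path segment,
    # so slashes inside it become %2F.
    return (
        "https://storage.googleapis.com/storage/v1/b/"
        f"{quote(parts.netloc, safe='')}/o/{quote(object_name, safe='')}?alt=media"
    )


def bearer_headers(access_token: str) -> dict:
    """Authorization header for the issued OAuth access token."""
    return {"Authorization": f"Bearer {access_token}"}
```

Object names containing `?` or `#` would need extra care, since `urlsplit` treats those characters as query and fragment delimiters.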
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
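A name-suffix-only gzip check, as described in the last paragraph, might look like the sketch below. `looks_gzipped` is a hypothetical name, and the case-insensitive comparison is an assumption rather than something the commit states.

```python
def looks_gzipped(file_info: dict) -> bool:
    """Decide from the file name alone; flags such as isEncrypted are ignored."""
    return str(file_info.get("name", "")).lower().endswith(".gz")
```

With this rule a plain `.parquet` download is written through untouched, while a `.csv.gz` download is still decompressed.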
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
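The two accepted trigger body shapes can be normalised in one place. The sketch below is an assumption-laden illustration, not the handler in app/api/sync.py: `normalize_trigger_body` is a hypothetical name, the ValueError stands in for the 422 response, and rejecting unknown keys is a choice made for the sketch.

```python
def normalize_trigger_body(body):
    """Return None for 'sync everything' or a list of table ids.

    Raises ValueError for shapes a server would answer with 422.
    """
    if body is None:
        return None
    if isinstance(body, dict):
        # Object form: {'tables': [...]}; a missing or null value means "all".
        if set(body) - {"tables"}:
            raise ValueError("unexpected keys in body")
        body = body.get("tables")
        if body is None:
            return None
    # Legacy bare-array form, or the unwrapped 'tables' value.
    if not isinstance(body, list) or not all(
        isinstance(t, str) and t for t in body
    ):
        raise ValueError("tables must be a list of non-empty strings")
    return body
```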
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
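The target filter in the first item can be sketched as a small partition over registry rows. `select_targets` and the dict-shaped rows are hypothetical; only the id-or-name match, the 'not_in_target' reason, and the tables=None behaviour come from the text above.

```python
def select_targets(rows, tables=None):
    """Split registry rows into (to_run, skipped) for a targeted trigger."""
    if tables is None:
        # No filter requested: every row stays eligible.
        return list(rows), []
    wanted = set(tables)
    to_run, skipped = [], []
    for row in rows:
        # Admins pass either the registry id or the table name.
        if row["id"] in wanted or row["name"] in wanted:
            to_run.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return to_run, skipped
```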
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
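The classification step can be sketched as follows. This is a simplified illustration, not _build_keboola_discovery_plan itself: `classify` and the dict-shaped rows are hypothetical, and the 'invalid' category is left out because its criteria are not spelled out above.

```python
def classify(discovered, registry):
    """Classify one discovered table against existing registry rows.

    Returns (category, drift_kind); drift_kind is None unless the
    category is 'existing_drift'.
    """
    by_id = {row["id"]: row for row in registry}
    existing = by_id.get(discovered["id"])
    if existing is not None:
        same_coords = (existing["bucket"], existing["source_table"]) == (
            discovered["bucket"],
            discovered["source_table"],
        )
        if same_coords:
            return "existing_match", None
        return "existing_drift", "same_id_diff_coords"
    # No id match: fall back to a case-insensitive name comparison.
    names = {row["name"].lower() for row in registry}
    if discovered["name"].lower() in names:
        return "existing_drift", "name_collision"
    return "new", None
```

Only rows classified as 'new' would be written; both drift kinds are reported back to the operator instead.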
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). The slices of a Snowflake UNLOAD of a wide
table are downloaded into per-call /tmp tempdirs, so multi-GiB,
many-slice exports fill the boot disk long before the dedicated data
disk fills. agnes-dev hit 100% boot-disk usage while the 20 GiB data
disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). materialize_query
(parquet path), _extract_via_legacy (CSV fallback), and the sliced-CSV
concat path in storage_api all use the helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
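The fallback rules for AGNES_TEMP_DIR can be sketched as below. This is an approximation of the described behaviour, not the code in connectors/keboola/storage_api.py: `get_temp_root_sketch` is a hypothetical name, and the explicit `os.access` writability check is an assumption.

```python
import logging
import os
import tempfile

log = logging.getLogger(__name__)


def get_temp_root_sketch():
    """Return a usable AGNES_TEMP_DIR path, or None for the system default."""
    target = os.environ.get("AGNES_TEMP_DIR", "").strip()
    if not target:
        return None  # unset or empty: keep the system tempdir
    try:
        os.makedirs(target, exist_ok=True)  # create on first use
    except OSError as exc:
        log.warning("AGNES_TEMP_DIR=%s not usable (%s); using system tempdir", target, exc)
        return None
    if not os.access(target, os.W_OK):
        log.warning("AGNES_TEMP_DIR=%s not writable; using system tempdir", target)
        return None
    return target
```

Because `tempfile.TemporaryDirectory(dir=None)` already means "system default", callers can pass the helper's result straight through as `dir=`.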
"""KeboolaStorageClient — direct Storage API export-async path.

Replaces the previous DuckDB-extension materialize path (extension scan
broken on linked-bucket projects, see keboola/duckdb-extension#17). Tests
mock the requests.Session at the adapter level so we exercise the real
HTTP shapes (status codes, JSON bodies) without touching the network.
"""
from __future__ import annotations

import gzip
import json
from io import BytesIO
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest
import requests

from connectors.keboola.storage_api import (
    FILE_TYPE_CSV,
    FILE_TYPE_PARQUET,
    ExportFilter,
    KeboolaStorageClient,
    StorageApiError,
    get_temp_root,
)


# ---- ExportFilter ----------------------------------------------------------

class TestExportFilter:
    def test_empty_dict_means_full_table(self):
        f = ExportFilter.from_dict({})
        assert f.to_export_params() == {}

    def test_none_means_full_table(self):
        f = ExportFilter.from_dict(None)
        assert f.to_export_params() == {}

    def test_where_filters_columns_changed_since(self):
        f = ExportFilter.from_dict({
            "where_filters": [
                {"column": "status", "operator": "eq", "values": ["open"]},
            ],
            "columns": ["id", "status"],
            "changed_since": "2026-04-01",
        })
        params = f.to_export_params()
        assert params["whereFilters"] == [
            {"column": "status", "operator": "eq", "values": ["open"]}
        ]
        # Storage API takes columns as comma-joined string, not array — the
        # `kbcstorage` SDK does the same join, so match its wire format.
        assert params["columns"] == "id,status"
        assert params["changedSince"] == "2026-04-01"

    def test_where_filter_missing_keys_raises_with_context(self):
        f = ExportFilter.from_dict({
            "where_filters": [{"column": "x", "operator": "eq"}],  # no values
        })
        with pytest.raises(ValueError, match=r"missing fields.*\['values'\]"):
            f.to_export_params()

    def test_where_filter_values_must_be_list(self):
        f = ExportFilter.from_dict({
            "where_filters": [{"column": "x", "operator": "eq", "values": "open"}],
        })
        with pytest.raises(ValueError, match="values must be a list"):
            f.to_export_params()

    def test_default_file_type_is_csv_and_omits_param(self):
        # Wire-side default is csv — preserve old behavior for callers
        # that never set file_type.
        assert ExportFilter().file_type == FILE_TYPE_CSV
        assert "fileType" not in ExportFilter().to_export_params()

    def test_file_type_parquet_emits_fileType_param(self):
        f = ExportFilter(file_type=FILE_TYPE_PARQUET)
        assert f.to_export_params()["fileType"] == "parquet"

    def test_from_dict_reads_file_type_snake_case(self):
        f = ExportFilter.from_dict({"file_type": "parquet"})
        assert f.file_type == "parquet"
        assert f.to_export_params()["fileType"] == "parquet"

    def test_from_dict_reads_fileType_camel_case_alias(self):
        # Operators copying examples from Apiary docs ship the wire name.
        f = ExportFilter.from_dict({"fileType": "parquet"})
        assert f.file_type == "parquet"

    def test_from_dict_invalid_file_type_raises(self):
        with pytest.raises(ValueError, match="file_type"):
            ExportFilter.from_dict({"file_type": "orc"})


# ---- HTTP client low-level -------------------------------------------------

def _mock_response(status, body):
    """Build a fake `requests.Response`-like object."""
    resp = MagicMock(spec=requests.Response)
    resp.status_code = status
    resp.json.return_value = body
    resp.text = json.dumps(body)
    return resp


class TestStorageClient:
    def test_init_normalises_trailing_slash(self):
        c = KeboolaStorageClient(url="https://kbc/", token="t")
        assert c.base.endswith("/v2/storage")
        assert "/" * 2 not in c.base.replace("https://", "")

    def test_init_rejects_missing_url_or_token(self):
        with pytest.raises(ValueError):
            KeboolaStorageClient(url="", token="t")
        with pytest.raises(ValueError):
            KeboolaStorageClient(url="https://kbc", token="")

    def test_post_sends_storage_api_token_header(self):
        sess = MagicMock()
        sess.post.return_value = _mock_response(200, {"id": 42})
        c = KeboolaStorageClient(url="https://kbc", token="abc", session=sess)

        c.export_table_async("in.c-x.t", {"columns": "a"})

        sess.post.assert_called_once()
        kwargs = sess.post.call_args.kwargs
        assert kwargs["headers"]["X-StorageApi-Token"] == "abc"

    def test_post_4xx_redacts_token_in_error_message(self):
        # If the API echoes the token (or a proxy injects it), we must not
        # leak it into raised exceptions.
        sess = MagicMock()
        sess.post.return_value = _mock_response(
            403, {"detail": "rejected token=secrettoken123"}
        )
        c = KeboolaStorageClient(url="https://kbc", token="secrettoken123", session=sess)

        with pytest.raises(StorageApiError) as e:
            c.export_table_async("in.c-x.t", {})

        assert "secrettoken123" not in str(e.value)
        assert "<redacted-storage-token>" in str(e.value)


# ---- wait_for_job ----------------------------------------------------------

class TestWaitForJob:
    def test_returns_on_success(self):
        sess = MagicMock()
        sess.get.return_value = _mock_response(200, {
            "id": 1, "status": "success", "results": {"file": {"id": 99}},
        })
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        job = c.wait_for_job(1, timeout=5, poll_interval=0.01)
        assert job["status"] == "success"

    def test_raises_on_error_status(self):
        sess = MagicMock()
        sess.get.return_value = _mock_response(200, {
            "id": 1, "status": "error", "error": {"message": "bad table"},
        })
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        with pytest.raises(StorageApiError, match="reported error"):
            c.wait_for_job(1, timeout=5, poll_interval=0.01)

    def test_polls_until_terminal(self):
        # First two responses 'waiting', third 'success'. The client must
        # keep polling instead of giving up.
        sess = MagicMock()
        sess.get.side_effect = [
            _mock_response(200, {"id": 1, "status": "waiting"}),
            _mock_response(200, {"id": 1, "status": "processing"}),
            _mock_response(200, {"id": 1, "status": "success", "results": {"file": {"id": 7}}}),
        ]
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        job = c.wait_for_job(1, timeout=5, poll_interval=0.01)
        assert job["status"] == "success"
        assert sess.get.call_count == 3

    def test_timeout_raises_with_job_id(self):
        sess = MagicMock()
        sess.get.return_value = _mock_response(200, {"id": 1, "status": "waiting"})
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        with pytest.raises(StorageApiError, match="did not finish"):
            c.wait_for_job(1, timeout=0.1, poll_interval=0.05)


# ---- download_file ---------------------------------------------------------

class TestDownloadFile:
    def test_single_file_csv_passthrough(self, tmp_path):
        sess = MagicMock()
        # File detail returns a signed URL for a non-sliced .csv; download
        # streams it directly.
        single_resp = MagicMock()
        single_resp.__enter__ = MagicMock(return_value=single_resp)
        single_resp.__exit__ = MagicMock(return_value=False)
        single_resp.iter_content.return_value = [b"col1,col2\n", b"a,1\n", b"b,2\n"]
        single_resp.raise_for_status = MagicMock()
        sess.get.return_value = single_resp

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        dest = tmp_path / "out.csv"
        c.download_file({
            "url": "https://signed/single.csv",
            "name": "single.csv",
            "isSliced": False,
        }, dest)

        assert dest.exists()
        assert dest.read_bytes() == b"col1,col2\na,1\nb,2\n"

    def test_single_file_gz_is_gunzipped(self, tmp_path):
        gzipped = BytesIO()
        with gzip.GzipFile(fileobj=gzipped, mode="wb") as gz:
            gz.write(b"col1,col2\nx,42\n")
        payload = gzipped.getvalue()

        sess = MagicMock()
        single_resp = MagicMock()
        single_resp.__enter__ = MagicMock(return_value=single_resp)
        single_resp.__exit__ = MagicMock(return_value=False)
        single_resp.iter_content.return_value = [payload]
        single_resp.raise_for_status = MagicMock()
        sess.get.return_value = single_resp

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        dest = tmp_path / "out.csv"
        c.download_file({
            "url": "https://signed/single.csv.gz",
            "name": "single.csv.gz",
            "isSliced": False,
        }, dest)

        assert dest.read_bytes() == b"col1,col2\nx,42\n"

    def test_sliced_concat_in_order(self, tmp_path):
        # isSliced=True: detail.url points at a JSON manifest of slice URLs.
        # Simulate two slices: slice 0 (header + rows), slice 1 (more rows,
        # NO header per Storage API contract). We just concatenate bytes —
        # the contract test is "every slice's bytes appear in dest, in order".
        sess = MagicMock()

        manifest_resp = MagicMock()
        manifest_resp.json.return_value = {
            "entries": [
                {"url": "https://signed/slice-0"},
                {"url": "https://signed/slice-1"},
            ]
        }
        manifest_resp.raise_for_status = MagicMock()

        slice0 = MagicMock()
        slice0.__enter__ = MagicMock(return_value=slice0)
        slice0.__exit__ = MagicMock(return_value=False)
        slice0.iter_content.return_value = [b"col\n", b"a\n"]
        slice0.raise_for_status = MagicMock()

        slice1 = MagicMock()
        slice1.__enter__ = MagicMock(return_value=slice1)
        slice1.__exit__ = MagicMock(return_value=False)
        slice1.iter_content.return_value = [b"b\n"]
        slice1.raise_for_status = MagicMock()

        sess.get.side_effect = [manifest_resp, slice0, slice1]

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        dest = tmp_path / "out.csv"
        c.download_file({
            "url": "https://signed/manifest.json",
            "name": "sliced",
            "isSliced": True,
        }, dest)

        assert dest.read_bytes() == b"col\na\nb\n"


# ---- end-to-end export_table_to_csv ---------------------------------------

class TestExportTableToCsv:
    def test_full_pipeline_calls_post_poll_detail_download(self, tmp_path):
        """Smoke: export-async → wait_for_job → file_detail → download.
        Mock the session at the boundary; assert the URL composition and
        order of operations match the contract. The actual bytes-written
        path is covered by TestDownloadFile."""
        sess = MagicMock()

        # 1) POST /tables/X/export-async → {id: 100}
        export_resp = _mock_response(200, {"id": 100})

        # 2) GET /jobs/100 → success with file id 200
        job_resp = _mock_response(200, {
            "id": 100,
            "status": "success",
            "results": {"file": {"id": 200}, "totalRowsCount": 5},
        })

        # 3) GET /files/200?federationToken=1 → single non-sliced URL
        file_resp = _mock_response(200, {
            "url": "https://signed/file.csv",
            "name": "file.csv",
            "isSliced": False,
        })

        # 4) GET https://signed/file.csv (download)
        download_resp = MagicMock()
        download_resp.__enter__ = MagicMock(return_value=download_resp)
        download_resp.__exit__ = MagicMock(return_value=False)
        download_resp.iter_content.return_value = [b"col\n1\n"]
        download_resp.raise_for_status = MagicMock()

        # session.get is called for: jobs (poll), file detail, download.
        # session.post for the export-async kickoff.
        sess.post.return_value = export_resp
        sess.get.side_effect = [job_resp, file_resp, download_resp]

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        dest = tmp_path / "out.csv"
        stats = c.export_table_to_csv(
            "in.c-x.t", dest,
            export_filter=ExportFilter(columns=["col"]),
        )

        assert dest.read_bytes() == b"col\n1\n"
        assert stats["job_id"] == 100
        assert stats["file_id"] == 200
        assert stats["rows"] == 5
        assert stats["bytes"] == len(b"col\n1\n")

        # Assert export-async POST URL composition + body shape
        post_url = sess.post.call_args.args[0]
        assert post_url == "https://kbc/v2/storage/tables/in.c-x.t/export-async"
        post_body = sess.post.call_args.kwargs["data"]
        assert post_body["columns"] == "col"

    def test_missing_job_id_in_response_is_typed_error(self):
        sess = MagicMock()
        sess.post.return_value = _mock_response(200, {})  # no `id`
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        with pytest.raises(StorageApiError, match="missing job id"):
            c.export_table_to_csv("in.c-x.t", Path("/tmp/x"))

    def test_missing_file_in_job_results_is_typed_error(self, tmp_path):
        sess = MagicMock()
        sess.post.return_value = _mock_response(200, {"id": 1})
        sess.get.return_value = _mock_response(200, {
            "id": 1, "status": "success", "results": {},  # no `file`
        })
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        with pytest.raises(StorageApiError, match="no result file"):
            c.export_table_to_csv("in.c-x.t", tmp_path / "x")


# ---- prepare_export + download_file_slices (parquet path) ------------------

class TestParquetPath:
    def test_parquet_request_emits_fileType_in_post_body(self, tmp_path):
        sess = MagicMock()
        sess.post.return_value = _mock_response(200, {"id": 100})
        sess.get.side_effect = [
            _mock_response(200, {
                "id": 100, "status": "success",
                "results": {"file": {"id": 200}, "totalRowsCount": 3},
            }),
            _mock_response(200, {
                "id": 200, "url": "https://signed/x.parquet",
                "name": "x.parquet", "isSliced": False,
            }),
        ]
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        prep = c.prepare_export(
            "in.c-x.t",
            export_filter=ExportFilter(file_type=FILE_TYPE_PARQUET),
        )

        assert prep["file_type"] == "parquet"
        assert prep["file_info"]["isSliced"] is False
        assert sess.post.call_args.kwargs["data"]["fileType"] == "parquet"

    def test_export_table_rejects_sliced_parquet(self, tmp_path):
        """Concatenating sliced parquet would corrupt per-slice footers.
        ``export_table`` must fail loud and direct callers at
        ``download_file_slices``."""
        sess = MagicMock()
        sess.post.return_value = _mock_response(200, {"id": 1})
        sess.get.side_effect = [
            _mock_response(200, {
                "id": 1, "status": "success",
                "results": {"file": {"id": 2}},
            }),
            _mock_response(200, {
                "id": 2, "url": "https://signed/manifest.json",
                "name": "x.parquet", "isSliced": True,
            }),
        ]
        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)

        with pytest.raises(StorageApiError, match="sliced parquet"):
            c.export_table(
                "in.c-x.t", tmp_path / "x.parquet",
                export_filter=ExportFilter(file_type=FILE_TYPE_PARQUET),
            )

    def test_download_file_slices_returns_per_slice_paths(self, tmp_path):
        sess = MagicMock()

        manifest_resp = MagicMock()
        manifest_resp.json.return_value = {
            "entries": [
                {"url": "https://signed/slice-0"},
                {"url": "https://signed/slice-1"},
            ],
        }
        manifest_resp.raise_for_status = MagicMock()

        def mk_chunk_resp(payload: bytes):
            r = MagicMock()
            r.__enter__ = MagicMock(return_value=r)
            r.__exit__ = MagicMock(return_value=False)
            r.iter_content.return_value = [payload]
            r.raise_for_status = MagicMock()
            return r

        slice0 = mk_chunk_resp(b"PAR1...slice0...")
        slice1 = mk_chunk_resp(b"PAR1...slice1...")
        sess.get.side_effect = [manifest_resp, slice0, slice1]

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        paths = c.download_file_slices(
            {"url": "https://signed/manifest.json", "isSliced": True,
             "name": "x.parquet"},
            tmp_path / "slices",
        )

        assert len(paths) == 2
        assert paths[0].read_bytes() == b"PAR1...slice0..."
        assert paths[1].read_bytes() == b"PAR1...slice1..."
        # Naming preserves manifest order — required for deterministic
        # downstream merge.
        assert paths[0].name < paths[1].name

    def test_download_file_slices_refuses_non_sliced(self):
        c = KeboolaStorageClient(url="https://kbc", token="t",
                                 session=MagicMock())
        with pytest.raises(StorageApiError, match="non-sliced"):
            c.download_file_slices(
                {"url": "https://x", "isSliced": False}, Path("/tmp/x"),
            )

    def test_get_temp_root_unset_returns_none(self, monkeypatch):
        """No env var → None → tempfile falls back to system default
        (typically /tmp). Preserves OSS-pre-fix behaviour for users
        who haven't set AGNES_TEMP_DIR."""
        monkeypatch.delenv("AGNES_TEMP_DIR", raising=False)
        assert get_temp_root() is None

    def test_get_temp_root_creates_dir_when_missing(self, monkeypatch, tmp_path):
        """First-time use: target dir doesn't yet exist; helper mkdirs
        it (non-recursive parents handled by exist_ok). Returns the
        absolute path so tempfile uses it as the parent for staging."""
        target = tmp_path / "agnes-tmp-fresh"
        assert not target.exists()
        monkeypatch.setenv("AGNES_TEMP_DIR", str(target))
        assert get_temp_root() == str(target)
        assert target.is_dir()

    def test_get_temp_root_existing_dir_reused(self, monkeypatch, tmp_path):
        target = tmp_path / "agnes-tmp-existing"
        target.mkdir()
        monkeypatch.setenv("AGNES_TEMP_DIR", str(target))
        assert get_temp_root() == str(target)

    def test_get_temp_root_unwritable_falls_back(self, monkeypatch, tmp_path, caplog):
        """Sandboxes / read-only mounts make the target uncreatable; the
        helper logs a warning and returns None so tempfile falls back
        to the system default rather than blowing up the sync run."""
        # Point at a path under a read-only parent that doesn't exist.
        unwritable = "/nonexistent/forbidden/agnes-tmp"
        monkeypatch.setenv("AGNES_TEMP_DIR", unwritable)
        with caplog.at_level("WARNING"):
            assert get_temp_root() is None
        assert any("AGNES_TEMP_DIR" in r.message for r in caplog.records)

    def test_get_temp_root_empty_string_treated_as_unset(self, monkeypatch):
        # Operator who left ``AGNES_TEMP_DIR=`` (empty) in .env doesn't
        # get an mkdir of "" — same as unset.
        monkeypatch.setenv("AGNES_TEMP_DIR", "")
        assert get_temp_root() is None

    def test_parquet_download_does_not_gunzip_plain_parquet(self, tmp_path):
        """Regression: previous heuristic flagged any unencrypted file as
        gzipped, which would corrupt parquet downloads at gunzip time.
        Verify a `.parquet` file is written through unmodified."""
        sess = MagicMock()
        single_resp = MagicMock()
        single_resp.__enter__ = MagicMock(return_value=single_resp)
        single_resp.__exit__ = MagicMock(return_value=False)
        # Real parquet magic bytes — not valid gzip, would crash gunzip.
        single_resp.iter_content.return_value = [b"PAR1\x00\x00\x00binary"]
        single_resp.raise_for_status = MagicMock()
        sess.get.return_value = single_resp

        c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
        dest = tmp_path / "out.parquet"
        c.download_file({
            "url": "https://signed/x.parquet",
            "name": "x.parquet",
            "isSliced": False,
            "isEncrypted": False,
        }, dest)

        assert dest.read_bytes() == b"PAR1\x00\x00\x00binary"