* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
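
A minimal sketch of the singleton-trigger behaviour described in the
last bullet, assuming a module-level threading.Lock. The names
trigger_sync and run_sync are hypothetical stand-ins, not the actual
handler in app/api/sync.py:

```python
import threading

# Hypothetical module-level lock. The commit only states that the trigger
# is singleton per process and that _run_sync acquires non-blocking.
_sync_lock = threading.Lock()


def trigger_sync(run_sync):
    """Return (status_code, payload); 409 when a sync is already running."""
    # acquire(blocking=False) returns immediately instead of queueing, so a
    # second caller never forks a competing extractor subprocess.
    if not _sync_lock.acquire(blocking=False):
        return 409, {"error": "sync_already_in_progress"}
    try:
        run_sync()
        return 200, {"status": "ok"}
    finally:
        _sync_lock.release()
```

A second call that arrives while the first still holds the lock gets
the 409 payload instead of waiting.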
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
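
The process-group cleanup above can be sketched roughly as follows.
This is a POSIX-only illustration under stated assumptions: the
function name run_in_own_group and its parameters are hypothetical,
and the real extractor launch code may differ.

```python
import os
import signal
import subprocess
import sys


def run_in_own_group(cmd, timeout, grace=10.0):
    """Run cmd in its own session; on timeout, SIGTERM then SIGKILL the group."""
    # start_new_session=True makes the child a session (and group) leader,
    # so its pid doubles as the process-group id for killpg().
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        pgid = proc.pid
        os.killpg(pgid, signal.SIGTERM)  # polite shutdown for the whole group
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            pass
        try:
            os.killpg(pgid, signal.SIGKILL)  # sweep anything still alive
        except ProcessLookupError:
            pass  # the group is already gone
        return proc.wait()
```

Signalling the group rather than only the child pid is what keeps pool
workers from being reparented to PID 1 with export jobs still open.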
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three Storage API
endpoints plus the signed-URL download:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
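
To make the bounded backoff concrete, here is a small sketch of the
sleep schedule. The 2 s start, 1.5 multiplier, and 10 s cap are the
defaults in storage_api.py; the generator name poll_intervals is
hypothetical.

```python
from itertools import islice


def poll_intervals(start=2.0, factor=1.5, cap=10.0):
    """Yield successive sleep durations for job polling, capped at `cap`."""
    interval = start
    while True:
        yield interval
        interval = min(interval * factor, cap)


print(list(islice(poll_intervals(), 6)))
# [2.0, 3.0, 4.5, 6.75, 10.0, 10.0]
```

With these defaults the cap is reached on the fifth sleep, after about
16 s of cumulative waiting.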
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
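
The effect of the migration can be illustrated with an in-memory
database. This sketch uses sqlite3 purely as a stand-in engine, and the
column list is a simplified assumption rather than the real
table_registry schema; only the UPDATE statement is taken from the
migration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical subset of the registry columns.
conn.execute(
    "CREATE TABLE table_registry "
    "(id TEXT, source_type TEXT, query_mode TEXT, source_query TEXT)"
)
conn.executemany(
    "INSERT INTO table_registry VALUES (?, ?, ?, ?)",
    [
        ("orders", "keboola", "local", None),
        ("events", "bigquery", "local", None),
        ("kbc_job", "keboola", "materialized", None),
    ],
)

# The v26 migration statement: only Keboola rows in local mode are touched.
conn.execute(
    "UPDATE table_registry SET query_mode='materialized' "
    "WHERE source_type='keboola' AND query_mode='local'"
)

rows = dict(conn.execute("SELECT id, query_mode FROM table_registry"))
print(rows)
```

The BigQuery row keeps its local mode, and source_query stays NULL on
the migrated Keboola row, which the new path reads as a full-table
export.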
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
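
The bucket reassembly in the admin.py bullet can be sketched as below,
assuming Keboola table ids have the three-part shape
<stage>.<bucket-id>.<table>. The helper name split_keboola_table_id is
hypothetical, and the real code prefers the API's bucket_id field where
one is available.

```python
from typing import Tuple


def split_keboola_table_id(table_id: str) -> Tuple[str, str]:
    """Split '<stage>.<bucket-id>.<table>' into (bucket, table)."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected <stage>.<bucket-id>.<table>, got {table_id!r}"
        )
    stage, bucket_id, table = parts
    # Keep the stage prefix: 'in.c-finance', not 'c-finance'.
    return f"{stage}.{bucket_id}", table


print(split_keboola_table_id("in.c-finance.orders"))
# ('in.c-finance', 'orders')
```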
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
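
A minimal sketch of the name-suffix-only check, with hypothetical
file-detail payloads trimmed to the two fields involved. The function
name looks_gzipped is illustrative:

```python
def looks_gzipped(file_info: dict) -> bool:
    """Name-suffix-only gzip detection; ignores isEncrypted entirely."""
    return str(file_info.get("name") or "").endswith(".gz")


# Hypothetical payloads, not captured API responses.
csv_export = {"name": "orders.csv.gz", "isEncrypted": True}
parquet_export = {"name": "orders.parquet", "isEncrypted": False}

print(looks_gzipped(csv_export))      # True
print(looks_gzipped(parquet_export))  # False
```

Under the old heuristic the second payload would have been treated as
gzipped because it is unencrypted.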
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
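
The two accepted body shapes can be normalised along these lines. This
is a sketch only: parse_trigger_body is a hypothetical name, ValueError
stands in for the 422 response, and the exact validation rules in
app/api/sync.py are assumptions.

```python
from typing import Any, List, Optional


def parse_trigger_body(body: Any) -> Optional[List[str]]:
    """Return the target table list, or None for 'sync everything'."""
    if body is None:
        return None
    if isinstance(body, dict):
        # Object form: {'tables': [...]}; a missing or null key means "all".
        body = body.get("tables")
        if body is None:
            return None
    if isinstance(body, list) and all(isinstance(t, str) and t for t in body):
        return body
    raise ValueError("expected a list of table ids or {'tables': [...]}")


print(parse_trigger_body(["kbc_job"]))              # ['kbc_job']
print(parse_trigger_body({"tables": ["kbc_job"]}))  # ['kbc_job']
print(parse_trigger_body(None))                     # None
```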
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
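
The targeted-trigger filter from the first bullet might look roughly
like this. The function name select_targets and the dict-shaped rows
are hypothetical; only the id-or-name match, the 'not_in_target'
reason, and the tables=None behaviour come from the description above.

```python
def select_targets(rows, tables=None):
    """Split due rows into (selected, skipped) by registry id or name."""
    if tables is None:
        # No target set: preserve the no-filter behaviour.
        return list(rows), []
    wanted = set(tables)
    selected, skipped = [], []
    for row in rows:
        if row["id"] in wanted or row["name"] in wanted:
            selected.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return selected, skipped


due = [
    {"id": "kbc_job", "name": "kbc_job"},
    {"id": "in_c-finance_orders", "name": "orders"},
]
print(select_targets(due, ["orders"]))
```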
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
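
The classification step can be sketched as follows. This is not the
body of _build_keboola_discovery_plan: the function name
classify_discovered, the dict row shape, and the rule used for
'invalid' are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def classify_discovered(
    table: Dict[str, str], registry: List[Dict[str, str]]
) -> Tuple[str, Optional[str]]:
    """Return (category, drift_kind) for one discovered table."""
    required = ("id", "name", "bucket", "source_table")
    if any(not table.get(key) for key in required):
        return "invalid", None
    by_id = {row["id"]: row for row in registry}
    existing = by_id.get(table["id"])
    if existing is not None:
        same_coords = (existing["bucket"], existing["source_table"]) == (
            table["bucket"],
            table["source_table"],
        )
        if same_coords:
            return "existing_match", None
        return "existing_drift", "same_id_diff_coords"
    # Different id, but the name matches an existing row (case-insensitive).
    names = {row["name"].lower() for row in registry}
    if table["name"].lower() in names:
        return "existing_drift", "name_collision"
    return "new", None


registry = [{"id": "kbc_job", "name": "kbc_job",
             "bucket": "in.c-kbc_telemetry", "source_table": "kbc_job"}]
# Illustrative discovery result; the bucket value here is an assumption.
discovered = {"id": "in_c-keboola-storage_kbc_job", "name": "kbc_job",
              "bucket": "in.c-keboola-storage", "source_table": "kbc_job"}
print(classify_discovered(discovered, registry))
# ('existing_drift', 'name_collision')
```

Only tables classified as 'new' would be written; everything else is
reported back to the operator.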
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Slices from a Snowflake UNLOAD of a wide
table are staged in per-call /tmp tempdirs, so multi-GiB / many-slice
exports fill the boot disk long before the dedicated data disk fills.
agnes-dev hit 100% boot-disk usage while the 20 GiB data disk had
15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour).
materialize_query (parquet path), _extract_via_legacy (CSV fallback),
and the sliced-CSV concat path in storage_api all use the helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
696 lines
30 KiB
Python
"""Lightweight Keboola Storage API client for table export.

The DuckDB Keboola extension was the originally-intended fast path, but on
projects with the `block-shared-snowflake-access` feature flag and on linked
buckets the per-session workspace can't see the bucket schemas at all
(keboola/duckdb-extension#17, fixed upstream in v0.1.6 but not yet in the
community CDN as of 2026-05-06). The `kbcstorage` SDK works but uses
`os.chdir(temp_dir)` to redirect slice downloads, which is process-global, so
threaded fan-out races on CWD and slice files land in the wrong directory.

This module talks to Storage API directly and downloads via signed URLs:
  - POST /v2/storage/tables/{id}/export-async
  - GET  /v2/storage/jobs/{id} (poll until success/error)
  - GET  /v2/storage/files/{id}?federationToken=1
  - GET  <signed_url> (single file or manifest + per-slice URLs for sliced)

No `os.chdir`, no boto3/azure-blob/google-cloud-storage SDK dependencies:
the federation-token detail response includes a signed URL that works for
all three cloud backends. Thread-safe: each call uses an independent
download path under a per-call temp directory.

Storage API reference:
  - https://keboola.docs.apiary.io/#reference/tables/asynchronous-table-export
  - https://keboola.docs.apiary.io/#reference/jobs
  - https://keboola.docs.apiary.io/#reference/files/manage-files/file-detail
"""
from __future__ import annotations

import gzip
import logging
import os
import shutil
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable, List, Optional

import requests

logger = logging.getLogger(__name__)

# Storage API guarantees export jobs are created small and finish in seconds
# to a few minutes for typical bucket-table sizes; the absolute upper bound
# (very large tables, peak Snowflake load) is the operator's
# storage.jobsParallelism + scan duration. 30 min is a generous ceiling that
# matches what the dashboard's data-preview UI would also wait for.
_DEFAULT_EXPORT_TIMEOUT_SEC = int(os.environ.get("AGNES_KEBOOLA_EXPORT_TIMEOUT_SEC", "1800"))
_DEFAULT_POLL_INTERVAL_SEC = float(os.environ.get("AGNES_KEBOOLA_POLL_INTERVAL_SEC", "2"))

# Per-slice HTTP download timeout, separate from the export-job timeout.
# Sliced exports return a manifest of signed URLs; an individual slice is
# bounded in size by Storage API's slicer (typically ~100 MiB), so a few
# minutes is plenty for one HTTP GET.
_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC = int(
    os.environ.get("AGNES_KEBOOLA_SLICE_TIMEOUT_SEC", "300")
)


def get_temp_root() -> Optional[str]:
    """Return the parent dir for per-call tempdirs, or None to use the
    system default.

    Reads ``AGNES_TEMP_DIR`` (compose env, single source of truth) and
    creates the dir if it does not yet exist. Default behaviour
    (``AGNES_TEMP_DIR`` unset) preserves the OSS pre-fix path:
    ``tempfile.TemporaryDirectory(...)`` falls back to the platform's
    `tmpdir` (typically ``/tmp``).

    The agnes-dev cutover surfaced why this knob matters: the
    container's ``/tmp`` lives on the boot disk's overlayfs (29 GiB
    on agnes-dev, shared with /var), so a multi-slice Snowflake
    UNLOAD of a wide table fills it long before the dedicated 20 GiB
    data disk at ``/data`` would. Setting ``AGNES_TEMP_DIR=/data/tmp``
    routes the staging dir to the data disk where the parquets are
    going anyway, no extra mount required (the data disk is already
    bind-mounted).
    """
    root = os.environ.get("AGNES_TEMP_DIR", "").strip()
    if not root:
        return None
    # Best-effort mkdir: if the directory can't be created, log a warning
    # and fall back to the system default (return None) instead of failing
    # the export. The warning keeps the fall-through to /tmp from being
    # silent.
    try:
        os.makedirs(root, exist_ok=True)
    except OSError as e:
        logger.warning(
            "AGNES_TEMP_DIR=%r not creatable (%s); tempfiles fall back "
            "to system default. Set the env to a writable path or unset "
            "to silence this warning.", root, e,
        )
        return None
    return root


FILE_TYPE_CSV = "csv"
FILE_TYPE_PARQUET = "parquet"
_VALID_FILE_TYPES = {FILE_TYPE_CSV, FILE_TYPE_PARQUET}


@dataclass
class ExportFilter:
    """Structured Keboola Storage API filter spec.

    Mirrors the BQ materialized path's `source_query` SQL string conceptually
    (both let the admin scope an extracted table), but Storage API takes a
    structured filter object rather than free-form SQL. Empty fields all
    map to "no filter" so a default-constructed ExportFilter exports the
    full table.

    Operators per Apiary docs: eq, ne, in, notIn, ge, gt, le, lt.

    `file_type` controls the format Storage API materializes into File
    Storage. `parquet` is the recommended path for the materialized sync:
    Keboola serves the parquet directly (UNLOADed from Snowflake), the
    extractor renames it into place, with no CSV intermediate, no DuckDB
    COPY, no peak-memory load. Falls back to CSV when an admin pins
    `{"file_type":"csv"}` in source_query (e.g. for projects whose
    backend can't UNLOAD parquet, or legacy debugging).
    """
    where_filters: List[dict] = field(default_factory=list)
    columns: List[str] = field(default_factory=list)
    changed_since: Optional[str] = None
    changed_until: Optional[str] = None
    limit: Optional[int] = None
    file_type: str = FILE_TYPE_CSV

    def __post_init__(self):
        if self.file_type not in _VALID_FILE_TYPES:
            raise ValueError(
                f"file_type must be one of {sorted(_VALID_FILE_TYPES)}, "
                f"got {self.file_type!r}"
            )

    @classmethod
    def from_dict(cls, data: Optional[dict]) -> "ExportFilter":
        """Parse from `table_registry.source_query` JSON. Tolerates None /
        empty / unknown keys (registry stores admin input that may be sparse)."""
        if not data:
            return cls()
        if not isinstance(data, dict):
            raise ValueError(
                f"ExportFilter.from_dict expects a dict, got {type(data).__name__}"
            )
        # Accept both `file_type` (preferred, matches the rest of the
        # snake_case API) and `fileType` (matches Storage API wire name)
        # so an admin who copies an example from Apiary docs doesn't trip.
        ft = data.get("file_type") or data.get("fileType") or FILE_TYPE_CSV
        return cls(
            where_filters=list(data.get("where_filters") or []),
            columns=list(data.get("columns") or []),
            changed_since=data.get("changed_since"),
            changed_until=data.get("changed_until"),
            limit=data.get("limit"),
            file_type=ft,
        )

    def to_export_params(self) -> dict:
        """Serialize for POST body of `/tables/{id}/export-async`.

        whereFilters arrives as a list of `{column, operator, values}` dicts;
        Storage API also accepts a single `whereColumn`/`whereOperator`/
        `whereValues` triple but the multi-filter form is more general.
        """
        params: dict = {}
        if self.where_filters:
            # Validate shape lightly: surface admin typos as ValueError
            # rather than letting them turn into a 400 from Keboola's API
            # without context.
            for i, f in enumerate(self.where_filters):
                if not isinstance(f, dict):
                    raise ValueError(f"where_filters[{i}] must be a dict")
                missing = {"column", "operator", "values"} - set(f.keys())
                if missing:
                    raise ValueError(
                        f"where_filters[{i}] missing fields: {sorted(missing)}"
                    )
                if not isinstance(f["values"], list):
                    raise ValueError(f"where_filters[{i}].values must be a list")
            params["whereFilters"] = self.where_filters
        if self.columns:
            params["columns"] = ",".join(self.columns)
        if self.changed_since:
            params["changedSince"] = self.changed_since
        if self.changed_until:
            params["changedUntil"] = self.changed_until
        if self.limit is not None:
            params["limit"] = int(self.limit)
        # Only emit fileType when non-default: keeps the request body
        # quiet for legacy callers that never knew about parquet, and
        # matches the wire-side default behaviour.
        if self.file_type and self.file_type != FILE_TYPE_CSV:
            params["fileType"] = self.file_type
        return params


class StorageApiError(RuntimeError):
    """Wraps a non-2xx Storage API response with the parsed body for context."""

    def __init__(self, message: str, status: Optional[int] = None, body: Any = None):
        super().__init__(message)
        self.status = status
        self.body = body


class KeboolaStorageClient:
    """Thread-safe Storage API client for table export.

    One instance can be reused across threads: `requests.Session` is
    thread-safe when the underlying `HTTPAdapter`'s pool size is sized for
    concurrent calls. Default `pool_connections=20, pool_maxsize=20`
    accommodates the typical AGNES_KEBOOLA_PARALLELISM=8 plus headroom.
    """

    def __init__(self, *, url: str, token: str, session: Optional[requests.Session] = None):
        if not url or not token:
            raise ValueError("KeboolaStorageClient requires url and token")
        # The DuckDB Keboola extension's ATTACH chokes on a trailing slash
        # (`https://connection.<region>.keboola.com/`); the Storage API
        # tolerates either form, but normalising here keeps URL composition
        # below predictable.
        self.base = url.rstrip("/") + "/v2/storage"
        self.token = token
        if session is None:
            session = requests.Session()
            adapter = requests.adapters.HTTPAdapter(
                pool_connections=20, pool_maxsize=20
            )
            session.mount("http://", adapter)
            session.mount("https://", adapter)
        self.session = session

    # ---- low-level HTTP helpers -------------------------------------------

    def _headers(self) -> dict:
        return {"X-StorageApi-Token": self.token, "Accept": "application/json"}

    def _get(self, path: str, **kwargs) -> dict:
        url = f"{self.base}{path}"
        resp = self.session.get(url, headers=self._headers(), timeout=30, **kwargs)
        return self._parse(resp, "GET", url)

    def _post(self, path: str, *, data: Optional[dict] = None) -> dict:
        url = f"{self.base}{path}"
        resp = self.session.post(
            url, headers=self._headers(), data=data, timeout=30
        )
        return self._parse(resp, "POST", url)

    def _parse(self, resp: requests.Response, method: str, url: str) -> dict:
        try:
            body = resp.json()
        except Exception:
            body = resp.text
        if resp.status_code >= 400:
            # Redact the token if it accidentally surfaces in an error body.
            # The Storage API doesn't echo it, but third-party proxies in
            # front of customer instances sometimes do.
            redacted = self._redact(body)
            raise StorageApiError(
                f"{method} {url} -> HTTP {resp.status_code}: {redacted}",
                status=resp.status_code,
                body=body,
            )
        if not isinstance(body, dict):
            raise StorageApiError(
                f"{method} {url} -> unexpected non-JSON response: {str(body)[:200]}",
                status=resp.status_code,
                body=body,
            )
        return body

    def _redact(self, body: Any) -> str:
        s = str(body)
        if self.token and self.token in s:
            s = s.replace(self.token, "<redacted-storage-token>")
        return s[:500]

    # ---- export-async + job polling ---------------------------------------

    def export_table_async(self, table_id: str, params: dict) -> dict:
        """POST /v2/storage/tables/{table_id}/export-async: kicks off the
        async export and returns the job resource. Caller polls `job.id`
        via `wait_for_job` to find the file id when status='success'."""
        return self._post(f"/tables/{table_id}/export-async", data=params)

    def wait_for_job(
        self,
        job_id: int,
        *,
        timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
        poll_interval: float = _DEFAULT_POLL_INTERVAL_SEC,
    ) -> dict:
        """Block until the async job reaches a terminal state. Returns the
        job dict on success; raises `StorageApiError` on failure or timeout.

        The poll interval starts small and backs off by 1.5x per poll up to
        10 s. With the default 2 s start, a sub-30 s job is covered by about
        six polls without flogging the API, while a 30-min job settles at a
        steady 10 s cadence after the first ~16 s.
        """
        deadline = time.monotonic() + timeout
        interval = poll_interval
        while time.monotonic() < deadline:
            job = self._get(f"/jobs/{job_id}")
            status = job.get("status")
            if status == "success":
                return job
            if status == "error":
                raise StorageApiError(
                    f"Storage API job {job_id} reported error: "
                    f"{job.get('error') or job}",
                    body=job,
                )
            time.sleep(interval)
            # Exponential backoff bounded at 10 s: a multi-minute Snowflake
            # scan does not benefit from sub-second polls. From the default
            # 2 s start, the 1.5 multiplier reaches 10 s after 4 sleeps
            # (~16 s wall-clock) and stays there.
            interval = min(interval * 1.5, 10.0)
        raise StorageApiError(
            f"Storage API job {job_id} did not finish within {timeout}s"
        )

    # ---- file detail + signed-URL download --------------------------------

    def file_detail(self, file_id: int) -> dict:
        """GET /v2/storage/files/{file_id}?federationToken=1: returns the
        file metadata plus a presigned URL (`url`) usable directly via HTTP
        without any cloud SDK. For sliced exports the `url` resolves to a
        manifest JSON listing the per-slice signed URLs."""
        return self._get(f"/files/{file_id}", params={"federationToken": 1})

    def download_file(self, file_info: dict, dest_path: Path) -> Path:
        """Download a Storage API file (single or sliced) to `dest_path`.

        Backend variants:
        - **AWS / Azure**: signed HTTPS URL in `file_info["url"]` (S3
          presigned / SAS). Sliced manifest entries are signed HTTPS too.
          Plain HTTP GET works.
        - **GCP**: `file_info["url"]` is a signed HTTPS URL for the
          single-file case. For sliced exports, the manifest at `url`
          lists per-slice paths as `gs://<bucket>/<key>` (NOT signed),
          which requires GCS authentication. We use the OAuth access token
          from `file_info["gcsCredentials"]["access_token"]` and hit the REST
          endpoint
          `https://storage.googleapis.com/storage/v1/b/<bucket>/o/<urlencoded_key>?alt=media`
          with `Authorization: Bearer <token>`. No google-cloud-storage
          SDK dependency.

        Single-file: stream the signed URL directly, gunzipping if the
        URL/name ends in `.gz`. Sliced: stream each slice into
        `dest_path` in order (slice 0 has the CSV header per Storage
        API contract, subsequent slices are header-less data).
        """
        url = file_info.get("url")
        if not url:
            raise StorageApiError(
                f"file detail missing 'url': {self._redact(file_info)}",
                body=file_info,
            )

        is_sliced = bool(file_info.get("isSliced"))
        # Gzip detection is name-based only. Snowflake UNLOAD adds the
        # `.gz` suffix when compression is requested (CSV exports), and
        # leaves it off otherwise (parquet has its own internal
        # compression and is served as plain `.parquet`). The previous
        # `isEncrypted is False` fallback gated on a property that's
        # orthogonal to compression: it would have flagged parquet
        # downloads as gzipped and corrupted them at gunzip time.
        is_gzipped = file_info.get("name", "").endswith(".gz")

        dest_path.parent.mkdir(parents=True, exist_ok=True)

        if is_sliced:
            # GCP sliced manifests carry `gs://` URIs that need an OAuth
            # bearer; AWS / Azure carry signed HTTPS URLs that work
            # without auth. The presence of `gcsCredentials` in the file
            # detail signals a GCP backend.
            gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
            self._download_sliced(url, dest_path, gcs_token=gcs_token)
        else:
            self._download_single(url, dest_path, gunzip_on_read=is_gzipped)
        return dest_path

    def _download_single(
        self,
        url: str,
        dest_path: Path,
        *,
        gunzip_on_read: bool,
        extra_headers: Optional[dict] = None,
    ) -> None:
        """Stream a single signed URL (or GCS REST URL with bearer token
        in `extra_headers`) into `dest_path`. Transparently gunzips if
        the file name suggests it's a `.gz`. Storage API serves through
        proxies that may rewrite Content-Encoding, so name-based
        detection is more reliable than the header in practice."""
        with self.session.get(
            url,
            stream=True,
            timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC,
            headers=extra_headers,
        ) as r:
            r.raise_for_status()
            tmp = dest_path.with_suffix(dest_path.suffix + ".part")
            try:
                with open(tmp, "wb") as fh:
                    for chunk in r.iter_content(chunk_size=64 * 1024):
                        if chunk:
                            fh.write(chunk)
                if gunzip_on_read:
                    self._gunzip_in_place(tmp, dest_path)
                    tmp.unlink(missing_ok=True)
                else:
                    tmp.replace(dest_path)
            finally:
                if tmp.exists():
                    tmp.unlink(missing_ok=True)

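The `.part` handling in `_download_single` can be tried without any network access. The following is a minimal sketch, assuming the chunks come from a plain iterable instead of `r.iter_content(...)`; `write_atomically` is a hypothetical helper name and is not part of the client.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def write_atomically(chunks: Iterable[bytes], dest_path: Path) -> None:
    """Write chunks to `<dest>.part`, then rename onto `dest_path`."""
    tmp = dest_path.with_suffix(dest_path.suffix + ".part")
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                if chunk:  # skip empty keep-alive chunks
                    fh.write(chunk)
        # Rename only after the file is fully written and closed.
        tmp.replace(dest_path)
    finally:
        # No-op on success (the rename moved it); cleans up on failure.
        tmp.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as d:
    dest = Path(d) / "table.csv"
    write_atomically([b"id,name\n", b"", b"1,a\n"], dest)
    written = dest.read_bytes()
    leftover = list(Path(d).glob("*.part"))
```

A reader that only ever looks at `dest_path` sees either the previous file or the complete new one, never a half-written download.
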
    @staticmethod
    def _gs_to_https(gs_url: str) -> str:
        """Rewrite `gs://<bucket>/<key>` to GCS JSON API media-download URL.

        The JSON API requires the object name URL-encoded as a single
        path segment (slashes inside the key are escaped). `alt=media`
        switches the response from object metadata JSON to the actual
        bytes, which matches what `bucket.blob(key).download_as_bytes()`
        does in the google-cloud-storage SDK.
        """
        from urllib.parse import quote
        if not gs_url.startswith("gs://"):
            raise ValueError(f"_gs_to_https expects gs://; got {gs_url!r}")
        path = gs_url[5:]  # strip "gs://"
        bucket, _, key = path.partition("/")
        if not bucket or not key:
            raise ValueError(f"malformed gs:// URL: {gs_url!r}")
        return (
            f"https://storage.googleapis.com/storage/v1/b/{bucket}"
            f"/o/{quote(key, safe='')}?alt=media"
        )

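To make the escaping rule concrete, here is a standalone sketch of the same `gs://` to JSON API mapping. The function name `gs_to_media_url` and the bucket and key are illustrative assumptions, not values from a real project.

```python
from urllib.parse import quote


def gs_to_media_url(gs_url: str) -> str:
    """Standalone copy of the gs:// -> media-download URL mapping."""
    if not gs_url.startswith("gs://"):
        raise ValueError(f"expected gs://; got {gs_url!r}")
    bucket, _, key = gs_url[len("gs://"):].partition("/")
    if not bucket or not key:
        raise ValueError(f"malformed gs:// URL: {gs_url!r}")
    # safe="" also escapes "/" so the key stays a single path segment.
    return (
        f"https://storage.googleapis.com/storage/v1/b/{bucket}"
        f"/o/{quote(key, safe='')}?alt=media"
    )


# Hypothetical bucket and key, chosen only to show the escaping.
example = gs_to_media_url("gs://demo-bucket/exports/2024/part 0.csv.gz")
print(example)
# -> https://storage.googleapis.com/storage/v1/b/demo-bucket/o/exports%2F2024%2Fpart%200.csv.gz?alt=media
```
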
    def _download_sliced(
        self, manifest_url: str, dest_path: Path, *, gcs_token: Optional[str] = None
    ) -> None:
        """Sliced exports: the file detail's `url` points at a JSON manifest
        whose `entries[].url` lists per-slice locations. Download each slice
        and concatenate into `dest_path`. The first slice contains the CSV
        header (Storage API guarantees stable header positioning).

        Per-slice URL forms:
          - signed HTTPS (S3 presigned, Azure SAS): plain GET works.
          - `gs://<bucket>/<key>` (GCP): requires `gcs_token` (OAuth bearer
            shipped in the file_detail's `gcsCredentials.access_token`).
            Mapped to `https://storage.googleapis.com/storage/v1/b/<bucket>/o/<encoded_key>?alt=media`.
        """
        m = self.session.get(
            manifest_url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC
        )
        m.raise_for_status()
        manifest = m.json()
        entries = manifest.get("entries") or []
        if not entries:
            raise StorageApiError(
                f"sliced manifest had no entries: {str(manifest)[:200]}",
                body=manifest,
            )

        with tempfile.TemporaryDirectory(
            prefix="kbc-slice-", dir=get_temp_root(), ignore_cleanup_errors=True,
        ) as tmpdir:
            slice_paths: List[Path] = []
            for i, entry in enumerate(entries):
                surl = entry.get("url")
                if not surl:
                    raise StorageApiError(
                        f"slice {i} missing 'url': {str(entry)[:200]}",
                        body=entry,
                    )
                sp = Path(tmpdir) / f"slice-{i:05d}"
                # GCP backend: rewrite gs:// to GCS REST + bearer auth.
                # The OAuth token comes from the file_detail's
                # `gcsCredentials.access_token` (passed as `gcs_token`
                # arg).
                if surl.startswith("gs://"):
                    if not gcs_token:
                        raise StorageApiError(
                            f"slice {i} URL is gs:// but no gcs_token "
                            "provided in file_detail.gcsCredentials"
                        )
                    surl = self._gs_to_https(surl)
                    extra_headers = {"Authorization": f"Bearer {gcs_token}"}
                else:
                    extra_headers = None
                # Slices may individually be gzipped. Same heuristic as
                # single-file: if the slice URL's path ends in `.gz`, gunzip
                # after download. Use a suffix check, not a substring check,
                # so a `.gz` elsewhere in the (possibly URL-encoded) key does
                # not trigger a gunzip of an uncompressed slice.
                gz = surl.split("?")[0].rsplit("/", 1)[-1].endswith(".gz")
                self._download_single(
                    surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
                )
                slice_paths.append(sp)

            # Concat. Sliced CSV exports include the header in slice 0 only
            # (Storage API contract); subsequent slices are header-less.
            with open(dest_path, "wb") as out:
                for sp in slice_paths:
                    with open(sp, "rb") as fh:
                        shutil.copyfileobj(fh, out, length=64 * 1024)

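The concat step relies on slice 0 carrying the only CSV header, so a plain byte-level append in manifest order yields a valid CSV. A minimal sketch of that step on local files follows; `concat_slices` and the sample rows are hypothetical, used only for illustration.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def concat_slices(slice_paths: List[Path], dest: Path) -> None:
    """Append slices byte-for-byte, in order, with bounded memory."""
    with open(dest, "wb") as out:
        for sp in slice_paths:
            with open(sp, "rb") as fh:
                shutil.copyfileobj(fh, out, length=64 * 1024)


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    # Slice 0 has the header; later slices are data-only.
    parts = [b"id,name\n1,a\n", b"2,b\n", b"3,c\n"]
    paths = []
    for i, data in enumerate(parts):
        p = root / f"slice-{i:05d}"
        p.write_bytes(data)
        paths.append(p)
    dest = root / "table.csv"
    concat_slices(paths, dest)
    merged = dest.read_bytes()
```

The same approach would not be safe for parquet, where each slice is a complete file with its own footer.
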
    @staticmethod
    def _gunzip_in_place(src: Path, dest: Path) -> None:
        with gzip.open(src, "rb") as gz, open(dest, "wb") as out:
            shutil.copyfileobj(gz, out, length=64 * 1024)

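As a quick check of the streaming gunzip, the sketch below round-trips a small payload through a gzip file using only the standard library. `gunzip_to` is a hypothetical standalone copy of the helper and the payload is made up.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_to(src: Path, dest: Path) -> None:
    """Decompress `src` into `dest` in 64 KiB chunks."""
    with gzip.open(src, "rb") as gz, open(dest, "wb") as out:
        shutil.copyfileobj(gz, out, length=64 * 1024)


with tempfile.TemporaryDirectory() as d:
    src = Path(d) / "slice-00000.part"
    dest = Path(d) / "slice-00000"
    payload = b"id,name\n1,a\n" * 1000
    with gzip.open(src, "wb") as fh:
        fh.write(payload)
    gunzip_to(src, dest)
    restored = dest.read_bytes()
```
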
    # ---- high-level: export-async + poll, returning file metadata ---------

    def prepare_export(
        self,
        table_id: str,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        """Run export-async + wait_for_job + file_detail and return the
        file metadata. Caller decides how to download (single vs
        sliced). This is needed for the parquet path where sliced output
        must be downloaded slice-by-slice and then DuckDB-merged (cat-style
        concat would corrupt the per-slice parquet footers).

        Returns:
            {"job_id": int, "file_id": int, "rows": int|None,
             "file_info": dict, "file_type": str}
        """
        f = export_filter or ExportFilter()
        params = f.to_export_params()
        job_resp = self.export_table_async(table_id, params)
        job_id = job_resp.get("id")
        if not job_id:
            raise StorageApiError(
                f"export-async response missing job id: {self._redact(job_resp)}",
                body=job_resp,
            )
        job = self.wait_for_job(job_id, timeout=export_timeout)
        results = job.get("results") or {}
        file_id = (results.get("file") or {}).get("id") or results.get("fileId")
        if not file_id:
            raise StorageApiError(
                f"job {job_id} succeeded but had no result file: "
                f"{self._redact(job)}",
                body=job,
            )
        file_info = self.file_detail(file_id)
        return {
            "job_id": int(job_id),
            "file_id": int(file_id),
            "rows": (results.get("totalRowsCount")
                     or results.get("rowsCount")
                     or job.get("totalRowsCount")),
            "file_info": file_info,
            "file_type": f.file_type,
        }

    def download_file_slices(
        self, file_info: dict, dest_dir: Path
    ) -> List[Path]:
        """Download a sliced Storage API export as separate per-slice
        files into ``dest_dir``. Returns the slice paths in manifest
        order. Use when the slices must be processed individually
        (e.g. parquet: each slice is a complete parquet file with its
        own footer; concatenation would invalidate it). For CSV where
        concat-with-header-only-on-first-slice is the right thing,
        ``download_file`` is the correct entry point.
        """
        url = file_info.get("url")
        if not url:
            raise StorageApiError(
                f"file detail missing 'url': {self._redact(file_info)}",
                body=file_info,
            )
        if not file_info.get("isSliced"):
            raise StorageApiError(
                "download_file_slices called on a non-sliced file_info; "
                "use download_file for the single-file case"
            )
        gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
        m = self.session.get(url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC)
        m.raise_for_status()
        manifest = m.json()
        entries = manifest.get("entries") or []
        if not entries:
            raise StorageApiError(
                f"sliced manifest had no entries: {str(manifest)[:200]}",
                body=manifest,
            )
        dest_dir.mkdir(parents=True, exist_ok=True)
        slice_paths: List[Path] = []
        for i, entry in enumerate(entries):
            surl = entry.get("url")
            if not surl:
                raise StorageApiError(
                    f"slice {i} missing 'url': {str(entry)[:200]}",
                    body=entry,
                )
            # Reuse the same gs:// rewrite + bearer + per-slice gz
            # heuristics used by the concat path.
            if surl.startswith("gs://"):
                if not gcs_token:
                    raise StorageApiError(
                        f"slice {i} URL is gs:// but no gcs_token "
                        "provided in file_detail.gcsCredentials"
                    )
                surl = self._gs_to_https(surl)
                extra_headers = {"Authorization": f"Bearer {gcs_token}"}
            else:
                extra_headers = None
            # Suffix check (not substring): only a trailing `.gz` on the
            # last path segment marks the slice as gzipped.
            gz = surl.split("?")[0].rsplit("/", 1)[-1].endswith(".gz")
            sp = dest_dir / f"slice-{i:05d}"
            self._download_single(
                surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
            )
            slice_paths.append(sp)
        return slice_paths

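The per-slice gzip heuristic is described as "path ends in `.gz`". The sketch below shows why a suffix test and a substring test can disagree once a GCS key has been URL-encoded into a single path segment. `slice_is_gzipped` and both URLs are illustrative assumptions.

```python
from urllib.parse import urlsplit


def slice_is_gzipped(url: str) -> bool:
    """True only when the last path segment ends in `.gz`."""
    # urlsplit drops the query string and does not decode %2F, so an
    # encoded GCS key stays one segment, as in the JSON API URL form.
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    return last_segment.endswith(".gz")


# Hypothetical signed URL for a gzipped CSV slice.
signed = "https://example.invalid/exports/123.csv_0_0_0.csv.gz?X-Sig=abc"
# Hypothetical GCS media URL: `.gz` appears in a folder name only.
gcs = (
    "https://storage.googleapis.com/storage/v1/b/demo-bucket"
    "/o/exports.gz%2Fpart-0.parquet?alt=media"
)

signed_result = slice_is_gzipped(signed)
gcs_result = slice_is_gzipped(gcs)
# A substring test would misclassify the parquet slice as gzipped.
substring_result = ".gz" in urlsplit(gcs).path.rsplit("/", 1)[-1]
```
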
    # ---- high-level: export to local file (csv or parquet) ----------------

    def export_table(
        self,
        table_id: str,
        dest_path: Path,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        """End-to-end: export-async → poll → download to ``dest_path``.

        ``export_filter.file_type`` controls the format Storage API
        materializes (``csv`` default, ``parquet`` when explicitly set).
        ``dest_path`` is the local file we write the bytes to; the caller
        decides the extension. The downloader streams chunks to disk so
        memory stays bounded regardless of file size.

        For CSV the sliced case is handled transparently: slices are
        concatenated into ``dest_path`` (header in slice 0 only). For
        **sliced parquet**, callers must use ``prepare_export`` +
        ``download_file_slices`` instead, because concatenating parquet
        slices invalidates the per-slice footer. ``export_table`` will raise
        StorageApiError if it sees a sliced parquet, to fail loud.

        Returns a small stats dict so callers can log / record provenance:
            {"job_id": int, "file_id": int, "rows": int|None, "bytes": int,
             "file_type": str}
        """
        prep = self.prepare_export(
            table_id, export_filter=export_filter, export_timeout=export_timeout,
        )
        file_info = prep["file_info"]
        if (
            prep["file_type"] == FILE_TYPE_PARQUET
            and file_info.get("isSliced")
        ):
            raise StorageApiError(
                f"sliced parquet export for {table_id}: use "
                f"prepare_export + download_file_slices and merge with "
                f"DuckDB COPY (concat would corrupt parquet footers)",
                body=file_info,
            )
        self.download_file(file_info, dest_path)
        size = dest_path.stat().st_size if dest_path.exists() else 0
        return {
            "job_id": prep["job_id"],
            "file_id": prep["file_id"],
            "rows": prep["rows"],
            "bytes": size,
            "file_type": prep["file_type"],
        }

    # Backwards-compat alias retained for external callers (e.g. ad-hoc
    # scripts) that imported the old name. The behavior matches calling
    # `export_table` with whatever `file_type` the export_filter carries.
    # The *_to_csv suffix is now imprecise (Storage API can also serve
    # parquet here), but renaming the import would force unrelated repos
    # to coordinate. Prefer `export_table` in new code.
    def export_table_to_csv(
        self,
        table_id: str,
        dest_csv: Path,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        return self.export_table(
            table_id,
            dest_csv,
            export_filter=export_filter,
            export_timeout=export_timeout,
        )