agnes-the-ai-analyst/connectors/keboola/storage_api.py
ZdenekSrotyr 28430ced09
Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190)
* fix: cutover regressions + parallel Keboola legacy fallback

Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.

CUTOVER REGRESSIONS

- agnes pull hash mismatch on every Keboola local-mode table —
  src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
  while the CLI compares against full 32-char content MD5. Now stores
  the same content MD5 the materialized SQL path already used.

- Trailing-slash sanitization in connectors/keboola/access.py and
  extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
  ends in / (canonical form).

- src/profiler.py:TableInfo.description becomes optional — two call
  sites instantiated without it, crashing the profiler pass.

- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
  ran as root, current runs as agnes (uid 999). Reads target uid:gid
  from /etc/passwd inside the new image and chowns ${STATE_DIR},
  /data/extracts, /data/analytics when the digest moves.

- POST /api/sync/trigger is now singleton per process — two
  near-simultaneous trigger calls each forked an extractor subprocess,
  fought for extract.duckdb's file lock, starved uvicorn, flipped the
  container to unhealthy. Trigger now returns 409
  (sync_already_in_progress) when held; _run_sync acquires non-blocking.

PARALLEL LEGACY FALLBACK

- Process pool fan-out for the _extract_via_legacy queue (default 8
  workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
  thread pool, because connectors/keboola/client.py:export_table does
  os.chdir(temp_dir) — process-global, so threads raced and slice files
  landed in the wrong directory ("[Errno 2] No such file or directory:
  '<job_id>.csv_X_Y_Z.csv'").

- Extractor subprocess timeout 1800s -> 3600s (configurable via
  AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
  jobs need the headroom on telemetry-class projects.

- Process group cleanup on timeout — Popen(start_new_session=True) puts
  the extractor in its own group. On timeout the parent SIGTERMs the
  group (10s grace) then SIGKILLs stragglers. Without this, the pool
  workers were reparented to PID 1 and continued holding open Keboola
  Storage export jobs. Inline extractor script also installs a SIGTERM
  -> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
  __exit__ runs cleanly.
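
The process-group cleanup described above can be sketched as follows. This is a POSIX-only illustration, assuming a hypothetical helper name `run_with_group_timeout`; the real orchestration code is not shown in this commit message and may differ.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout: float, grace: float = 10.0) -> int:
    """Run cmd in its own session; on timeout, SIGTERM then SIGKILL the group."""
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process-group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            pass
        # Kill anything left in the group that ignored SIGTERM.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the whole group already exited
        return proc.wait()
```

Signalling the group rather than the single pid is what reaches pool workers that would otherwise be reparented and keep running.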

Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).

Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).

* feat(keboola): Storage API direct extract path; drop extension data path

The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.

connectors/keboola/storage_api.py (NEW)
  Lightweight client built on requests.Session. Three Storage API
  endpoints plus the signed-URL download:
  - POST /v2/storage/tables/{id}/export-async        (kicks off job)
  - GET  /v2/storage/jobs/{id}                        (poll until done)
  - GET  /v2/storage/files/{id}?federationToken=1     (signed URL detail)
  - GET  <signed_url>                                 (download bytes)
  Supports sliced exports (manifest + per-slice signed URLs) and gzipped
  payloads. ExportFilter dataclass mirrors the Keboola filter spec
  (whereFilters / columns / changedSince / limit) and handles JSON
  round-trip with the registry's source_query column. Token redaction
  in error messages. Bounded exponential backoff on job polling.
  No cloud-SDK dependency on the data path; thread-safe.
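
The bounded exponential backoff mentioned above can be sketched as a generator of sleep durations. The defaults (2 s start, 1.5 multiplier, 10 s cap) mirror the values in storage_api.py; the function name `poll_intervals` is hypothetical.

```python
import itertools
from typing import Iterator


def poll_intervals(
    start: float = 2.0, factor: float = 1.5, cap: float = 10.0
) -> Iterator[float]:
    """Yield sleep durations: start, start*factor, ... clamped at cap."""
    interval = start
    while True:
        yield interval
        interval = min(interval * factor, cap)


# First six sleeps with the defaults.
first_six = list(itertools.islice(poll_intervals(), 6))
# -> [2.0, 3.0, 4.5, 6.75, 10.0, 10.0]
```

With these defaults the cap is reached on the fifth sleep, after 16.25 s of cumulative waiting.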

connectors/keboola/extractor.py
  - materialize_query() rewritten: takes bucket/source_table/source_query
    (JSON filter spec), exports via KeboolaStorageClient, converts CSV
    to parquet via DuckDB, atomic os.replace. Same return shape so
    sync.py downstream code stays uniform with the BQ branch.
  - _extract_via_legacy() also moved to Storage API direct (kept the
    name for caller compatibility with _legacy_worker / the parallel
    batch extractor). Per-call temp directories — no os.chdir, threads
    don't race.

app/api/sync.py
  _run_materialized_pass for source_type='keboola' rows now constructs a
  KeboolaStorageClient (replaces KeboolaAccess) and passes
  bucket/source_table/source_query to materialize_query. Reuses one
  client across rows for HTTP keep-alive. Sources keboola URL from env
  too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
  configured.

cli/commands/admin.py
  discover-and-register defaults Keboola rows to query_mode='materialized'
  (NULL source_query = full table), matching the v26 migration's
  unification of the local/materialized split for Keboola. BigQuery and
  Jira keep their per-source defaults.

src/db.py
  Schema bump 25 → 26. Migration: UPDATE table_registry SET
  query_mode='materialized' WHERE source_type='keboola' AND
  query_mode='local'. NULL source_query on those rows means "full table
  export" — same effective behavior the local mode provided, but now
  via Storage API instead of the extension.

pyproject.toml
  kbcstorage dep stays (admin-side bucket/table list still uses the
  SDK in app/api/admin.py / connectors/keboola/client.py); only the
  data path is migrated off the SDK. Comment updated to reflect the
  new boundary.

tests
  - test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
    HTTP client (token redaction, retry logic, polling), download_file
    (single, gzipped, sliced), end-to-end export_table_to_csv.
  - test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
    instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
    contracts.
  - test_sync_trigger_keboola_materialized.py: registry rows now carry
    bucket+source_table+JSON-shape source_query.

114+ Keboola-impacted tests green locally.

* test: schema version assertion bumped to 26 alongside the keboola query_mode migration

* fix(keboola): cutover hot-patches surfaced on agnes-dev

Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.

- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
  source, any mode), not just rows where source matches and query_mode
  is local. After the v25→v26 keboola materialized migration an
  instance can have 30 materialized rows and zero local rows; the
  previous gate kept re-firing _discover_and_register_tables every
  scheduler tick, creating duplicate auto-discovered rows with the
  wrong bucket prefix every time.

- app/api/admin.py: _discover_and_register_tables reassembles the
  bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
  dropping the stage prefix; default query_mode for keboola is now
  materialized (the v26 contract); validator allows NULL source_query
  for keboola materialized rows (full-table export via Storage API
  export-async, no SQL needed).

- cli/commands/admin.py: register-table mirrors the server validator
  (NULL source_query allowed for source_type=keboola); --bucket help
  text generalized to cover both BQ dataset and Keboola bucket id.

- connectors/keboola/extractor.py: max_line_size=64 MiB on
  read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
  in particular) do not trip the default 2 MiB ceiling.

- connectors/keboola/storage_api.py: GCP backend support — when the
  Storage API returns a manifest whose slice URLs are gs://
  references with a gcsCredentials block, rewrite to the JSON REST
  download endpoint and authenticate with the issued OAuth bearer
  token; redact tokens in any surfaced error string.
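
The bucket reassembly from the admin.py bullet can be sketched as below, assuming Keboola table ids of the form `<stage>.<bucket-id>.<table>`. `split_table_id` and the example table name `orders` are hypothetical; the real code prefers the API's bucket_id field where available.

```python
from typing import Tuple


def split_table_id(table_id: str) -> Tuple[str, str]:
    """Split '<stage>.<bucket-id>.<table>' into ('<stage>.<bucket-id>', '<table>')."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <stage>.<bucket-id>.<table>, got {table_id!r}")
    stage, bucket, table = parts
    # Keep the stage prefix: 'in.c-finance', not 'c-finance'.
    return f"{stage}.{bucket}", table
```

For example, `split_table_id("in.c-finance.orders")` yields the bucket `in.c-finance` and the table `orders`.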

* test: align with new keboola materialized + auto-discover-gate contracts

- test_admin_keboola_materialized: rename
  test_register_keboola_materialized_rejects_missing_source_query →
  test_register_keboola_materialized_accepts_missing_source_query.
  v25→v26 introduced 'keboola materialized with NULL source_query
  means full-table export via Storage API export-async' as the
  default registration shape; the rejection case is no longer the
  contract.

- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
  gate in _run_sync now keys off the WHOLE registry (not just local
  rows) so materialized-only Keboola instances do not re-trigger
  discovery on every tick.

* feat(keboola): native parquet export — skip CSV roundtrip

Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.

Two cases:

- **Single-file** export (small table): file_info.url points at one
  signed URL; download_file streams chunks straight to .parquet.tmp
  and we're done. No DuckDB.

- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
  default — so anything larger arrives as N parquet slices): each
  slice is a complete parquet file with its own footer; naive concat
  would corrupt them. download_file_slices keeps the slices as
  separate files in a tempdir, then DuckDB COPY (SELECT * FROM
  read_parquet([slice0, slice1, ...])) merges them into one
  consolidated parquet. DuckDB streams row groups during this — peak
  memory bounded to one row group (~1 MiB) regardless of source size.

The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.

Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
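
The name-suffix-only rule can be sketched as a finalize step that runs after the bytes are on disk. This is an illustration only: `finalize_download` is a hypothetical helper, and the real client splits this logic across download_file and its private helpers.

```python
import gzip
import shutil
from pathlib import Path


def finalize_download(tmp: Path, dest: Path, name: str) -> None:
    """Move a downloaded temp file into place, gunzipping only for '.gz' names."""
    if name.endswith(".gz"):
        # Stream-decompress so the whole file never sits in memory.
        with gzip.open(tmp, "rb") as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
        tmp.unlink()
    else:
        # Parquet (and plain CSV) is already in its final form: just rename.
        tmp.replace(dest)
```

Keying on the name keeps parquet files, which carry their own internal compression, away from the gunzip step.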

* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes

Three small self-contained fixes uncovered during agnes-dev cutover.

- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
  ignore_cleanup_errors=True so a worker death mid-write doesn't leave
  multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
  disk-full crash where TemporaryDirectory's own cleanup also raised
  and got swallowed.)

- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
  = always due). Force-resync of an errored row no longer requires
  waiting out the default 'every 1h' interval — admin can flip the
  schedule, trigger, then flip back.

- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
  (legacy bare-array body) and {'tables': ['table_id']} (matches the
  response payload shape, more discoverable for clients building
  requests by hand). Malformed bodies return 422 with a structured
  detail; null/missing means 'sync everything' as before.
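
The two accepted body shapes can be sketched as a small normaliser. This is a guess at the contract, not the actual handler: `parse_trigger_body` is a hypothetical name, a ValueError stands in for the 422 response, and treating `{"tables": null}` as "sync everything" is an assumption.

```python
from typing import Any, List, Optional


def parse_trigger_body(body: Any) -> Optional[List[str]]:
    """Return None for 'sync everything', or the list of target table ids."""
    if body is None:
        return None
    if isinstance(body, dict):
        if "tables" not in body:
            raise ValueError("object body must carry a 'tables' key")
        body = body["tables"]
        if body is None:  # assumption: explicit null also means everything
            return None
    if not isinstance(body, list):
        raise ValueError("expected a list of table ids")
    if not all(isinstance(t, str) and t for t in body):
        raise ValueError("table ids must be non-empty strings")
    return body
```

Both `["kbc_job"]` and `{"tables": ["kbc_job"]}` normalise to the same target list.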

Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).

* fix: targeted-trigger filter in materialized pass + auto-upgrade defer

Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.

- _run_materialized_pass now takes a 'tables' arg and skips rows not in
  the target set with reason='not_in_target'. POST /api/sync/trigger
  with a body of tables previously only scoped the legacy extractor
  subprocess — the materialized pass kept iterating every due
  materialized row, so an admin asking to re-sync kbc_job re-ran
  every other due materialized row alongside it. Match on registry id
  OR name (admins commonly pass either form). tables=None preserves
  the no-filter behavior.

- New GET /api/sync/status (public, no auth) returns {locked: bool}
  off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
  docker compose up -d and exits 0 with a 'deferred recreate' log
  line if a sync is in flight — the next 5-min cron tick retries.
  Pre-fix, an auto-upgrade triggered mid-sync would recreate the
  uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
  download (observed when kbc_job's first 7-day retry got SIGKILLed).
  Connection failures in the probe fall through to the upgrade —
  being stuck on a wedged image is worse than interrupting a
  hypothetical sync.
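
The target-set filter from the first bullet can be sketched as follows. The row shape (dicts with `id` and `name`) and the function name `split_by_target` are assumptions for illustration; only the id-or-name match and the 'not_in_target' reason come from the description above.

```python
from typing import Iterable, List, Optional, Tuple


def split_by_target(
    rows: List[dict], tables: Optional[Iterable[str]]
) -> Tuple[List[dict], List[dict]]:
    """Partition due rows into (to_run, skipped) for a targeted trigger."""
    if tables is None:
        # No target set: preserve the no-filter behaviour.
        return list(rows), []
    wanted = set(tables)
    to_run: List[dict] = []
    skipped: List[dict] = []
    for row in rows:
        # Admins pass either the registry id or the table name.
        if row["id"] in wanted or row["name"] in wanted:
            to_run.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return to_run, skipped
```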

* fix: auto-discover protects admin overrides + surfaces drift

Two real-world incidents on agnes-dev drove this:

1. kbc_job was registered manually with the correct
   (in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
   re-run would have inserted a SECOND kbc_job row at the slugified
   id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
   places it) — and that row's Storage API export-async 404s.

2. An earlier auto-discover bug stripped the stage prefix from
   bucket ids ('c-finance' instead of 'in.c-finance'), inserting
   137 rows whose syncs all failed.

Fix:

- _discover_and_register_tables now builds a plan first
  (_build_keboola_discovery_plan) classifying each discovered table
  into one of new / existing_match / existing_drift / invalid, then
  executes only the 'new' bucket. Drift rows are reported with both
  sides of the disagreement plus drift_kind:
  - same_id_diff_coords: registry has the same id but different
    bucket / source_table (admin migrated coords inline).
  - name_collision: discovery's slugified id differs from any
    registry id, but the discovered .name matches an existing row's
    .name (case-insensitive). Catches the kbc_job case.

- Bucket detection now prefers the API's authoritative bucket_id
  field (separate field on the Keboola tables.list response,
  normalised by KeboolaClient.discover_all_tables). Falls back to
  id-string parsing only when bucket_id is missing (older fallback
  path inside discover_all_tables).

- Endpoint POST /api/admin/discover-and-register?dry_run=true
  returns the plan without writing — would_register, drift,
  invalid lists. Lets an operator audit before merging discovery
  with a registry that has admin overrides.
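
The plan-first classification can be sketched as below. This is a simplified illustration, not `_build_keboola_discovery_plan` itself: the dict row shape, the function name `build_plan`, and the rule that a row with missing fields counts as invalid are all assumptions.

```python
from typing import Dict, List


def build_plan(discovered: List[dict], registry: List[dict]) -> Dict[str, list]:
    """Classify discovered tables as new / existing_match / existing_drift / invalid."""
    by_id = {r["id"]: r for r in registry}
    by_name = {r["name"].lower(): r for r in registry}
    plan: Dict[str, list] = {
        "new": [], "existing_match": [], "existing_drift": [], "invalid": [],
    }
    for d in discovered:
        # Assumption: rows lacking any coordinate are treated as invalid.
        if not all(d.get(k) for k in ("id", "name", "bucket", "source_table")):
            plan["invalid"].append(d)
            continue
        existing = by_id.get(d["id"])
        if existing is not None:
            same = (existing["bucket"], existing["source_table"]) == (
                d["bucket"], d["source_table"],
            )
            if same:
                plan["existing_match"].append(d)
            else:
                plan["existing_drift"].append(
                    {"drift_kind": "same_id_diff_coords",
                     "discovered": d, "registry": existing}
                )
            continue
        clash = by_name.get(d["name"].lower())
        if clash is not None:
            # Different id, same name: the manually registered kbc_job case.
            plan["existing_drift"].append(
                {"drift_kind": "name_collision",
                 "discovered": d, "registry": clash}
            )
            continue
        plan["new"].append(d)
    return plan
```

Only the entries under `new` would be written; drift and invalid entries are reported so an operator can resolve them by hand.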

Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.

* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp

The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call tempdirs under /tmp, so multi-GiB / many-slice
exports fill the boot disk long before the dedicated data disk fills.
agnes-dev hit 100% boot-disk usage while the 20 GiB data disk had
15 GiB free.

connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR and
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour).
materialize_query (parquet path), _extract_via_legacy (CSV fallback),
and the sliced-CSV concat path in storage_api all use the helper now.

docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
2026-05-07 12:12:14 +02:00

696 lines
30 KiB
Python

"""Lightweight Keboola Storage API client for table export.

The DuckDB Keboola extension was the originally-intended fast path, but on
projects with the `block-shared-snowflake-access` feature flag and on linked
buckets the per-session workspace can't see the bucket schemas at all
(keboola/duckdb-extension#17, fixed upstream in v0.1.6 but not yet in the
community CDN as of 2026-05-06). The `kbcstorage` SDK works but uses
`os.chdir(temp_dir)` to redirect slice downloads, which is process-global —
threaded fan-out races on CWD and slice files land in the wrong directory.

This module talks to Storage API directly and downloads via signed URLs:

- POST /v2/storage/tables/{id}/export-async
- GET /v2/storage/jobs/{id} (poll until success/error)
- GET /v2/storage/files/{id}?federationToken=1
- GET <signed_url> (single file or manifest + per-slice URLs for sliced)

No `os.chdir`, no boto3/azure-blob/google-cloud-storage SDK dependencies —
the federation-token detail response includes a signed URL that works for
all three cloud backends. Thread-safe: each call uses an independent
download path under a per-call temp directory.

Storage API reference:

- https://keboola.docs.apiary.io/#reference/tables/asynchronous-table-export
- https://keboola.docs.apiary.io/#reference/jobs
- https://keboola.docs.apiary.io/#reference/files/manage-files/file-detail
"""
from __future__ import annotations

import gzip
import logging
import os
import shutil
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable, List, Optional

import requests

logger = logging.getLogger(__name__)

# Storage API guarantees export jobs are created small and finish in seconds
# to a few minutes for typical bucket-table sizes; the absolute upper bound
# (very large tables, peak Snowflake load) is the operator's
# storage.jobsParallelism + scan duration. 30 min is a generous ceiling that
# matches what the dashboard's data-preview UI would also wait for.
_DEFAULT_EXPORT_TIMEOUT_SEC = int(os.environ.get("AGNES_KEBOOLA_EXPORT_TIMEOUT_SEC", "1800"))
_DEFAULT_POLL_INTERVAL_SEC = float(os.environ.get("AGNES_KEBOOLA_POLL_INTERVAL_SEC", "2"))

# Per-slice HTTP download timeout — separate from the export-job timeout.
# Sliced exports return a manifest of signed URLs; an individual slice is
# bounded in size by Storage API's slicer (typically ~100 MiB), so a few
# minutes is plenty for one HTTP GET.
_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC = int(
    os.environ.get("AGNES_KEBOOLA_SLICE_TIMEOUT_SEC", "300")
)


def get_temp_root() -> Optional[str]:
    """Return the parent dir for per-call tempdirs, or None to use the
    system default.

    Reads ``AGNES_TEMP_DIR`` (compose env, single source of truth) and
    creates the dir if it does not yet exist. Default behaviour
    (``AGNES_TEMP_DIR`` unset) preserves the OSS pre-fix path —
    ``tempfile.TemporaryDirectory(...)`` falls back to the platform's
    `tmpdir` (typically ``/tmp``).

    The agnes-dev cutover surfaced why this knob matters: the
    container's ``/tmp`` lives on the boot disk's overlayfs (29 GiB
    on agnes-dev, shared with /var), so a multi-slice Snowflake
    UNLOAD of a wide table fills it long before the dedicated 20 GiB
    data disk at ``/data`` would. Setting ``AGNES_TEMP_DIR=/data/tmp``
    routes the staging dir to the data disk where the parquets are
    going anyway, no extra mount required (the data disk is already
    bind-mounted).
    """
    root = os.environ.get("AGNES_TEMP_DIR", "").strip()
    if not root:
        return None
    # Best-effort mkdir: if the directory can't be created, log a
    # warning and return None so callers fall back to the system
    # tempdir. The warning keeps the fall-through to /tmp from being
    # silent.
    try:
        os.makedirs(root, exist_ok=True)
    except OSError as e:
        logger.warning(
            "AGNES_TEMP_DIR=%r not creatable (%s); tempfiles fall back "
            "to system default. Set the env to a writable path or unset "
            "to silence this warning.", root, e,
        )
        return None
    return root


FILE_TYPE_CSV = "csv"
FILE_TYPE_PARQUET = "parquet"
_VALID_FILE_TYPES = {FILE_TYPE_CSV, FILE_TYPE_PARQUET}


@dataclass
class ExportFilter:
    """Structured Keboola Storage API filter spec.

    Mirrors the BQ materialized path's `source_query` SQL string conceptually
    — both let the admin scope an extracted table — but Storage API takes a
    structured filter object rather than free-form SQL. Empty fields all
    map to "no filter" so a default-constructed ExportFilter exports the
    full table.

    Operators per Apiary docs: eq, ne, in, notIn, ge, gt, le, lt.

    `file_type` controls the format Storage API materializes into File
    Storage. `parquet` is the recommended path for the materialized sync:
    Keboola serves the parquet directly (UNLOADed from Snowflake), the
    extractor renames it into place — no CSV intermediate, no DuckDB
    COPY, no peak-memory load. Falls back to CSV when an admin pins
    `{"file_type":"csv"}` in source_query (e.g. for projects whose
    backend can't UNLOAD parquet, or legacy debugging).
    """

    where_filters: List[dict] = field(default_factory=list)
    columns: List[str] = field(default_factory=list)
    changed_since: Optional[str] = None
    changed_until: Optional[str] = None
    limit: Optional[int] = None
    file_type: str = FILE_TYPE_CSV

    def __post_init__(self):
        if self.file_type not in _VALID_FILE_TYPES:
            raise ValueError(
                f"file_type must be one of {sorted(_VALID_FILE_TYPES)}, "
                f"got {self.file_type!r}"
            )

    @classmethod
    def from_dict(cls, data: Optional[dict]) -> "ExportFilter":
        """Parse from `table_registry.source_query` JSON. Tolerates None /
        empty / unknown keys (registry stores admin input that may be sparse)."""
        if not data:
            return cls()
        if not isinstance(data, dict):
            raise ValueError(
                f"ExportFilter.from_dict expects a dict, got {type(data).__name__}"
            )
        # Accept both `file_type` (preferred, matches the rest of the
        # snake_case API) and `fileType` (matches Storage API wire name)
        # so an admin who copies an example from Apiary docs doesn't trip.
        ft = data.get("file_type") or data.get("fileType") or FILE_TYPE_CSV
        return cls(
            where_filters=list(data.get("where_filters") or []),
            columns=list(data.get("columns") or []),
            changed_since=data.get("changed_since"),
            changed_until=data.get("changed_until"),
            limit=data.get("limit"),
            file_type=ft,
        )

    def to_export_params(self) -> dict:
        """Serialize for POST body of `/tables/{id}/export-async`.

        whereFilters arrives as a list of `{column, operator, values}` dicts;
        Storage API also accepts a single `whereColumn`/`whereOperator`/
        `whereValues` triple but the multi-filter form is more general.
        """
        params: dict = {}
        if self.where_filters:
            # Validate shape lightly — surface admin typos as ValueError
            # rather than letting them turn into a 400 from Keboola's API
            # without context.
            for i, f in enumerate(self.where_filters):
                if not isinstance(f, dict):
                    raise ValueError(f"where_filters[{i}] must be a dict")
                missing = {"column", "operator", "values"} - set(f.keys())
                if missing:
                    raise ValueError(
                        f"where_filters[{i}] missing fields: {sorted(missing)}"
                    )
                if not isinstance(f["values"], list):
                    raise ValueError(f"where_filters[{i}].values must be a list")
            params["whereFilters"] = self.where_filters
        if self.columns:
            params["columns"] = ",".join(self.columns)
        if self.changed_since:
            params["changedSince"] = self.changed_since
        if self.changed_until:
            params["changedUntil"] = self.changed_until
        if self.limit is not None:
            params["limit"] = int(self.limit)
        # Only emit fileType when non-default — keeps the request body
        # quiet for legacy callers that never knew about parquet, and
        # matches the wire-side default behaviour.
        if self.file_type and self.file_type != FILE_TYPE_CSV:
            params["fileType"] = self.file_type
        return params


class StorageApiError(RuntimeError):
    """Wraps a non-2xx Storage API response with the parsed body for context."""

    def __init__(self, message: str, status: Optional[int] = None, body: Any = None):
        super().__init__(message)
        self.status = status
        self.body = body


class KeboolaStorageClient:
    """Thread-safe Storage API client for table export.

    One instance can be reused across threads — `requests.Session` is
    thread-safe when the underlying `HTTPAdapter`'s pool size is sized for
    concurrent calls. Default `pool_connections=20, pool_maxsize=20`
    accommodates the typical AGNES_KEBOOLA_PARALLELISM=8 plus headroom.
    """

    def __init__(self, *, url: str, token: str, session: Optional[requests.Session] = None):
        if not url or not token:
            raise ValueError("KeboolaStorageClient requires url and token")
        # The DuckDB Keboola extension's ATTACH chokes on a trailing slash
        # (`https://connection.<region>.keboola.com/`); the Storage API
        # tolerates either form, but normalising here keeps URL composition
        # below predictable.
        self.base = url.rstrip("/") + "/v2/storage"
        self.token = token
        if session is None:
            session = requests.Session()
            adapter = requests.adapters.HTTPAdapter(
                pool_connections=20, pool_maxsize=20
            )
            session.mount("http://", adapter)
            session.mount("https://", adapter)
        self.session = session

    # ---- low-level HTTP helpers -------------------------------------------

    def _headers(self) -> dict:
        return {"X-StorageApi-Token": self.token, "Accept": "application/json"}

    def _get(self, path: str, **kwargs) -> dict:
        url = f"{self.base}{path}"
        resp = self.session.get(url, headers=self._headers(), timeout=30, **kwargs)
        return self._parse(resp, "GET", url)

    def _post(self, path: str, *, data: Optional[dict] = None) -> dict:
        url = f"{self.base}{path}"
        resp = self.session.post(
            url, headers=self._headers(), data=data, timeout=30
        )
        return self._parse(resp, "POST", url)

    def _parse(self, resp: requests.Response, method: str, url: str) -> dict:
        try:
            body = resp.json()
        except Exception:
            body = resp.text
        if resp.status_code >= 400:
            # Redact the token if it accidentally surfaces in an error body.
            # The Storage API doesn't echo it, but third-party proxies in
            # front of customer instances sometimes do.
            redacted = self._redact(body)
            raise StorageApiError(
                f"{method} {url} -> HTTP {resp.status_code}: {redacted}",
                status=resp.status_code,
                body=body,
            )
        if not isinstance(body, dict):
            raise StorageApiError(
                f"{method} {url} -> unexpected non-JSON response: {str(body)[:200]}",
                status=resp.status_code,
                body=body,
            )
        return body

    def _redact(self, body: Any) -> str:
        s = str(body)
        if self.token and self.token in s:
            s = s.replace(self.token, "<redacted-storage-token>")
        return s[:500]

    # ---- export-async + job polling ---------------------------------------

    def export_table_async(self, table_id: str, params: dict) -> dict:
        """POST /v2/storage/tables/{table_id}/export-async — kicks off the
        async export and returns the job resource. Caller polls `job.id`
        via `wait_for_job` to find the file id when status='success'."""
        return self._post(f"/tables/{table_id}/export-async", data=params)

    def wait_for_job(
        self,
        job_id: int,
        *,
        timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
        poll_interval: float = _DEFAULT_POLL_INTERVAL_SEC,
    ) -> dict:
        """Block until the async job reaches a terminal state. Returns the
        job dict on success; raises `StorageApiError` on failure or timeout.

        The poll interval starts at `poll_interval` and grows by 1.5x after
        each poll up to a 10 s cap, so a short job is caught within a few
        quick polls without flogging the API, while a 30-min job settles
        at a steady 10 s cadence.
        """
        deadline = time.monotonic() + timeout
        interval = poll_interval
        while time.monotonic() < deadline:
            job = self._get(f"/jobs/{job_id}")
            status = job.get("status")
            if status == "success":
                return job
            if status == "error":
                raise StorageApiError(
                    f"Storage API job {job_id} reported error: "
                    f"{job.get('error') or job}",
                    body=job,
                )
            time.sleep(interval)
            # Exponential backoff bounded at 10 s: a multi-minute Snowflake
            # scan does not benefit from frequent polls. With the default
            # 2 s start, the 1.5 multiplier hits the cap on the fifth sleep
            # (about 16 s of cumulative waiting) and stays there.
            interval = min(interval * 1.5, 10.0)
        raise StorageApiError(
            f"Storage API job {job_id} did not finish within {timeout}s"
        )

    # ---- file detail + signed-URL download --------------------------------

    def file_detail(self, file_id: int) -> dict:
        """GET /v2/storage/files/{file_id}?federationToken=1 — returns the
        file metadata plus a presigned URL (`url`) usable directly via HTTP
        without any cloud SDK. For sliced exports the `url` resolves to a
        manifest JSON listing the per-slice signed URLs."""
        return self._get(f"/files/{file_id}", params={"federationToken": 1})

    def download_file(self, file_info: dict, dest_path: Path) -> Path:
        """Download a Storage API file (single or sliced) to `dest_path`.

        Backend variants:

        - **AWS / Azure**: signed HTTPS URL in `file_info["url"]` (S3
          presigned / SAS). Sliced manifest entries are signed HTTPS too.
          Plain HTTP GET works.
        - **GCP**: `file_info["url"]` is a signed HTTPS URL for the
          single-file case. For sliced exports, the manifest at `url`
          lists per-slice paths as `gs://<bucket>/<key>` (NOT signed) —
          requires GCS authentication. We use the OAuth access token from
          `file_info["gcsCredentials"]["access_token"]` and hit the REST
          endpoint
          `https://storage.googleapis.com/storage/v1/b/<bucket>/o/<urlencoded_key>?alt=media`
          with `Authorization: Bearer <token>`. No google-cloud-storage
          SDK dependency.

        Single-file: stream the signed URL directly, gunzipping if the
        URL/name ends in `.gz`. Sliced: stream each slice into
        `dest_path` in order (slice 0 has the CSV header per Storage
        API contract, subsequent slices are header-less data).
        """
        url = file_info.get("url")
        if not url:
            raise StorageApiError(
                f"file detail missing 'url': {self._redact(file_info)}",
                body=file_info,
            )
        is_sliced = bool(file_info.get("isSliced"))
        # Gzip detection is name-based only. Snowflake UNLOAD adds the
        # `.gz` suffix when compression is requested (CSV exports), and
        # leaves it off otherwise (parquet has its own internal
        # compression and is served as plain `.parquet`). The previous
        # `isEncrypted is False` fallback gated on a property that's
        # orthogonal to compression — it would have flagged parquet
        # downloads as gzipped and corrupted them at gunzip time.
        is_gzipped = file_info.get("name", "").endswith(".gz")
        dest_path.parent.mkdir(parents=True, exist_ok=True)
        if is_sliced:
            # GCP sliced manifests carry `gs://` URIs that need an OAuth
            # bearer; AWS / Azure carry signed HTTPS URLs that work
            # without auth. The presence of `gcsCredentials` in the file
            # detail signals a GCP backend.
            gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
            self._download_sliced(url, dest_path, gcs_token=gcs_token)
        else:
            self._download_single(url, dest_path, gunzip_on_read=is_gzipped)
        return dest_path

    def _download_single(
        self,
        url: str,
        dest_path: Path,
        *,
        gunzip_on_read: bool,
        extra_headers: Optional[dict] = None,
    ) -> None:
        """Stream a single signed URL (or GCS REST URL with bearer token
        in `extra_headers`) into `dest_path`. Transparently gunzips if
        the file name suggests it's a `.gz` — Storage API serves through
        proxies that may rewrite Content-Encoding, so name-based
        detection is more reliable than the header in practice."""
        with self.session.get(
            url,
            stream=True,
            timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC,
            headers=extra_headers,
        ) as r:
            r.raise_for_status()
            tmp = dest_path.with_suffix(dest_path.suffix + ".part")
            try:
                with open(tmp, "wb") as fh:
                    for chunk in r.iter_content(chunk_size=64 * 1024):
                        if chunk:
                            fh.write(chunk)
                if gunzip_on_read:
                    self._gunzip_in_place(tmp, dest_path)
                    tmp.unlink(missing_ok=True)
                else:
                    tmp.replace(dest_path)
            finally:
                if tmp.exists():
                    tmp.unlink(missing_ok=True)

    @staticmethod
    def _gs_to_https(gs_url: str) -> str:
        """Rewrite `gs://<bucket>/<key>` to GCS JSON API media-download URL.

        The JSON API requires the object name URL-encoded as a single
        path segment (slashes inside the key are escaped). `alt=media`
        switches the response from object metadata JSON to the actual
        bytes — matches what `bucket.blob(key).download_as_bytes()` does
        in the google-cloud-storage SDK.
        """
        from urllib.parse import quote

        if not gs_url.startswith("gs://"):
            raise ValueError(f"_gs_to_https expects gs://; got {gs_url!r}")
        path = gs_url[5:]  # strip "gs://"
        bucket, _, key = path.partition("/")
        if not bucket or not key:
            raise ValueError(f"malformed gs:// URL: {gs_url!r}")
        return (
            f"https://storage.googleapis.com/storage/v1/b/{bucket}"
            f"/o/{quote(key, safe='')}?alt=media"
        )

    def _download_sliced(
        self, manifest_url: str, dest_path: Path, *, gcs_token: Optional[str] = None
    ) -> None:
        """Sliced exports: the file detail's `url` points at a JSON manifest
        whose `entries[].url` lists per-slice locations. Download each slice
        and concatenate into `dest_path`. The first slice contains the CSV
        header (Storage API guarantees stable header positioning).

        Per-slice URL forms:
          - signed HTTPS (S3 presigned, Azure SAS): plain GET works.
          - `gs://<bucket>/<key>` (GCP): requires `gcs_token` (OAuth bearer
            shipped in the file_detail's `gcsCredentials.access_token`).
            Mapped to `https://storage.googleapis.com/storage/v1/b/<bucket>/o/<encoded_key>?alt=media`.
        """
        m = self.session.get(
            manifest_url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC
        )
        m.raise_for_status()
        manifest = m.json()
        entries = manifest.get("entries") or []
        if not entries:
            raise StorageApiError(
                f"sliced manifest had no entries: {str(manifest)[:200]}",
                body=manifest,
            )
        with tempfile.TemporaryDirectory(
            prefix="kbc-slice-", dir=get_temp_root(), ignore_cleanup_errors=True,
        ) as tmpdir:
            slice_paths: List[Path] = []
            for i, entry in enumerate(entries):
                surl = entry.get("url")
                if not surl:
                    raise StorageApiError(
                        f"slice {i} missing 'url': {str(entry)[:200]}",
                        body=entry,
                    )
                sp = Path(tmpdir) / f"slice-{i:05d}"
                # GCP backend: rewrite gs:// to GCS REST + bearer auth.
                # The OAuth token comes from the file_detail's
                # `gcsCredentials.access_token` (passed as `gcs_token`
                # arg).
                if surl.startswith("gs://"):
                    if not gcs_token:
                        raise StorageApiError(
                            f"slice {i} URL is gs:// but no gcs_token "
                            "provided in file_detail.gcsCredentials"
                        )
                    surl = self._gs_to_https(surl)
                    extra_headers = {"Authorization": f"Bearer {gcs_token}"}
                else:
                    extra_headers = None
                # Slices may individually be gzipped. Same name-based
                # heuristic as the single-file path, applied to the last
                # path segment of the slice URL (query string stripped):
                # if it contains `.gz`, gunzip after download.
                gz = ".gz" in surl.split("?")[0].rsplit("/", 1)[-1]
                self._download_single(
                    surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
                )
                slice_paths.append(sp)
            # Concat. Sliced CSV exports include the header in slice 0 only
            # (Storage API contract); subsequent slices are header-less.
            # This must run while the temporary directory still exists.
            with open(dest_path, "wb") as out:
                for sp in slice_paths:
                    with open(sp, "rb") as fh:
                        shutil.copyfileobj(fh, out, length=64 * 1024)

    @staticmethod
    def _gunzip_in_place(src: Path, dest: Path) -> None:
        with gzip.open(src, "rb") as gz, open(dest, "wb") as out:
            shutil.copyfileobj(gz, out, length=64 * 1024)

    # ---- high-level: export-async + poll, returning file metadata ---------

    def prepare_export(
        self,
        table_id: str,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        """Run export-async + wait_for_job + file_detail and return the
        file metadata. Caller decides how to download (single vs
        sliced). This is needed for the parquet path, where sliced
        output must be downloaded slice-by-slice and then DuckDB-merged
        (cat-style concat would corrupt the per-slice parquet footers).

        Returns:
            {"job_id": int, "file_id": int, "rows": int|None,
             "file_info": dict, "file_type": str}
        """
        f = export_filter or ExportFilter()
        params = f.to_export_params()
        job_resp = self.export_table_async(table_id, params)
        job_id = job_resp.get("id")
        if not job_id:
            raise StorageApiError(
                f"export-async response missing job id: {self._redact(job_resp)}",
                body=job_resp,
            )
        job = self.wait_for_job(job_id, timeout=export_timeout)
        results = job.get("results") or {}
        file_id = (results.get("file") or {}).get("id") or results.get("fileId")
        if not file_id:
            raise StorageApiError(
                f"job {job_id} succeeded but had no result file: "
                f"{self._redact(job)}",
                body=job,
            )
        file_info = self.file_detail(file_id)
        return {
            "job_id": int(job_id),
            "file_id": int(file_id),
            "rows": (results.get("totalRowsCount")
                     or results.get("rowsCount")
                     or job.get("totalRowsCount")),
            "file_info": file_info,
            "file_type": f.file_type,
        }

    def download_file_slices(
        self, file_info: dict, dest_dir: Path
    ) -> List[Path]:
        """Download a sliced Storage API export as separate per-slice
        files into ``dest_dir``. Returns the slice paths in manifest
        order. Use when the slices must be processed individually
        (e.g. parquet: each slice is a complete parquet file with its
        own footer; concatenation would invalidate it). For CSV where
        concat-with-header-only-on-first-slice is the right thing,
        ``download_file`` is the correct entry point.
        """
        url = file_info.get("url")
        if not url:
            raise StorageApiError(
                f"file detail missing 'url': {self._redact(file_info)}",
                body=file_info,
            )
        if not file_info.get("isSliced"):
            raise StorageApiError(
                "download_file_slices called on a non-sliced file_info; "
                "use download_file for the single-file case"
            )
        gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
        m = self.session.get(url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC)
        m.raise_for_status()
        manifest = m.json()
        entries = manifest.get("entries") or []
        if not entries:
            raise StorageApiError(
                f"sliced manifest had no entries: {str(manifest)[:200]}",
                body=manifest,
            )
        dest_dir.mkdir(parents=True, exist_ok=True)
        slice_paths: List[Path] = []
        for i, entry in enumerate(entries):
            surl = entry.get("url")
            if not surl:
                raise StorageApiError(
                    f"slice {i} missing 'url': {str(entry)[:200]}",
                    body=entry,
                )
            # Reuse the same gs:// rewrite + bearer + per-slice gz
            # heuristics used by the concat path.
            if surl.startswith("gs://"):
                if not gcs_token:
                    raise StorageApiError(
                        f"slice {i} URL is gs:// but no gcs_token "
                        "provided in file_detail.gcsCredentials"
                    )
                surl = self._gs_to_https(surl)
                extra_headers = {"Authorization": f"Bearer {gcs_token}"}
            else:
                extra_headers = None
            gz = ".gz" in surl.split("?")[0].rsplit("/", 1)[-1]
            sp = dest_dir / f"slice-{i:05d}"
            self._download_single(
                surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
            )
            slice_paths.append(sp)
        return slice_paths

    # ---- high-level: export to local file (csv or parquet) ----------------

    def export_table(
        self,
        table_id: str,
        dest_path: Path,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        """End-to-end: export-async → poll → download to ``dest_path``.

        ``export_filter.file_type`` controls the format Storage API
        materializes (``csv`` default, ``parquet`` when explicitly set).
        ``dest_path`` is the local file we write the bytes to; the caller
        decides the extension. The downloader streams chunks to disk so
        memory stays bounded regardless of file size.

        For CSV the sliced case is handled transparently: slices are
        concatenated into ``dest_path`` (header in slice 0 only). For
        **sliced parquet**, callers must use ``prepare_export`` +
        ``download_file_slices`` instead, because concatenating parquet
        slices invalidates the per-slice footer. ``export_table`` raises
        StorageApiError if it sees a sliced parquet, to fail loud.

        Returns a small stats dict so callers can log / record provenance:
            {"job_id": int, "file_id": int, "rows": int|None, "bytes": int,
             "file_type": str}
        """
        prep = self.prepare_export(
            table_id, export_filter=export_filter, export_timeout=export_timeout,
        )
        file_info = prep["file_info"]
        if (
            prep["file_type"] == FILE_TYPE_PARQUET
            and file_info.get("isSliced")
        ):
            raise StorageApiError(
                f"sliced parquet export for {table_id}: use "
                "prepare_export + download_file_slices and merge with "
                "DuckDB COPY (concat would corrupt parquet footers)",
                body=file_info,
            )
        self.download_file(file_info, dest_path)
        size = dest_path.stat().st_size if dest_path.exists() else 0
        return {
            "job_id": prep["job_id"],
            "file_id": prep["file_id"],
            "rows": prep["rows"],
            "bytes": size,
            "file_type": prep["file_type"],
        }

    # Backwards-compat alias retained for external callers (e.g. ad-hoc
    # scripts) that imported the old name. The behavior matches calling
    # `export_table` with whatever `file_type` the export_filter carries.
    # The *_to_csv suffix is now imprecise (Storage API can also serve
    # parquet here), but renaming the import would force unrelated repos
    # to coordinate. Prefer `export_table` in new code.
    def export_table_to_csv(
        self,
        table_id: str,
        dest_csv: Path,
        *,
        export_filter: Optional[ExportFilter] = None,
        export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
    ) -> dict:
        return self.export_table(
            table_id,
            dest_csv,
            export_filter=export_filter,
            export_timeout=export_timeout,
        )
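

# ---- illustrative sketch (not part of the client API) ---------------------
# A minimal, offline illustration of the sliced-CSV contract that
# `_download_sliced` relies on: each slice is gunzipped on its own, then the
# results are byte-concatenated in manifest order, with the header present
# in slice 0 only. `_sketch_merge_csv_slices`, `_sketch_demo` and the sample
# rows are hypothetical names and data introduced for this example; nothing
# here calls Storage API. As a simplifying assumption the sketch treats a
# slice as gzipped when its local file name ends in `.gz`, whereas the
# client inspects the slice URL. Byte concatenation is only valid for CSV;
# parquet slices must stay separate files (see `download_file_slices`).
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import List


def _sketch_merge_csv_slices(slice_paths: List[Path], dest_path: Path) -> int:
    """Append every slice to `dest_path` in order, gunzipping slices whose
    name ends in `.gz`, and return the number of bytes written."""
    with open(dest_path, "wb") as out:
        for sp in slice_paths:
            # gzip.open and open both accept a Path and a binary mode.
            opener = gzip.open if sp.name.endswith(".gz") else open
            with opener(sp, "rb") as fh:
                shutil.copyfileobj(fh, out, length=64 * 1024)
    return dest_path.stat().st_size


def _sketch_demo() -> bytes:
    """Build two fake slices (header only in the first) and merge them."""
    with tempfile.TemporaryDirectory(prefix="kbc-sketch-") as tmpdir:
        root = Path(tmpdir)
        first = root / "slice-00000.gz"   # gzipped, carries the header
        second = root / "slice-00001"     # plain, header-less
        first.write_bytes(gzip.compress(b"id,name\n1,a\n"))
        second.write_bytes(b"2,b\n")
        merged = root / "merged.csv"
        _sketch_merge_csv_slices([first, second], merged)
        return merged.read_bytes()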