Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190)

* fix: cutover regressions + parallel Keboola legacy fallback

Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.

CUTOVER REGRESSIONS

- agnes pull hash mismatch on every Keboola local-mode table —
  src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
  while the CLI compares against full 32-char content MD5. Now stores
  the same content MD5 the materialized SQL path already used.

- Trailing-slash sanitization in connectors/keboola/access.py and
  extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
  ends in / (canonical form).

- src/profiler.py:TableInfo.description becomes optional — two call
  sites instantiated without it, crashing the profiler pass.

- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
  ran as root, current runs as agnes (uid 999). Reads target uid:gid
  from /etc/passwd inside the new image and chowns ${STATE_DIR},
  /data/extracts, /data/analytics when the digest moves.

- POST /api/sync/trigger is now singleton per process — two
  near-simultaneous trigger calls each forked an extractor subprocess,
  fought for extract.duckdb's file lock, starved uvicorn, flipped the
  container to unhealthy. Trigger now returns 409
  (sync_already_in_progress) when held; _run_sync acquires non-blocking.

PARALLEL LEGACY FALLBACK

- Process pool fan-out for the _extract_via_legacy queue (default 8
  workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
  thread pool, because connectors/keboola/client.py:export_table does
  os.chdir(temp_dir) — process-global, so threads raced and slice files
  landed in the wrong directory ("[Errno 2] No such file or directory:
  '<job_id>.csv_X_Y_Z.csv'").

- Extractor subprocess timeout 1800s -> 3600s (configurable via
  AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
  jobs need the headroom on telemetry-class projects.

- Process group cleanup on timeout — Popen(start_new_session=True) puts
  the extractor in its own group. On timeout the parent SIGTERMs the
  group (10s grace) then SIGKILLs stragglers. Without this, the pool
  workers were reparented to PID 1 and continued holding open Keboola
  Storage export jobs. Inline extractor script also installs a SIGTERM
  -> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
  __exit__ runs cleanly.

Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).

Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).

* feat(keboola): Storage API direct extract path; drop extension data path

The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.

connectors/keboola/storage_api.py (NEW)
  Lightweight client built on requests.Session. Three endpoints:
  - POST /v2/storage/tables/{id}/export-async        (kicks off job)
  - GET  /v2/storage/jobs/{id}                        (poll until done)
  - GET  /v2/storage/files/{id}?federationToken=1     (signed URL detail)
  - GET  <signed_url>                                 (download bytes)
  Supports sliced exports (manifest + per-slice signed URLs) and gzipped
  payloads. ExportFilter dataclass mirrors the Keboola filter spec
  (whereFilters / columns / changedSince / limit) and handles JSON
  round-trip with the registry's source_query column. Token redaction
  in error messages. Bounded exponential backoff on job polling.
  No cloud-SDK dependency on the data path; thread-safe.

connectors/keboola/extractor.py
  - materialize_query() rewritten: takes bucket/source_table/source_query
    (JSON filter spec), exports via KeboolaStorageClient, converts CSV
    to parquet via DuckDB, atomic os.replace. Same return shape so
    sync.py downstream code stays uniform with the BQ branch.
  - _extract_via_legacy() also moved to Storage API direct (kept the
    name for caller compatibility with _legacy_worker / the parallel
    batch extractor). Per-call temp directories — no os.chdir, threads
    don't race.

app/api/sync.py
  _run_materialized_pass for source_type='keboola' rows now constructs a
  KeboolaStorageClient (replaces KeboolaAccess) and passes
  bucket/source_table/source_query to materialize_query. Reuses one
  client across rows for HTTP keep-alive. Sources keboola URL from env
  too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
  configured.

cli/commands/admin.py
  discover-and-register defaults Keboola rows to query_mode='materialized'
  (NULL source_query = full table), matching the v26 migration's
  unification of the local/materialized split for Keboola. BigQuery and
  Jira keep their per-source defaults.

src/db.py
  Schema bump 25 → 26. Migration: UPDATE table_registry SET
  query_mode='materialized' WHERE source_type='keboola' AND
  query_mode='local'. NULL source_query on those rows means "full table
  export" — same effective behavior the local mode provided, but now
  via Storage API instead of the extension.

pyproject.toml
  kbcstorage dep stays (admin-side bucket/table list still uses the
  SDK in app/api/admin.py / connectors/keboola/client.py); only the
  data path is migrated off the SDK. Comment updated to reflect the
  new boundary.

tests
  - test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
    HTTP client (token redaction, retry logic, polling), download_file
    (single, gzipped, sliced), end-to-end export_table_to_csv.
  - test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
    instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
    contracts.
  - test_sync_trigger_keboola_materialized.py: registry rows now carry
    bucket+source_table+JSON-shape source_query.

114+ Keboola-impacted tests green locally.

* test: schema version assertion bumped to 26 alongside the keboola query_mode migration

* fix(keboola): cutover hot-patches surfaced on agnes-dev

Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.

- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
  source, any mode), not just rows where source matches and query_mode
  is local. After the v25→v26 keboola materialized migration an
  instance can have 30 materialized rows and zero local rows; the
  previous gate kept re-firing _discover_and_register_tables every
  scheduler tick, creating duplicate auto-discovered rows with the
  wrong bucket prefix every time.

- app/api/admin.py: _discover_and_register_tables reassembles the
  bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
  dropping the stage prefix; default query_mode for keboola is now
  materialized (the v26 contract); validator allows NULL source_query
  for keboola materialized rows (full-table export via Storage API
  export-async, no SQL needed).

- cli/commands/admin.py: register-table mirrors the server validator
  (NULL source_query allowed for source_type=keboola); --bucket help
  text generalized to cover both BQ dataset and Keboola bucket id.

- connectors/keboola/extractor.py: max_line_size=64 MiB on
  read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
  in particular) do not trip the default 2 MiB ceiling.

- connectors/keboola/storage_api.py: GCP backend support — when the
  Storage API returns a manifest whose slice URLs are gs://
  references with a gcsCredentials block, rewrite to the JSON REST
  download endpoint and authenticate with the issued OAuth bearer
  token; redact tokens in any surfaced error string.

* test: align with new keboola materialized + auto-discover-gate contracts

- test_admin_keboola_materialized: rename
  test_register_keboola_materialized_rejects_missing_source_query →
  test_register_keboola_materialized_accepts_missing_source_query.
  v25→v26 introduced 'keboola materialized with NULL source_query
  means full-table export via Storage API export-async' as the
  default registration shape; the rejection case is no longer the
  contract.

- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
  gate in _run_sync now keys off the WHOLE registry (not just local
  rows) so materialized-only Keboola instances do not re-trigger
  discovery on every tick.

* feat(keboola): native parquet export — skip CSV roundtrip

Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.

Two cases:

- **Single-file** export (small table): file_info.url points at one
  signed URL; download_file streams chunks straight to .parquet.tmp
  and we're done. No DuckDB.

- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
  default — so anything larger arrives as N parquet slices): each
  slice is a complete parquet file with its own footer; naive concat
  would corrupt them. download_file_slices keeps the slices as
  separate files in a tempdir, then DuckDB COPY (SELECT * FROM
  read_parquet([slice0, slice1, ...])) merges them into one
  consolidated parquet. DuckDB streams row groups during this — peak
  memory bounded to one row group (~1 MiB) regardless of source size.

The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.

Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.

* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes

Three small self-contained fixes uncovered during agnes-dev cutover.

- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
  ignore_cleanup_errors=True so a worker death mid-write doesn't leave
  multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
  disk-full crash where TemporaryDirectory's own cleanup also raised
  and got swallowed.)

- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
  = always due). Force-resync of an errored row no longer requires
  waiting out the default 'every 1h' interval — admin can flip the
  schedule, trigger, then flip back.

- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
  (legacy bare-array body) and {'tables': ['table_id']} (matches the
  response payload shape, more discoverable for clients building
  requests by hand). Malformed bodies return 422 with a structured
  detail; null/missing means 'sync everything' as before.

Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).

* fix: targeted-trigger filter in materialized pass + auto-upgrade defer

Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.

- _run_materialized_pass now takes a 'tables' arg and skips rows not in
  the target set with reason='not_in_target'. POST /api/sync/trigger
  with a body of tables previously only scoped the legacy extractor
  subprocess — the materialized pass kept iterating every due
  materialized row, so an admin asking to re-sync kbc_job re-ran
  every other due materialized row alongside it. Match on registry id
  OR name (admins commonly pass either form). tables=None preserves
  the no-filter behavior.

- New GET /api/sync/status (public, no auth) returns {locked: bool}
  off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
  docker compose up -d and exits 0 with a 'deferred recreate' log
  line if a sync is in flight — the next 5-min cron tick retries.
  Pre-fix, an auto-upgrade triggered mid-sync would recreate the
  uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
  download (observed when kbc_job's first 7-day retry got SIGKILLed).
  Connection failures in the probe fall through to the upgrade —
  being stuck on a wedged image is worse than interrupting a
  hypothetical sync.

* fix: auto-discover protects admin overrides + surfaces drift

Two real-world incidents on agnes-dev drove this:

1. kbc_job was registered manually with the correct
   (in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
   re-run would have inserted a SECOND kbc_job row at the slugified
   id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
   places it) — and that row's Storage API export-async 404s.

2. An earlier auto-discover bug stripped the stage prefix from
   bucket ids ('c-finance' instead of 'in.c-finance'), inserting
   137 rows whose syncs all failed.

Fix:

- _discover_and_register_tables now builds a plan first
  (_build_keboola_discovery_plan) classifying each discovered table
  into one of new / existing_match / existing_drift / invalid, then
  executes only the 'new' bucket. Drift rows are reported with both
  sides of the disagreement plus drift_kind:
  - same_id_diff_coords: registry has the same id but different
    bucket / source_table (admin migrated coords inline).
  - name_collision: discovery's slugified id differs from any
    registry id, but the discovered .name matches an existing row's
    .name (case-insensitive). Catches the kbc_job case.

- Bucket detection now prefers the API's authoritative bucket_id
  field (separate field on the Keboola tables.list response,
  normalised by KeboolaClient.discover_all_tables). Falls back to
  id-string parsing only when bucket_id is missing (older fallback
  path inside discover_all_tables).

- Endpoint POST /api/admin/discover-and-register?dry_run=true
  returns the plan without writing — would_register, drift,
  invalid lists. Lets an operator audit before merging discovery
  with a registry that has admin overrides.

Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.

* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp

The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.

connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.

docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
This commit is contained in:
ZdenekSrotyr 2026-05-07 12:12:14 +02:00 committed by GitHub
parent cbd91838e2
commit 28430ced09
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
30 changed files with 3583 additions and 362 deletions

View file

@ -10,6 +10,66 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
### Added
- `AGNES_TEMP_DIR` env var (default in `docker-compose.yml`: `/data/tmp`) routes per-call extractor tempdirs (Snowflake-UNLOAD slice staging, CSV→parquet intermediates) off the container's overlayfs `/tmp` onto the data volume. Boot-disk overlayfs filled to 100% on agnes-dev during a multi-GiB sliced parquet export; the dedicated data disk had 15 GiB free at the time. Helper `connectors/keboola/storage_api.py:get_temp_root` mkdirs the target on first use; unset / empty / unwritable falls back to system `/tmp` for compat with OSS users on a single-disk host.
- `POST /api/admin/discover-and-register?dry_run=true` returns the planned mutations without writing — lists `would_register`, `drift` (existing rows whose registry coordinates differ from what discovery would write), and `invalid` ids. Useful for auditing before re-running auto-discovery on a registry that's already had admin overrides applied.
- `GET /api/sync/status` returns `{"locked": bool}` — public, no auth. Consumed by the host-side `agnes-auto-upgrade.sh` cron to decide whether to defer `docker compose up -d` until the running sync finishes. Cheap (single Lock check), no sensitive data.
### Fixed
- `app/api/admin.py`: `_discover_and_register_tables` no longer overwrites admin-corrected registry rows. Two drift flavours surfaced (and skipped):
- **same_id_diff_coords** — registry has a row at the same id but different `(bucket, source_table)`; admin migrated coordinates.
- **name_collision** — discovery's slugified id differs from any registry id, but the discovered `name` matches an existing row's `name` (case-insensitive). Real-world cause: the `kbc_job` row was registered manually with the right bucket; Keboola's discovery exposes it under a different stage prefix that slugs to a different id. Pre-fix, auto-discovery would have inserted a duplicate whose Storage API export-async 404s. Now classified as drift, surfaced with `registry_id` so an operator can reconcile.
- `app/api/admin.py`: bucket detection in auto-discovery now uses the Keboola API's authoritative `bucket_id` field directly (with id-string parsing only as a fallback). Pre-fix, parsing the id string was the primary path and a stripped stage prefix inserted 137 broken rows.
- `app/api/sync.py`: `POST /api/sync/trigger` with a `tables` payload now actually scopes the materialized pass too. Previously the targeted trigger only filtered the legacy extractor subprocess; `_run_materialized_pass` still iterated every materialized row in the registry, so an admin asking to re-sync `kbc_job` re-ran every other due materialized row alongside it. The pass now takes a `tables` arg and skips rows not in the target set with `reason="not_in_target"`. Both registry id and name match.
- `scripts/ops/agnes-auto-upgrade.sh`: defers `docker compose up -d` while a sync is in flight. Probes `GET /api/sync/status` with a 5s timeout; if the response carries `"locked":true`, exits 0 with a deferred-recreate log line and waits for the next 5-min cron tick. Connection failures (older app version without the endpoint, app crashed, etc.) fall through to the upgrade — being stuck on a wedged image is worse than interrupting a hypothetical sync.
- `connectors/keboola/extractor.py`: `materialize_query` per-call tempdir is now opened with `ignore_cleanup_errors=True`. Previously a worker death mid-write under disk-full state could leave a multi-GiB stale slice tree (12 GiB seen on agnes-dev) because `TemporaryDirectory.__exit__` itself raised, masking the original exception and skipping cleanup. Now cleanup is best-effort and always fires.
- `src/scheduler.py`: `is_valid_schedule` now accepts `every 0m` (interval = 0 = "always due"). Useful as a force-resync override on a row whose previous attempt errored without recording `last_sync` — the default `every 1h` would otherwise block the retry for an hour. Existing values reject as before.
- `app/api/sync.py`: `POST /api/sync/trigger` now accepts both `["table_id"]` (legacy) and `{"tables": ["table_id"]}` (mirrors response shape) request bodies, plus `null` / no body for "sync everything". Malformed shapes return HTTP 422 with a structured detail. No client breakage — the old wire format keeps working.
### Changed
- `connectors/keboola`: materialized sync now requests **parquet directly** from the Storage API (`POST /v2/storage/tables/{id}/export-async` with `fileType=parquet`) instead of CSV → DuckDB COPY → parquet. The extractor downloads the Snowflake-UNLOADed parquet, renames into place, and skips the DuckDB roundtrip entirely. Eliminates the OOM that hits multi-GB Keboola tables when `read_csv(..., all_varchar=true, max_line_size=64MB)` materializes the whole CSV in memory before COPY. Sliced exports (large tables that Snowflake UNLOAD writes as multiple files) are merged via `DuckDB COPY (SELECT * FROM read_parquet([...]))` — peak memory bounded to one parquet row group (~1 MiB) regardless of table size. Admin can pin the legacy CSV path with `source_query='{"file_type":"csv"}'`. Backward-compat alias `KeboolaStorageClient.export_table_to_csv` retained.
- `connectors/keboola/storage_api.py`: `download_file` gzip detection no longer treats unencrypted files as gzipped (previous heuristic would have corrupted parquet downloads at gunzip time). Name-suffix-only.
- **BREAKING for Keboola operators**: schema bump to **v26**. Existing `query_mode='local'` Keboola rows are migrated to `query_mode='materialized'` (NULL `source_query` = full-table export — same effective behavior as before). New `register-table --source-type keboola` and `discover-and-register --source-type keboola` default to `materialized`. The `local` mode for Keboola is gone — it ran the DuckDB extension's COPY through Keboola QueryService, which is unreliable on linked-bucket projects (extension v0.1.6 fixes the linked-bucket case but not yet in the community CDN; pre-fix, projects with the `block-shared-snowflake-access` flag couldn't see bucket schemas at all). BigQuery and Jira `local` rows are untouched. See `connectors/keboola/storage_api.py` + the v25→v26 migration in `src/db.py`.
- **Keboola extract path is now Storage API direct**, not the DuckDB extension. New `connectors/keboola/storage_api.py` talks to Keboola Storage API straight via `requests`:
- `POST /v2/storage/tables/{id}/export-async` to kick off the job (with optional `whereFilters` / `columns` / `changedSince` from the row's `source_query` JSON);
- `GET /v2/storage/jobs/{id}` polled with bounded exponential backoff until `success` or `error`;
- `GET /v2/storage/files/{id}?federationToken=1` to fetch a signed URL;
- `GET <signed_url>` (or per-slice URLs from a manifest for sliced exports) → CSV → DuckDB COPY → parquet.
No `os.chdir`, no boto3/azure-blob/google-cloud-storage SDKs, no extension binary on the data path. Thread-safe. Same path is used both by `materialize_query()` (admin-registered tables with optional filter spec) and by `_extract_via_legacy()` (per-table fallback inside the parallel batch extractor).
- **`source_query` shape for Keboola materialized rows is JSON**, not SQL — Storage API takes a structured filter object, not free-form SQL. Mirrors the BQ materialized path conceptually but with a different payload. Schema:
```json
{
"where_filters": [{"column": "date", "operator": "ge", "values": ["2026-04-01"]}],
"columns": ["id", "date", "amount"],
"changed_since": "2026-04-01T00:00:00",
"limit": 1000
}
```
All fields optional. Empty / NULL = full-table export. Operators per Keboola Apiary: `eq`, `ne`, `in`, `notIn`, `ge`, `gt`, `le`, `lt`. See `connectors/keboola/storage_api.py:ExportFilter`.
- `POST /api/sync/trigger` is now singleton per process. A second trigger that arrives while the previous sync is still running returns **HTTP 409** (`detail: sync_already_in_progress`) instead of scheduling a parallel `_run_sync`. The scheduler container's `data-refresh` job logs the 409 as a normal warning and waits for its next tick — no retry loop. Operator-visible: clients that hand-roll their own polling on `/api/sync/trigger` now need to handle 409. Why it matters: two concurrent extractor subprocesses both write `extract.duckdb`, fight for its file lock, starve uvicorn's worker pool, and Docker flips `agnes-app` to `unhealthy` long enough for `reverse_proxy`-fronted deploys to return 503 to external traffic until contention drains.
- Keboola legacy Storage-API fallback now runs in parallel across a process pool. When the DuckDB extension's per-table scan fails (e.g. on projects with the `block-shared-snowflake-access` feature flag where workspace roles can't see bucket schemas, see keboola/duckdb-extension#17), tables that fall back to the legacy `kbcstorage` client are now drained concurrently instead of one-at-a-time. The dominant per-table cost is the synchronous wait on the Keboola Storage export job (which scans Snowflake into a CSV and returns); fanning out across N workers cuts wall-clock proportionally for batches that hit the fallback. Default 8 workers, override with `AGNES_KEBOOLA_PARALLELISM` (set to `1` for sequential, useful when debugging or seeing Keboola-side rate-limiting). Project-level concurrency is bounded by the operator's `storage.jobsParallelism` limit (typically 10); the default 8 leaves headroom for other clients. Workers are processes (not threads) because `connectors/keboola/client.py:export_table` does `os.chdir(temp_dir)` to redirect kbcstorage's slice-file downloads into a per-call temp directory — `os.chdir` is process-global, so two threads racing on it land slice files in the wrong directory and the merge step fails with `[Errno 2] No such file or directory: '<job_id>.csv_X_Y_Z.csv'`. Process workers each have their own CWD.
- Extractor subprocess timeout bumped from 1800s to 3600s (configurable via `AGNES_EXTRACTOR_TIMEOUT_SEC`). On projects where the legacy Storage-API fallback is the only working path (extension blocked by `block-shared-snowflake-access`), 28+ tables × multi-minute Keboola export jobs routinely overran the 30-min cap before the parallel fallback even existed; with parallelization in place the run usually fits, but `kbc_telemetry`-class tables and large CRM snapshots can still push it over. The 1h ceiling matches the longest practically-reasonable Keboola export job before an operator should intervene.
- Extractor subprocess is now launched in its own process group (`subprocess.Popen(..., start_new_session=True)`) so a timeout can take down the whole tree — the extractor parent plus the ProcessPoolExecutor workers it spawned for parallel legacy fallback. Without this, a `subprocess.run(timeout=...)` SIGKILLed only the immediate child; the pool workers were reparented to PID 1 and continued holding open Keboola Storage export jobs, blocking the next sync cycle. On timeout the parent now SIGTERMs the group (10s grace), then SIGKILLs stragglers. The extractor's inline Python script installs a SIGTERM → `sys.exit(143)` handler so the `with ProcessPoolExecutor(...)` block runs its `__exit__` (`shutdown(wait=True)`) cleanly before the process dies.
### Fixed (cutover regressions, surfaced 2026-05-06)
- `agnes pull` no longer fails with `hash mismatch: expected … got …` for every Keboola local-mode table. `src/orchestrator.py:_update_sync_state` stored `md5(f"{mtime_ns}:{size}")[:12]` — a 12-char fingerprint of file metadata — while the CLI's post-download integrity check compares against the full 32-char content MD5 it computes via `cli/commands/sync.py:_md5_file`. Those could never match, so every `agnes pull` reported `Updated 0 tables` even when the server had data. Now the orchestrator stores the same content MD5 the materialized SQL path already used (`app/api/sync.py:_file_hash`).
- Latent `NameError: name '_sys' is not defined` in `app/api/sync.py:_run_sync` when the function fell into its outer `except Exception` before reaching the inner `import sys as _sys`. Hoisted the import to the top of the body so the error path stays loggable instead of trading the original failure for a misleading stack trace.
- Keboola sync now falls back to the legacy Storage-API client when the DuckDB Keboola extension's per-table scan fails, not just when the initial `ATTACH` fails. Two changes:
- `kbcstorage>=0.9.0` is promoted from optional to core dependency. The legacy fallback path in `connectors/keboola/extractor.py:_extract_via_legacy` has been there since the extension landed, but until now the bare `from kbcstorage.client import Client` would crash any default install with `ModuleNotFoundError`.
- `connectors/keboola/extractor.py:run` now wraps `_extract_via_extension` in a per-table try/except — on any per-table scan failure it retries via the legacy client. Previously, when `ATTACH` succeeded but the table-level `COPY (SELECT * FROM kbc."<bucket>"."<table>")` failed, the table was just marked failed with no retry.
Together these unblock deployments where the extension's bucket-schema scans return `Schema '..."in.c-..."' does not exist or not authorized` (keboola/duckdb-extension#17) while the upstream extension fix is in flight.
- `connectors/keboola/access.py:KeboolaAccess.__init__` and `connectors/keboola/extractor.py:_try_attach_extension` now strip a trailing slash from the Keboola stack URL before passing it to the DuckDB Keboola extension's `ATTACH`. The canonical Keboola URL form (`https://connection.<region>.keboola.com/`) failed there with a network error; bare-host form works. Operators no longer have to massage the value out of `KEBOOLA_STACK_URL` / `instance.yaml`.
- `src/profiler.py:TableInfo.__init__` makes `description` optional (defaults to `""`). Two call sites in `app/api/catalog.py` and `app/api/sync.py` instantiate `TableInfo(name=..., table_id=...)` without it; the previous required-arg signature crashed sync's profiler pass with `TableInfo.__init__() missing 1 required positional argument: 'description'`, leaving `[SYNC] Profiled 0 tables` after every run.
- `scripts/ops/agnes-auto-upgrade.sh` now `chown`s `${STATE_DIR}` (`/data/state` by default), `/data/extracts`, `/data/analytics` to the new image's runtime UID:GID before `docker compose up` when the image digest moves. Catches root → non-root UID transitions across upgrades — without it, the new image's first start `PermissionError`s on `.session_secret` / DuckDB. Reads the target uid:gid from `/etc/passwd` inside the image so the script stays honest if the runtime user ever moves off uid 999.
### Internal
- `infra/modules/customer-instance` (tag `infra-v1.8.0`): `startup-script.sh.tpl` no longer overwrites operator-edited `AGNES_TAG` / `AGNES_TEMP_DIR` in `/opt/agnes/.env` on every boot. Reads the existing values when present and lets them win over the template-computed `$IMAGE_TAG`. Pre-fix, an in-place TF action that stopped/started the VM (e.g. `machine_type` change) would re-run the startup script and clobber any manually-pinned image tag — operators had to re-edit the file post-restart. Fresh provisions still get the TF-driven values; the `.env` file's existence is the disambiguator. To force a TF-driven reset, `rm /opt/agnes/.env` and reboot.
## [0.45.0] — 2026-05-07
Operator-and-analyst quality bundle: a security fix for the optional
@ -121,6 +181,8 @@ actionable signal. Closes #84, #164, #177, #178, #203, #204.
## [0.44.1] — 2026-05-07
## [0.44.1] — 2026-05-07
### Fixed
- `/admin/users/{id}` — "Add to group" dropdown explains itself when empty instead of leaving the admin staring at a silent `— Pick a group —` placeholder. Three cases now surface a hint below the picker: (a) user is already in every group, (b) every remaining group is Google-Workspace-managed and Agnes can't grant manually (POST would 409 — link to `/admin/groups` to create a custom group), (c) no groups exist at all. Pre-fix on deployments where `Admin` + `Everyone` are mapped via `AGNES_GROUP_{ADMIN,EVERYONE}_EMAIL` and no custom groups exist, the picker was empty with zero indication that the operator needed to create a custom group first.

View file

@ -1277,11 +1277,22 @@ class RegisterTableRequest(BaseModel):
raise ValueError(
"source_query is only valid when query_mode='materialized'"
)
# Non-BQ materialized rows must supply source_query explicitly — there
# is no server-generate fallback for Keboola materialized.
if self.query_mode == "materialized" and not sq and self.source_type != "bigquery":
# BigQuery materialized auto-generates a full-table-dump SQL from
# `bucket`+`source_table` when source_query is omitted (see
# `register_table` BQ branch). Keboola materialized: a NULL
# source_query means "full-table export via Storage API
# export-async" — no SQL needed (the API takes a structured
# filter, see `connectors/keboola/storage_api.py:ExportFilter`).
# Other source_types (e.g. jira) don't support materialized mode
# and require an explicit source_query if the operator opts in.
if (
self.query_mode == "materialized"
and not sq
and self.source_type not in ("bigquery", "keboola")
):
raise ValueError(
"query_mode='materialized' requires a non-empty source_query"
f"query_mode='materialized' for source_type='{self.source_type}' "
"requires a non-empty source_query"
)
# Backtick guard stays for non-materialized rows (DuckDB-flavor SQL
# contract); materialized SQL is BigQuery-native and MUST allow
@ -2766,13 +2777,172 @@ async def configure_instance(
}
def _discover_and_register_tables(conn: duckdb.DuckDBPyConnection, user_email: str) -> dict:
"""Discover tables from configured source and register them. Shared logic for API and sync."""
def _split_keboola_table_id(full_id: str, fallback_name: str = "") -> tuple[str, str]:
"""Split a Keboola table id into ``(bucket, source_table)``.
Keboola convention: ``<stage>.<bucket-id>.<table>`` where stage
``{in, out, sys}`` and bucket-id typically starts with ``c-``
(e.g. ``in.c-finance.orders``). Storage API export-async needs the
FULL ``<stage>.<bucket-id>`` as the bucket arg a stripped
``c-finance`` 404s. The 2-segment fallback covers id strings
without the stage prefix; the 0/1-segment path returns empty
bucket and uses ``fallback_name`` as the table name so the row
fails loud at sync time rather than silently registering with
no source coordinates.
"""
parts = (full_id or "").strip().split(".")
if len(parts) >= 3:
return ".".join(parts[:-1]), parts[-1]
if len(parts) == 2:
return parts[0], parts[1]
return "", fallback_name or full_id
def _build_keboola_discovery_plan(
conn: duckdb.DuckDBPyConnection, discovered: list[dict],
) -> dict:
"""Inspect ``discovered`` (output of ``KeboolaClient.discover_all_tables``)
against the live registry and bucket every entry into one of:
- ``new``: not in registry, will be inserted.
- ``existing_match``: row already in registry under the same id
AND its ``(bucket, source_table)`` matches what discovery would
write no-op, nothing to do.
- ``existing_drift``: a row in the registry conflicts with what
discovery would write. Two flavours, both surfaced for operator
visibility but **never overwritten**:
1. Same registry id, different ``(bucket, source_table)``
admin corrected the coordinates inline (rarer).
2. Different registry id but the discovered ``name`` clashes
with an existing row's ``name`` (case-insensitive). Real
example: registry has ``id='kbc_job', name='kbc_job',
bucket='in.c-kbc_telemetry'``; Keboola exposes the same
logical table at id ``in.c-keboola-storage.job`` (which
slugs to a different ``table_id``). Without this
check, auto-discovery would insert a duplicate ``kbc_job``
whose Storage API export-async 404s.
- ``invalid``: id couldn't produce a usable ``table_id`` slug.
Each bucket carries the exact rows; the API endpoint composes a
summary + (optionally) executes. Pre-fix, this logic was inlined
in ``_discover_and_register_tables`` and there was no way to see
what would change without writing.
"""
repo = TableRegistryRepository(conn)
# Pre-load all keboola rows once so the name-collision lookup
# below is O(1) per discovered entry. Falls back to per-id
# `repo.get(...)` calls when list_all isn't available — keeps
# the single-row test stubs working without forcing them to
# implement list_all.
try:
all_rows = [r for r in repo.list_all() if r.get("source_type") == "keboola"]
except AttributeError:
all_rows = []
by_name: dict[str, dict] = {
(r.get("name") or "").strip().lower(): r for r in all_rows
}
plan = {"new": [], "existing_match": [], "existing_drift": [], "invalid": []}
for table in discovered:
full_id = (table.get("id") or "").strip()
# Slug used as the registry primary key. Lowercase, dots/spaces
# → underscores. Stable across discovery runs.
table_id = full_id.lower().replace(".", "_").replace(" ", "_")
if not table_id:
plan["invalid"].append({
"table_id": "",
"full_id": full_id,
"reason": "empty id from discovery payload",
})
continue
# Prefer Keboola's authoritative `bucket_id` (separate field in
# the API response, normalised by `discover_all_tables`) over
# parsing the full id string. Fall back to the parser when
# the API didn't return bucket_id (older fallback path inside
# discover_all_tables).
bucket = (table.get("bucket_id") or "").strip()
name = (table.get("name") or "").strip()
source_table = name
if not bucket or not source_table:
bucket, source_table = _split_keboola_table_id(full_id, source_table)
entry = {
"table_id": table_id,
"name": table.get("name", table_id),
"full_id": full_id,
"bucket": bucket,
"source_table": source_table,
}
existing = repo.get(table_id)
if existing is not None:
ex_bucket = existing.get("bucket") or ""
ex_source_table = existing.get("source_table") or ""
if ex_bucket == bucket and ex_source_table == source_table:
plan["existing_match"].append(entry)
else:
plan["existing_drift"].append({
**entry,
"registry_bucket": ex_bucket,
"registry_source_table": ex_source_table,
"registry_id": existing.get("id"),
"drift_kind": "same_id_diff_coords",
})
continue
# No row at this id. Look for a name collision (admin
# registered the same logical table under a different id).
name_match = by_name.get(name.lower()) if name else None
if name_match is not None:
plan["existing_drift"].append({
**entry,
"registry_bucket": name_match.get("bucket") or "",
"registry_source_table": name_match.get("source_table") or "",
"registry_id": name_match.get("id"),
"drift_kind": "name_collision",
})
continue
plan["new"].append(entry)
return plan
def _discover_and_register_tables(
conn: duckdb.DuckDBPyConnection,
user_email: str,
*,
dry_run: bool = False,
) -> dict:
"""Discover tables from configured source and register them.
Behavior:
- Only the configured source type ``keboola`` is supported here
(BigQuery uses a different discovery endpoint).
- Already-registered rows are NEVER overwritten. The plan
classifies them as ``existing_match`` (no-op, registry agrees
with discovery) or ``existing_drift`` (admin edited the
coordinates; left alone, surfaced in the response so the
operator sees the divergence).
- ``dry_run=True`` returns the plan without writing anything
useful for auditing before a re-discovery on a registry that
already has admin overrides.
"""
from app.instance_config import get_data_source_type, get_value
source_type = get_data_source_type()
if source_type != "keboola":
return {"registered": 0, "skipped": 0, "errors": 0, "tables": [], "source": source_type}
return {
"registered": 0,
"skipped": 0,
"errors": 0,
"drifted": 0,
"tables": [],
"source": source_type,
"dry_run": dry_run,
}
from connectors.keboola.client import KeboolaClient
# Read from data_source.keboola (matches what /api/admin/configure writes)
@ -2785,65 +2955,106 @@ def _discover_and_register_tables(conn: duckdb.DuckDBPyConnection, user_email: s
client = KeboolaClient(token=token, url=url)
discovered = client.discover_all_tables()
plan = _build_keboola_discovery_plan(conn, discovered)
drift_summary = [
{
"table_id": e["table_id"],
"discovery": {"bucket": e["bucket"], "source_table": e["source_table"]},
"registry": {"bucket": e["registry_bucket"],
"source_table": e["registry_source_table"]},
}
for e in plan["existing_drift"]
]
if dry_run:
return {
"registered": 0,
"skipped": len(plan["existing_match"]),
"errors": len(plan["invalid"]),
"drifted": len(plan["existing_drift"]),
"tables": [e["table_id"] for e in plan["new"]],
"would_register": [e["table_id"] for e in plan["new"]],
"drift": drift_summary,
"invalid": plan["invalid"],
"source": "keboola",
"dry_run": True,
}
repo = TableRegistryRepository(conn)
registered = 0
skipped = 0
errors = 0
table_names = []
for table in discovered:
table_id = table.get("id", "").strip().lower().replace(".", "_").replace(" ", "_")
if not table_id:
errors += 1
continue
if repo.get(table_id):
skipped += 1
continue
for entry in plan["new"]:
try:
# Parse bucket from table ID (format: in.c-bucket.table_name)
parts = table.get("id", "").split(".")
bucket = parts[1] if len(parts) > 1 else ""
source_table = parts[2] if len(parts) > 2 else table.get("name", "")
repo.register(
id=table_id,
name=table.get("name", table_id),
id=entry["table_id"],
name=entry["name"],
source_type="keboola",
bucket=bucket,
source_table=source_table,
query_mode="local",
bucket=entry["bucket"],
source_table=entry["source_table"],
# Keboola goes through Storage API export-async via the
# materialized path (NULL source_query = full table). The
# legacy `local` mode for Keboola was retired in v26 and
# would no-op here anyway.
query_mode="materialized",
registered_by=user_email,
description=f"Auto-discovered from Keboola: {table.get('id', '')}",
description=f"Auto-discovered from Keboola: {entry['full_id']}",
)
registered += 1
table_names.append(table_id)
table_names.append(entry["table_id"])
except Exception as e:
logger.warning("Failed to register %s: %s", table_id, e)
logger.warning("Failed to register %s: %s", entry["table_id"], e)
errors += 1
if plan["existing_drift"]:
logger.warning(
"Auto-discover skipped %d row(s) where the admin-edited "
"bucket/source_table differs from discovery — preserving "
"the admin values. Run with dry_run=True to see the deltas.",
len(plan["existing_drift"]),
)
return {
"registered": registered,
"skipped": skipped,
"errors": errors,
"skipped": len(plan["existing_match"]),
"errors": errors + len(plan["invalid"]),
"drifted": len(plan["existing_drift"]),
"tables": table_names,
"drift": drift_summary,
"invalid": plan["invalid"],
"source": "keboola",
"dry_run": False,
}
@router.post("/discover-and-register")
async def discover_and_register(
dry_run: bool = False,
user: dict = Depends(require_admin),
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
"""Discover tables from configured source and auto-register them.
Combines discover-tables + register-table into one call.
Skips already-registered tables. Used by /setup wizard and AI agents.
Combines discover-tables + register-table into one call. Already-
registered rows are NEVER overwritten admin edits to bucket /
source_table win. The response surfaces a ``drift`` array listing
any rows where discovery would have written different coordinates
than what's in the registry, so operators can audit divergence
after a Keboola-side bucket rename / table move.
Query params:
- ``dry_run=true`` returns the plan without writing anything.
Lists ``would_register``, ``drift``, and ``invalid`` so an
operator can decide whether to proceed (or, in the drift case,
which side they want to fix).
Used by /setup wizard and AI agents.
"""
try:
result = _discover_and_register_tables(conn, user.get("email", "admin"))
result = _discover_and_register_tables(
conn, user.get("email", "admin"), dry_run=dry_run,
)
return result
except Exception as e:
raise HTTPException(status_code=500, detail=f"Discovery and registration failed: {e}")

View file

@ -4,12 +4,13 @@ import hashlib
import logging
import os
import subprocess
import threading
import traceback
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional, List
from typing import Any, Optional, List
from fastapi import APIRouter, Depends, HTTPException, BackgroundTasks
from fastapi import APIRouter, Body, Depends, HTTPException, BackgroundTasks
from pydantic import BaseModel
import duckdb
@ -25,6 +26,17 @@ from src.scheduler import filter_due_tables, is_table_due
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/sync", tags=["sync"])
# Process-wide guard against overlapping `_run_sync` invocations. Two
# concurrent extractor subprocesses both write `extract.duckdb` and fight
# for its file lock — the first sync stalls, the second crashes, and the
# `/api/health` check times out long enough that Docker flips the
# container to `unhealthy`, which (behind a `reverse_proxy` upstream)
# bricks external traffic until contention drains. The singleton-ness is
# enforced both in the trigger handler (return 409 fast, before the work
# is scheduled) and in `_run_sync` itself (defense in depth, in case
# something bypasses the handler).
_sync_lock = threading.Lock()
def _file_hash(path: Path) -> str:
if not path.exists():
@ -54,13 +66,23 @@ def _materialize_table(
)
def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
def _run_materialized_pass(
conn: duckdb.DuckDBPyConnection,
bq,
tables: Optional[List[str]] = None,
) -> dict:
"""Walk `table_registry` for `query_mode='materialized'` rows and run any
that are due, dispatching by ``source_type`` to the correct connector's
materialize_query. Honors per-table `sync_schedule` via `is_table_due()`,
computes the file hash inline, and updates `sync_state` so the manifest
can serve the row to `agnes pull` without re-hashing on every request.
``tables`` (when not None) restricts the pass to a specific subset
targeted re-syncs from the operator (POST /api/sync/trigger with a
body) need this, otherwise an admin asking to re-sync `kbc_job` would
re-process every other materialized row that's also due. Matched
against both the registry id and name (admins often pass either).
BigQuery rows go through BqAccess + bigquery_query() (jobs API),
optionally cost-guarded by ``max_bytes_per_materialize``.
Keboola rows go through KeboolaAccess + ATTACH-and-COPY, no
@ -107,6 +129,14 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
summary = {"materialized": [], "skipped": [], "errors": []}
keboola_access = None # lazy-init on first Keboola row
# Targeted-trigger filter. Compare against both id and name so an admin
# who passes either form (the registry id slug, or the human-friendly
# name) gets the same result. `None` means "no filter — process all
# due materialized rows".
target_set: Optional[set] = (
set(tables) if tables is not None else None
)
for row in registry.list_all():
if row.get("query_mode") != "materialized":
continue
@ -120,6 +150,14 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
# manifest because the lookup misses on `id`.
ref_name = row["name"]
if target_set is not None and not (
ref_name in target_set or row.get("id") in target_set
):
summary["skipped"].append(
{"table": ref_name, "reason": "not_in_target"}
)
continue
last = state.get_last_sync(ref_name)
last_iso = last.isoformat() if last else None
schedule = row.get("sync_schedule") or "every 1h"
@ -143,10 +181,18 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
)
elif source_type == "keboola":
if keboola_access is None:
from connectors.keboola.access import KeboolaAccess
# Lazy-init the Storage API client (replaces the old
# DuckDB extension `KeboolaAccess`). One client is shared
# across all keboola materialized rows in this pass —
# `requests.Session` inside it is thread-safe and reuses
# the connection pool for HTTP keep-alive across rows.
# Variable name kept as `keboola_access` to minimise
# diff churn against the surrounding error-handling
# block; the type is now `KeboolaStorageClient`.
from connectors.keboola.storage_api import KeboolaStorageClient
keboola_url = get_value(
"data_source", "keboola", "stack_url", default=""
) or ""
) or os.environ.get("KEBOOLA_STACK_URL", "")
token_env = get_value(
"data_source", "keboola", "token_env",
default="KEBOOLA_STORAGE_TOKEN",
@ -162,17 +208,32 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
),
})
continue
keboola_access = KeboolaAccess(
keboola_access = KeboolaStorageClient(
url=keboola_url, token=keboola_token,
)
kb_output_dir.mkdir(parents=True, exist_ok=True)
from connectors.keboola.extractor import (
materialize_query as kb_materialize_query,
)
# Storage API needs the bucket+table split — registry rows
# carry both fields per the standard register-table schema.
bucket = row.get("bucket", "")
source_table = row.get("source_table") or ref_name
if not bucket:
summary["errors"].append({
"table": ref_name,
"error": (
"materialized keboola row is missing 'bucket'; "
"re-register with --bucket <in.c-...>"
),
})
continue
kb_stats = kb_materialize_query(
table_id=ref_name,
sql=row["source_query"],
keboola_access=keboola_access,
bucket=bucket,
source_table=source_table,
source_query=row.get("source_query"),
storage_client=keboola_access,
output_dir=kb_output_dir,
)
# Normalize Keboola materialize_query output to the shape the
@ -258,9 +319,20 @@ def _run_sync(tables: Optional[List[str]] = None):
Reads table configs from DuckDB (in main process which has the shared
connection), passes them as JSON via stdin to the extractor subprocess.
This avoids DuckDB lock conflicts subprocess never opens system.duckdb.
Singleton: only one invocation runs at a time per process (see
`_sync_lock` module-level). The trigger handler also fast-fails with
409 when the lock is held, so this branch is defense in depth.
"""
import json as _json
import sys
import sys as _sys
if not _sync_lock.acquire(blocking=False):
print(
"[SYNC] another sync is already in flight — skipping",
file=_sys.stderr, flush=True,
)
return
try:
from app.instance_config import get_data_source_type, get_value
@ -287,7 +359,17 @@ def _run_sync(tables: Optional[List[str]] = None):
registry_has_tables = bool(table_configs)
else:
table_configs = repo.list_local(source_type) if source_type else repo.list_local()
registry_has_tables = bool(table_configs)
# Auto-discover gate must consider the WHOLE registry, not
# just `local` rows. After the Keboola migration to
# materialized (v25→v26), an instance can have 30
# materialized Keboola rows and zero local rows — but
# `bool(table_configs)` here would be False, and
# `not registry_has_tables` would re-trigger
# `_discover_and_register_tables` on every scheduler tick,
# creating duplicate "auto-discovered" rows with the wrong
# bucket prefix every time.
# Use list_all (any source, any mode) for the gate.
registry_has_tables = bool(repo.list_all())
# Without this filter, every scheduler tick would re-sync
# every table regardless of its sync_schedule cadence,
# making the field a no-op at trigger time. Tables with
@ -342,7 +424,6 @@ def _run_sync(tables: Optional[List[str]] = None):
)
env = {**os.environ}
import sys as _sys
if run_extractor_subprocess:
# Serialize configs — strip non-serializable fields
@ -353,8 +434,8 @@ def _run_sync(tables: Optional[List[str]] = None):
# Run extractor subprocess with table configs via stdin
# Subprocess does NOT open system.duckdb — no lock conflict
cmd = [sys.executable, "-c", """
import json, sys, os, logging
cmd = [_sys.executable, "-c", """
import json, sys, os, logging, signal
from pathlib import Path
# Subprocess inherits no logging config — without basicConfig, Python's
@ -364,6 +445,19 @@ from pathlib import Path
# Devin BUG_0002 on PR #136 review.
logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
# Convert SIGTERM into a controlled SystemExit so the ProcessPoolExecutor
# `with` block in connectors.keboola.extractor.run() runs its __exit__
# (shutdown/wait_for_workers) before this process dies. Without this,
# SIGTERM kills the parent abruptly, leaving the OS to clean up the pool
# children — but each worker holds an open Keboola Storage export job
# whose lifetime is tied to the HTTP poll loop, and those leak until the
# Keboola side TTLs them out. The parent extractor calls this from
# app.api.sync._run_sync after `subprocess.Popen(start_new_session=True)`
# + `os.killpg(SIGTERM)` on timeout.
def _exit_on_sigterm(signum, frame):
sys.exit(143)
signal.signal(signal.SIGTERM, _exit_on_sigterm)
configs = json.load(sys.stdin)
url = os.environ.get("KEBOOLA_STACK_URL", "")
token = os.environ.get("KEBOOLA_STORAGE_TOKEN", "")
@ -383,24 +477,53 @@ sys.exit(compute_exit_code(result, len(configs)))
print(f"[SYNC] Starting extractor subprocess for {len(table_configs)} tables", file=_sys.stderr, flush=True)
# Run in a new process group (start_new_session=True) so a
# timeout can take down the whole tree — the extractor itself
# plus any ProcessPoolExecutor workers it spawned for parallel
# legacy-fallback. Without this, plain `subprocess.run` on
# timeout SIGKILLs only the immediate child; the pool workers
# are reparented to PID 1 and continue holding open Keboola
# Storage export jobs, blocking the next sync cycle's
# connectivity to those same job IDs.
extractor_timeout = int(os.environ.get("AGNES_EXTRACTOR_TIMEOUT_SEC", "3600"))
proc = subprocess.Popen(
cmd,
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
text=True, env=env,
cwd=str(Path(__file__).parent.parent.parent),
start_new_session=True,
)
try:
result = subprocess.run(
cmd, input=_json.dumps(serializable), capture_output=True, text=True,
timeout=1800, env=env,
cwd=str(Path(__file__).parent.parent.parent),
)
stdout, stderr = proc.communicate(input=_json.dumps(serializable), timeout=extractor_timeout)
result = subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
except subprocess.TimeoutExpired:
# SIGTERM the whole process group first to give workers a
# chance to shut down cleanly (release Keboola export jobs,
# close DuckDB conns), then SIGKILL the stragglers after a
# short grace window.
import signal
try:
os.killpg(proc.pid, signal.SIGTERM)
except ProcessLookupError:
pass
try:
proc.communicate(timeout=10)
except subprocess.TimeoutExpired:
try:
os.killpg(proc.pid, signal.SIGKILL)
except ProcessLookupError:
pass
try:
proc.communicate(timeout=5)
except subprocess.TimeoutExpired:
pass
# Catch the timeout LOCALLY so the materialized BQ pass and
# orchestrator rebuild below still fire. Pre-fix the timeout
# orchestrator rebuild below still fire — pre-fix the timeout
# propagated to the outer except handler and skipped the rest
# of `_run_sync` — on a dual-source deployment a slow Keboola
# extractor would silently block all materialized parquets +
# master-view rebuild until the next trigger. Devin BUG_0001
# on PR #148 commit 2219255. Mirrors the per-custom-connector
# timeout pattern below (line ~347).
# of `_run_sync` (Devin BUG_0001 on PR #148 commit 2219255).
print(
"[SYNC] Extractor timed out after 1800s — continuing to "
"materialized pass + orchestrator rebuild",
f"[SYNC] Extractor timed out after {extractor_timeout}s — process "
"group killed; continuing to materialized pass + orchestrator rebuild",
file=_sys.stderr, flush=True,
)
result = None
@ -441,7 +564,7 @@ sys.exit(compute_exit_code(result, len(configs)))
logger.info("Running custom connector: %s", connector_dir.name)
try:
custom_result = subprocess.run(
[sys.executable, str(extractor)],
[_sys.executable, str(extractor)],
env=env, capture_output=True, text=True, timeout=600,
cwd=str(Path(__file__).parent.parent.parent),
)
@ -469,7 +592,9 @@ sys.exit(compute_exit_code(result, len(configs)))
bq_access = get_bq_access() # sentinel if no BQ project; OK
mat_conn = _get_system_db()
try:
mat_summary = _run_materialized_pass(mat_conn, bq_access)
mat_summary = _run_materialized_pass(
mat_conn, bq_access, tables=tables,
)
finally:
mat_conn.close()
skipped_count = len(mat_summary["skipped"])
@ -531,10 +656,16 @@ sys.exit(compute_exit_code(result, len(configs)))
print(f"[SYNC] Profiler skipped: {e}", file=_sys.stderr, flush=True)
except subprocess.TimeoutExpired:
print("[SYNC] Extractor timed out after 1800s", file=_sys.stderr, flush=True)
# Outer-handler fallback for any subprocess.run call site (e.g.
# custom-connectors below) that didn't already catch its own
# TimeoutExpired. Concrete timeout value isn't available here —
# log generically.
print("[SYNC] Extractor subprocess timed out", file=_sys.stderr, flush=True)
except Exception as e:
print(f"[SYNC] FAILED: {e}", file=_sys.stderr, flush=True)
traceback.print_exc()
finally:
_sync_lock.release()
# ---- Manifest ----
@ -617,15 +748,86 @@ async def sync_manifest(
return _build_manifest_for_user(conn, user)
# ---- Status ----
@router.get("/status")
async def sync_status():
"""Whether a sync is currently in flight on this app process.
Public (no auth) used by the host-side ``agnes-auto-upgrade.sh``
cron to decide whether to skip a `docker compose up -d` that would
kill a running extractor / materialized pass mid-flight. Cheap to
serve (single Lock.locked() check) and contains no sensitive data.
Returns:
``{"locked": bool}`` True if `_sync_lock` is currently held by
a `_run_sync` invocation. The host script defers the upgrade
when this is True and retries on the next 5-min cron tick.
"""
return {"locked": _sync_lock.locked()}
# ---- Trigger ----
@router.post("/trigger")
async def trigger_sync(
background_tasks: BackgroundTasks,
tables: Optional[List[str]] = None,
body: Optional[Any] = Body(None),
user: dict = Depends(require_admin),
):
"""Trigger data sync from configured source. Admin only. Runs in background."""
"""Trigger data sync from configured source. Admin only. Runs in background.
Body accepts three shapes (all optional empty body / `null` syncs
every registered table):
- ``["kbc_job", "orders"]`` bare JSON array of table ids
- ``{"tables": ["kbc_job", "orders"]}`` object with a ``tables``
key (matches the wire shape of the response, more discoverable
for clients building requests by hand)
- ``null`` / no body sync everything
Both array forms have shipped at different times; accepting both
keeps older clients (PR-build CLIs, helper scripts) working while
surfacing the shape that mirrors the response payload. Anything
else returns HTTP 422 with a structured detail.
Returns 409 if a previously-triggered sync is still running. Two
concurrent extractor subprocesses fight for the same `extract.duckdb`
file lock that contention starves uvicorn, makes `/api/health` time
out, flips the container to `unhealthy`, and (behind a `reverse_proxy`
upstream like the bundled Caddy overlay) bricks external traffic
until contention drains. Fast-fail here keeps that from happening.
"""
if body is None:
tables: Optional[List[str]] = None
elif isinstance(body, list):
tables = list(body)
elif isinstance(body, dict):
tables = body.get("tables")
if tables is not None and not isinstance(tables, list):
raise HTTPException(
status_code=422,
detail="`tables` must be a list of strings",
)
else:
raise HTTPException(
status_code=422,
detail=(
"body must be a list of table ids, an object with a "
"`tables` list, or null"
),
)
if tables is not None and not all(isinstance(t, str) for t in tables):
raise HTTPException(
status_code=422,
detail="all entries in `tables` must be strings",
)
if _sync_lock.locked():
raise HTTPException(
status_code=409,
detail="sync_already_in_progress",
)
background_tasks.add_task(_run_sync, tables)
return {
"status": "triggered",

View file

@ -124,25 +124,29 @@ def register_table(
else:
source_query = query.strip()
if query_mode == "materialized" and not source_query:
# Keboola materialized rows can omit --query: a NULL source_query means
# "full-table export via Storage API export-async" (see v25→v26
# migration notes). For BigQuery materialized rows, --query is still
# required — BQ has no analogous "full table" semantic at the registry
# layer (the path is a SELECT against `<project>.<dataset>.<table>`,
# which the admin must spell out).
if query_mode == "materialized" and not source_query and source_type != "keboola":
typer.echo(
"Error: --query-mode materialized requires --query (literal SQL or @path.sql)",
"Error: --query-mode materialized requires --query (literal SQL or @path.sql) for source_type=" + source_type,
err=True,
)
raise typer.Exit(2)
# Bucket is load-bearing for the BQ destination identifier on
# materialized rows. Without it, registration succeeds but
# subsequent `agnes schema <name>` builds `bq."".."<src>"` from
# the empty bucket and the server rejects with HTTP 400 "unsafe
# BQ identifier in registry". Catch this at register time so the
# operator gets a clear error pointing at the right knob.
# Bucket is load-bearing on materialized rows. For BQ it backs the
# destination identifier (`agnes schema <name>` builds `bq."<bucket>"."
# <src>"` from it; an empty bucket trips "unsafe BQ identifier in
# registry" at query time). For Keboola it's the bucket id passed to
# `/v2/storage/tables/<bucket>.<source_table>/export-async` — without
# it the export call would 404. Same requirement, different rationale.
if query_mode == "materialized" and not bucket:
typer.echo(
"Error: --query-mode materialized requires --bucket (the BQ "
"dataset for the destination identifier). Without it the row "
"registers but `agnes schema <name>` later fails with "
"'unsafe BQ identifier in registry'.",
"Error: --query-mode materialized requires --bucket (the "
"BQ dataset / Keboola bucket id for the source identifier).",
err=True,
)
raise typer.Exit(2)
@ -279,12 +283,20 @@ def discover_and_register(
typer.echo(f" [DRY RUN] {name:30s} bucket={bucket_id:20s} rows={t.get('rowsCount', 0):>10,}")
continue
# Keboola tables always go through the Storage API export-async
# path (`materialize_query`), which is `query_mode='materialized'`
# in the registry. A NULL source_query means "full table export"
# — same effective semantics the old 'local' mode gave, but via
# the Storage API instead of the DuckDB extension. See
# connectors/keboola/storage_api.py + the v25→v26 migration.
# Other connectors keep their per-source default.
default_mode = "materialized" if source_type == "keboola" else "local"
resp = api_post("/api/admin/register-table", json={
"name": name,
"source_type": source_type,
"bucket": bucket_id,
"source_table": name,
"query_mode": "local",
"query_mode": default_mode,
"description": f"Auto-discovered from {source_type}",
})

View file

@ -24,7 +24,11 @@ class KeboolaAccess:
def __init__(self, *, url: str, token: str) -> None:
if not url or not token:
raise ValueError("KeboolaAccess requires url and token")
self._url = url
# Strip trailing slash — the Keboola DuckDB extension's ATTACH
# fails with a network error when the URL ends in `/` (e.g. the
# canonical `https://connection.<region>.keboola.com/` form).
# Bare host works.
self._url = url.rstrip("/")
self._token = token
@contextmanager

View file

@ -4,7 +4,7 @@ import logging
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Dict, Any
from typing import List, Dict, Any, Optional
import duckdb
@ -19,70 +19,253 @@ logger = logging.getLogger(__name__)
def materialize_query(
table_id: str,
sql: str,
*,
keboola_access, # KeboolaAccess (avoid circular import)
bucket: str,
source_table: str,
source_query: Optional[str] = None,
storage_client=None, # KeboolaStorageClient (avoid circular import)
keboola_url: Optional[str] = None,
keboola_token: Optional[str] = None,
output_dir: Path,
) -> dict:
"""Materialize an admin-registered SELECT against the Keboola Storage
API extension into a parquet file.
"""Materialize a Keboola Storage table to a local parquet via Storage API.
Parallel of `connectors/bigquery/extractor.py:materialize_query`.
Cost guardrail: the Keboola extension has no analog of BQ dry-run;
Storage API cost is download-shaped (per-byte egress + Storage API
job). Phase B ships without a guardrail and logs the byte count;
a future PR can add a configurable `max_bytes_per_keboola_materialize`
gate similar to BQ's `max_bytes_per_materialize`.
Replaces the previous DuckDB-extension path. The extension's QueryService
scan is unreliable on linked-bucket projects (keboola/duckdb-extension#17;
fix shipped upstream as v0.1.6 but not yet in the community CDN, and on
flag-restricted projects the pre-fix workspace role wouldn't have GRANTs
on the bucket schema anyway). The Storage API export-async path always
works regardless of project flags.
Parallel of `connectors/bigquery/extractor.py:materialize_query` in
surface same return shape, same atomic write, same MD5 contract but
the inputs differ because Keboola's structured filter spec replaces
BQ's free-form SQL.
Args:
table_id: parquet filename + sync_state key (must be a safe ident).
bucket: Keboola bucket id, e.g. ``in.c-crm``.
source_table: table id within the bucket, e.g. ``orders``.
source_query: optional JSON string with a Storage API filter spec
(see `storage_api.ExportFilter`). Empty / NULL = full table.
storage_client: pre-built `KeboolaStorageClient` (preferred lets
sync.py share one across rows). When omitted, ``keboola_url``
and ``keboola_token`` are used to construct a one-shot client.
keboola_url, keboola_token: alternative to ``storage_client`` for
single-call usage (tests, ad-hoc).
output_dir: directory to write `<table_id>.parquet`.
Returns:
``{"table_id", "path", "rows", "bytes", "md5"}`` same shape the
BQ branch returns, so ``app/api/sync.py:_run_materialized_pass``
downstream code stays uniform.
"""
import re
import hashlib
import json
import duckdb
# Defense: table_id is interpolated into the parquet filename.
# Reject anything that's not a safe identifier.
if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_id):
raise ValueError(f"unsafe table_id for materialize: {table_id!r}")
parquet_path = Path(output_dir) / f"{table_id}.parquet"
tmp_path = Path(output_dir) / f"{table_id}.parquet.tmp"
if tmp_path.exists():
tmp_path.unlink()
safe_tmp_lit = str(tmp_path).replace("'", "''")
# Lazy import to avoid pulling `requests` at module import time when only
# the sync trigger imports `extractor` for `run()`.
from connectors.keboola.storage_api import (
FILE_TYPE_CSV, FILE_TYPE_PARQUET, ExportFilter, KeboolaStorageClient,
)
# Atomic write — mirror BQ's pattern at connectors/bigquery/extractor.py:370.
# COPY into a `.parquet.tmp`, hash + size from the tmp file, only swap to
# the final path on success. A mid-COPY failure (network, disk full,
# extension crash) leaves no partial parquet at the canonical path that
# the orchestrator rebuild would pick up. Devin finding 2026-05-01:
# BUG_pr-review-job-3fbd31c9_0003.
with keboola_access.duckdb_session() as conn:
if storage_client is None:
if not (keboola_url and keboola_token):
raise ValueError(
"materialize_query requires either storage_client or "
"(keboola_url + keboola_token)"
)
storage_client = KeboolaStorageClient(url=keboola_url, token=keboola_token)
# Filter spec is optional. Admin can register a row with no
# source_query at all (= full-table export), or with a JSON object
# describing whereFilters / columns / changedSince / file_type.
payload: dict = {}
if source_query:
try:
conn.execute(f"COPY ({sql}) TO '{safe_tmp_lit}' (FORMAT PARQUET)")
row_count = conn.execute(
f"SELECT COUNT(*) FROM read_parquet('{safe_tmp_lit}')"
).fetchone()[0]
except Exception:
if tmp_path.exists():
tmp_path.unlink()
raise
payload = json.loads(source_query)
except json.JSONDecodeError as e:
raise ValueError(
f"source_query for {table_id} is not valid JSON: {e}"
) from e
export_filter = ExportFilter.from_dict(payload)
# Streaming MD5 — never read the entire parquet into memory. Keboola
# materialized results can reach multi-GB sizes (admin-aggregated
# subsets); hashing in 8 KiB chunks keeps memory bounded. Mirror of BQ's
# streaming hash at connectors/bigquery/extractor.py:438. Devin finding
# 2026-05-01: BUG_pr-review-job-3fbd31c9_0002.
# Default the materialized path to parquet — Storage API serves it
# via native Snowflake UNLOAD, the extractor renames it into place,
# no CSV intermediate, no DuckDB COPY, no peak-memory load. Admin
# can pin `{"file_type":"csv"}` in source_query to fall back (legacy
# debugging, or projects whose backend can't UNLOAD parquet — none
# known today, but the escape hatch costs nothing). Only override
# when the admin spec didn't *explicitly* set a file_type.
if "file_type" not in payload and "fileType" not in payload:
export_filter.file_type = FILE_TYPE_PARQUET
output_dir = Path(output_dir)
output_dir.mkdir(parents=True, exist_ok=True)
parquet_path = output_dir / f"{table_id}.parquet"
tmp_parquet = output_dir / f"{table_id}.parquet.tmp"
if tmp_parquet.exists():
tmp_parquet.unlink()
# Per-call temp dir for the intermediate file (CSV or parquet) —
# separates concurrent exports cleanly without the os.chdir() race
# the kbcstorage SDK has. ``ignore_cleanup_errors=True`` keeps
# disk-full / permission errors from masking the original
# exception, and prevents a half-cleaned dir from sitting around
# forever (a 12 GiB stale slice tree was seen after a worker died
# mid-write on a saturated boot disk). ``dir=get_temp_root()``
# routes to ``AGNES_TEMP_DIR`` when the operator has steered
# tempfiles off the overlayfs (e.g. onto the data disk) — see
# storage_api.get_temp_root for the rationale.
import tempfile
from connectors.keboola.storage_api import get_temp_root
with tempfile.TemporaryDirectory(
prefix=f"kbc-export-{table_id}-",
dir=get_temp_root(),
ignore_cleanup_errors=True,
) as tmpdir:
full_table_id = f"{bucket}.{source_table}"
if export_filter.file_type == FILE_TYPE_PARQUET:
# Native parquet path. Storage API serves Snowflake UNLOAD
# output directly. Two shapes to handle:
#
# 1. **Single file** (small exports): file_info.url points at
# one signed URL; download to tmp_parquet and we're done.
# 2. **Sliced** (large exports — Snowflake UNLOAD respects
# MAX_FILE_SIZE, default 16 MiB, so anything past that
# arrives as a manifest of N parquet slices). Each slice
# is itself a complete parquet file with its own footer;
# naively concatenating them like CSV would be invalid.
# We download all slices into the per-call tempdir, then
# DuckDB-COPY across `read_parquet([slice1, slice2, ...])`
# into one consolidated tmp_parquet. DuckDB streams row
# groups during this consolidation — peak memory is one
# row group (~1 MiB), not the full table.
stats = storage_client.prepare_export(
full_table_id, export_filter=export_filter,
)
file_info = stats["file_info"]
if file_info.get("isSliced"):
slice_dir = Path(tmpdir) / "slices"
slice_paths = storage_client.download_file_slices(
file_info, slice_dir
)
if not slice_paths:
raise RuntimeError(
f"sliced parquet export for {full_table_id} "
f"yielded no slices"
)
quoted = ", ".join(
"'" + str(p).replace("'", "''") + "'" for p in slice_paths
)
safe_tmp = str(tmp_parquet).replace("'", "''")
conv = duckdb.connect()
try:
conv.execute(
f"COPY (SELECT * FROM read_parquet([{quoted}])) "
f"TO '{safe_tmp}' (FORMAT PARQUET)"
)
finally:
conv.close()
else:
storage_client.download_file(file_info, tmp_parquet)
stats["bytes"] = (
tmp_parquet.stat().st_size if tmp_parquet.exists() else 0
)
if not tmp_parquet.exists() or tmp_parquet.stat().st_size == 0:
logger.warning(
"Storage API parquet export for %s returned no data "
"(filter may be too restrictive)",
full_table_id,
)
# Empty placeholder parquet so the orchestrator doesn't
# choke on a missing file.
duckdb.connect().execute(
f"COPY (SELECT 1 AS _empty WHERE FALSE) TO '{tmp_parquet}' (FORMAT PARQUET)"
).close()
else:
# Legacy CSV path. Kept for the explicit `{"file_type":"csv"}`
# opt-in. Slower (CSV parse + parquet rewrite) and
# memory-heavier (DuckDB pulls the CSV into a buffer with
# max_line_size headroom), but doesn't depend on Storage
# API parquet support if a future project backend lacks it.
csv_path = Path(tmpdir) / f"{table_id}.csv"
stats = storage_client.export_table(
full_table_id, csv_path, export_filter=export_filter,
)
if not csv_path.exists() or csv_path.stat().st_size == 0:
logger.warning(
"Storage API CSV export for %s returned no data "
"(filter may be too restrictive)",
full_table_id,
)
duckdb.connect().execute(
f"COPY (SELECT 1 AS _empty WHERE FALSE) TO '{tmp_parquet}' (FORMAT PARQUET)"
).close()
else:
# CSV → parquet via DuckDB. `all_varchar=True` matches the
# legacy client's behavior — preserves the source's exact
# character data without DuckDB's type inference rewriting
# numeric-looking strings (e.g. "Non-Manager") as NULL.
#
# `max_line_size=64MB` overrides DuckDB's default 2 MB cap
# on any single CSV line. Keboola tables that store
# embedded JSON / SQL transformation bodies routinely
# have multi-MB cells (e.g. `kbc_component_configuration`
# rows ship full Snowflake transformation SQL inline as
# a JSON column value); the default 2 MB ceiling rejects
# them with `Maximum line size of 2000000 bytes
# exceeded`. 64 MB is generous enough to absorb any
# reasonable embedded blob; DuckDB allocates a single
# buffer of this size per worker thread.
safe_csv = str(csv_path).replace("'", "''")
safe_tmp = str(tmp_parquet).replace("'", "''")
try:
conv = duckdb.connect()
conv.execute(
f"COPY (SELECT * FROM read_csv('{safe_csv}', "
f"all_varchar=true, max_line_size=67108864)) "
f"TO '{safe_tmp}' (FORMAT PARQUET)"
)
conv.close()
except Exception:
if tmp_parquet.exists():
tmp_parquet.unlink()
raise
# Row count from the parquet, not from `stats["rows"]` — Storage API
# sometimes omits totalRowsCount on small results, and the parquet is
# the authoritative count we'll be serving downstream anyway.
safe_tmp = str(tmp_parquet).replace("'", "''")
cnt_conn = duckdb.connect()
try:
row_count = cnt_conn.execute(
f"SELECT COUNT(*) FROM read_parquet('{safe_tmp}')"
).fetchone()[0]
finally:
cnt_conn.close()
# Streaming MD5 — bounded memory regardless of parquet size.
h = hashlib.md5()
with open(tmp_path, "rb") as f:
with open(tmp_parquet, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
h.update(chunk)
md5 = h.hexdigest()
size = tmp_path.stat().st_size
size = tmp_parquet.stat().st_size
os.replace(tmp_path, parquet_path)
os.replace(tmp_parquet, parquet_path)
if row_count == 0:
logger.warning(
"Materialized Keboola query for %s wrote 0 rows — verify the "
"SQL filters and that the source bucket has data.",
"Materialized Keboola export for %s wrote 0 rows — verify the "
"filter and that the source bucket has data.",
table_id,
)
@ -128,7 +311,12 @@ def _try_attach_extension(conn: duckdb.DuckDBPyConnection, keboola_url: str, keb
try:
conn.execute("INSTALL keboola FROM community; LOAD keboola;")
escaped_token = keboola_token.replace("'", "''")
conn.execute(f"ATTACH '{keboola_url}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
# Strip trailing slash — the Keboola DuckDB extension's ATTACH fails
# with a network error when the URL ends in `/` (e.g. the canonical
# `https://connection.us-east4.gcp.keboola.com/` form). Bare host
# works.
attach_url = keboola_url.rstrip("/")
conn.execute(f"ATTACH '{attach_url}' AS kbc (TYPE keboola, TOKEN '{escaped_token}')")
logger.info("Using DuckDB Keboola extension")
return True
except Exception as e:
@ -163,6 +351,11 @@ def run(output_dir: str, table_configs: List[Dict[str, Any]], keboola_url: str,
stats = {"tables_extracted": 0, "tables_failed": 0, "errors": []}
now = datetime.now(timezone.utc)
# Per-table workitems whose extension scan failed and need the legacy
# Storage-API fallback. Drained in a parallel pool below the per-table
# serial loop. Items are `(tc, pq_path)` tuples.
legacy_queue: List[tuple] = []
try:
# Try DuckDB Keboola extension
use_extension = _try_attach_extension(conn, keboola_url, keboola_token)
@ -244,30 +437,25 @@ def run(output_dir: str, table_configs: List[Dict[str, Any]], keboola_url: str,
# (`Schema '..."in.c-..."' does not exist or not
# authorized`, see keboola/duckdb-extension#17). The
# legacy Storage-API client doesn't go through
# QueryService at all, so retry there.
# QueryService at all, so queue for the parallel
# legacy fallback below.
logger.warning(
"Keboola extension scan failed for %s (%s); retrying via legacy Storage-API client",
"Keboola extension scan failed for %s (%s); queued for legacy Storage-API fallback",
table_name, ext_err,
)
_extract_via_legacy(tc, pq_path, keboola_url, keboola_token)
legacy_queue.append((tc, pq_path))
continue
else:
_extract_via_legacy(tc, pq_path, keboola_url, keboola_token)
legacy_queue.append((tc, pq_path))
continue
# Get row count and file size. pq_path is built from the
# validated table_name above, but escape the parquet path
# literal for defense-in-depth.
safe_pq_lit = pq_path.replace("'", "''")
rows = conn.execute(f"SELECT count(*) FROM read_parquet('{safe_pq_lit}')").fetchone()[0]
size = os.path.getsize(pq_path)
# Create view and register in _meta
conn.execute(f"CREATE OR REPLACE VIEW \"{table_name}\" AS SELECT * FROM read_parquet('{safe_pq_lit}')")
conn.execute(
"INSERT INTO _meta VALUES (?, ?, ?, ?, ?, 'local')",
[table_name, tc.get("description", ""), rows, size, now],
)
# Extension path succeeded — register _meta synchronously.
_register_local_meta(conn, tc, pq_path, now)
stats["tables_extracted"] += 1
logger.info("Extracted %s: %d rows, %d bytes", table_name, rows, size)
rows_log = conn.execute(
f"SELECT count(*) FROM read_parquet('{pq_path.replace(chr(39), chr(39)*2)}')"
).fetchone()[0]
logger.info("Extracted %s via extension: %d rows", table_name, rows_log)
except Exception as e:
logger.error("Failed to extract %s: %s", table_name, e)
@ -281,6 +469,61 @@ def run(output_dir: str, table_configs: List[Dict[str, Any]], keboola_url: str,
except Exception:
pass
# Phase 2: legacy fallback in parallel. Keboola Storage API export
# jobs are independent per table — a worker pool of N workers fans
# out the per-table HTTP roundtrips (export job submit + poll +
# CSV download) instead of stacking them sequentially. Project-level
# concurrency is bounded by the storage.jobsParallelism limit
# (typically 10); default to 4 to leave headroom for other clients.
# Override via AGNES_KEBOOLA_PARALLELISM env var.
#
# Workers are PROCESSES, not threads — `connectors/keboola/client.py:
# export_table` does `os.chdir(temp_dir)` to redirect kbcstorage's
# slice-file downloads into a per-call temp directory, and `os.chdir`
# is process-global. With threads, two parallel exports race on CWD
# and slice files end up in the wrong directory; the merge step then
# fails with `[Errno 2] No such file or directory:
# '<job_id>.csv_X_Y_Z.csv'`. ProcessPoolExecutor gives each worker
# its own process and therefore its own CWD.
if legacy_queue:
parallelism = max(1, int(os.environ.get("AGNES_KEBOOLA_PARALLELISM", "8")))
workers = min(parallelism, len(legacy_queue))
logger.info(
"Running legacy Storage-API fallback for %d tables across %d worker processes",
len(legacy_queue), workers,
)
if workers == 1:
legacy_results = [_legacy_worker(item, keboola_url, keboola_token) for item in legacy_queue]
else:
from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor(max_workers=workers) as ex:
futures = [ex.submit(_legacy_worker, item, keboola_url, keboola_token) for item in legacy_queue]
legacy_results = [f.result() for f in futures]
# Phase 3: serial _meta insert for legacy results. DuckDB conn
# isn't thread-safe, so we collect parallel work and only touch
# `conn` (and `stats`) here on the main thread.
for tc_, pq_, err in legacy_results:
tn = tc_["name"]
if err is not None:
logger.error("Failed to extract %s via legacy: %s", tn, err)
stats["tables_failed"] += 1
stats["errors"].append({"table": tn, "error": err})
continue
try:
_register_local_meta(conn, tc_, pq_, now)
stats["tables_extracted"] += 1
rows_log = conn.execute(
f"SELECT count(*) FROM read_parquet('{pq_.replace(chr(39), chr(39)*2)}')"
).fetchone()[0]
logger.info("Extracted %s via legacy: %d rows", tn, rows_log)
except Exception as e:
logger.error("Failed to register _meta for %s: %s", tn, e)
stats["tables_failed"] += 1
stats["errors"].append({"table": tn, "error": str(e)})
finally:
conn.execute("CHECKPOINT")
conn.close()
@ -302,6 +545,29 @@ def run(output_dir: str, table_configs: List[Dict[str, Any]], keboola_url: str,
return stats
def _register_local_meta(
conn: duckdb.DuckDBPyConnection,
tc: Dict[str, Any],
pq_path: str,
extracted_at: datetime,
) -> None:
"""After a parquet has been written for a local-mode table, create the
DuckDB view and register the row in `_meta`. Hoisted out of the run()
body so both the serial extension-success path and the parallel
legacy-result path share one implementation."""
table_name = tc["name"]
safe_pq_lit = pq_path.replace("'", "''")
rows = conn.execute(f"SELECT count(*) FROM read_parquet('{safe_pq_lit}')").fetchone()[0]
size = os.path.getsize(pq_path)
conn.execute(
f'CREATE OR REPLACE VIEW "{table_name}" AS SELECT * FROM read_parquet(\'{safe_pq_lit}\')'
)
conn.execute(
"INSERT INTO _meta VALUES (?, ?, ?, ?, ?, 'local')",
[table_name, tc.get("description", ""), rows, size, extracted_at],
)
def _extract_via_extension(conn: duckdb.DuckDBPyConnection, tc: Dict[str, Any], pq_path: str) -> None:
"""Extract a table using the DuckDB Keboola extension."""
bucket = tc.get("bucket", "")
@ -315,35 +581,75 @@ def _extract_via_extension(conn: duckdb.DuckDBPyConnection, tc: Dict[str, Any],
conn.execute(f'COPY (SELECT * FROM kbc."{bucket}"."{source_table}") TO \'{safe_pq_lit}\' (FORMAT PARQUET)')
def _extract_via_legacy(tc: Dict[str, Any], pq_path: str, keboola_url: str, keboola_token: str) -> None:
"""Fallback: extract using legacy Keboola client (kbcstorage SDK)."""
from connectors.keboola.client import KeboolaClient
client = KeboolaClient(token=keboola_token, url=keboola_url)
# Export to CSV temp file, then convert to parquet via DuckDB
import tempfile
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
csv_path = tmp.name
def _legacy_worker(tc_pq, keboola_url: str, keboola_token: str):
"""Module-level wrapper for ProcessPoolExecutor — must be picklable.
Returns `(tc, pq_path, error_str_or_None)` so the main process can
aggregate results and update _meta serially on its DuckDB connection.
"""
tc_, pq_ = tc_pq
try:
# Construct full Keboola table ID: bucket.source_table (e.g., in.c-finance.circle)
bucket = tc.get("bucket", "")
source_table = tc.get("source_table", tc["name"])
table_id = f"{bucket}.{source_table}" if bucket else tc.get("id", tc["name"])
client.export_table(table_id, Path(csv_path))
_extract_via_legacy(tc_, pq_, keboola_url, keboola_token)
return (tc_, pq_, None)
except Exception as exc:
return (tc_, pq_, str(exc))
# Convert CSV to Parquet using DuckDB — all_varchar avoids type inference errors
# (e.g. columns with mostly numeric values but some strings like "Non-Manager")
conv_conn = duckdb.connect()
conv_conn.execute(
f"COPY (SELECT * FROM read_csv('{csv_path}', all_varchar=true)) TO '{pq_path}' (FORMAT PARQUET)"
)
conv_conn.close()
finally:
if os.path.exists(csv_path):
os.unlink(csv_path)
def _extract_via_legacy(tc: Dict[str, Any], pq_path: str, keboola_url: str, keboola_token: str) -> None:
"""Per-table extract via the Storage API export-async path.
Despite the name (kept for caller compatibility with `_legacy_worker`),
this no longer goes through the `kbcstorage` SDK it talks to the
Storage API directly via `connectors/keboola/storage_api.py`. The old
SDK path had a thread-unsafe `os.chdir(temp_dir)` that broke parallel
execution; the direct path uses per-call temp directories and signed-URL
downloads, so threads / processes don't trip on each other.
Same surface as before `(tc, pq_path, url, token) writes parquet at
pq_path` so callers (including the parallel `_legacy_worker`) don't
need to change.
"""
import tempfile
from connectors.keboola.storage_api import KeboolaStorageClient, get_temp_root
bucket = tc.get("bucket", "")
source_table = tc.get("source_table", tc["name"])
table_id = f"{bucket}.{source_table}" if bucket else tc.get("id", tc["name"])
with tempfile.TemporaryDirectory(
prefix=f"kbc-export-{tc['name']}-",
dir=get_temp_root(),
ignore_cleanup_errors=True,
) as tmpdir:
csv_path = Path(tmpdir) / f"{tc['name']}.csv"
client = KeboolaStorageClient(url=keboola_url, token=keboola_token)
client.export_table_to_csv(table_id, csv_path)
if not csv_path.exists() or csv_path.stat().st_size == 0:
# Storage API succeeded but produced no rows. Emit an empty
# parquet rather than crashing — same defensive behavior as
# `materialize_query`.
duckdb.connect().execute(
f"COPY (SELECT 1 AS _empty WHERE FALSE) TO '{pq_path}' (FORMAT PARQUET)"
).close()
return
# all_varchar=true preserves the source's exact character data —
# matches what the kbcstorage path used to do, prevents DuckDB
# type inference from rewriting numeric-looking strings as NULL.
# max_line_size=64MB overrides DuckDB's 2MB default; matches the
# materialize_query path. See comment there for rationale.
safe_csv = str(csv_path).replace("'", "''")
safe_pq = pq_path.replace("'", "''")
conv = duckdb.connect()
try:
conv.execute(
f"COPY (SELECT * FROM read_csv('{safe_csv}', "
f"all_varchar=true, max_line_size=67108864)) "
f"TO '{safe_pq}' (FORMAT PARQUET)"
)
finally:
conv.close()
def compute_exit_code(stats: Dict[str, Any], total: int) -> int:

View file

@ -0,0 +1,696 @@
"""Lightweight Keboola Storage API client for table export.
The DuckDB Keboola extension was the originally-intended fast path, but on
projects with the `block-shared-snowflake-access` feature flag and on linked
buckets the per-session workspace can't see the bucket schemas at all
(keboola/duckdb-extension#17, fixed upstream in v0.1.6 but not yet in the
community CDN as of 2026-05-06). The `kbcstorage` SDK works but uses
`os.chdir(temp_dir)` to redirect slice downloads, which is process-global
threaded fan-out races on CWD and slice files land in the wrong directory.
This module talks to Storage API directly and downloads via signed URLs:
- POST /v2/storage/tables/{id}/export-async
- GET /v2/storage/jobs/{id} (poll until success/error)
- GET /v2/storage/files/{id}?federationToken=1
- GET <signed_url> (single file or manifest + per-slice URLs for sliced)
No `os.chdir`, no boto3/azure-blob/google-cloud-storage SDK dependencies
the federation-token detail response includes a signed URL that works for
all three cloud backends. Thread-safe: each call uses an independent
download path under a per-call temp directory.
Storage API reference:
- https://keboola.docs.apiary.io/#reference/tables/asynchronous-table-export
- https://keboola.docs.apiary.io/#reference/jobs
- https://keboola.docs.apiary.io/#reference/files/manage-files/file-detail
"""
from __future__ import annotations
import gzip
import logging
import os
import shutil
import tempfile
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterable, List, Optional
import requests
logger = logging.getLogger(__name__)
# Storage API guarantees export jobs are created small and finish in seconds
# to a few minutes for typical bucket-table sizes; the absolute upper bound
# (very large tables, peak Snowflake load) is the operator's
# storage.jobsParallelism + scan duration. 30 min is a generous ceiling that
# matches what the dashboard's data-preview UI would also wait for.
_DEFAULT_EXPORT_TIMEOUT_SEC = int(os.environ.get("AGNES_KEBOOLA_EXPORT_TIMEOUT_SEC", "1800"))
_DEFAULT_POLL_INTERVAL_SEC = float(os.environ.get("AGNES_KEBOOLA_POLL_INTERVAL_SEC", "2"))
# Per-slice HTTP download timeout — separate from the export-job timeout.
# Sliced exports return a manifest of signed URLs; an individual slice is
# bounded in size by Storage API's slicer (typically ~100 MiB), so a few
# minutes is plenty for one HTTP GET.
_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC = int(
os.environ.get("AGNES_KEBOOLA_SLICE_TIMEOUT_SEC", "300")
)
def get_temp_root() -> Optional[str]:
"""Return the parent dir for per-call tempdirs, or None to use the
system default.
Reads ``AGNES_TEMP_DIR`` (compose env, single source of truth) and
creates the dir if it does not yet exist. Default behaviour
(``AGNES_TEMP_DIR`` unset) preserves the OSS pre-fix path
``tempfile.TemporaryDirectory(...)`` falls back to the platform's
`tmpdir` (typically ``/tmp``).
The agnes-dev cutover surfaced why this knob matters: the
container's ``/tmp`` lives on the boot disk's overlayfs (29 GiB
on agnes-dev, shared with /var), so a multi-slice Snowflake
UNLOAD of a wide table fills it long before the dedicated 20 GiB
data disk at ``/data`` would. Setting ``AGNES_TEMP_DIR=/data/tmp``
routes the staging dir to the data disk where the parquets are
going anyway, no extra mount required (the data disk is already
bind-mounted).
"""
root = os.environ.get("AGNES_TEMP_DIR", "").strip()
if not root:
return None
# Best-effort mkdir — if the parent isn't writable we let
# tempfile.TemporaryDirectory raise the real OSError later with
# the underlying detail. Avoids a silent fall-through to /tmp.
try:
os.makedirs(root, exist_ok=True)
except OSError as e:
logger.warning(
"AGNES_TEMP_DIR=%r not creatable (%s); tempfiles fall back "
"to system default. Set the env to a writable path or unset "
"to silence this warning.", root, e,
)
return None
return root
FILE_TYPE_CSV = "csv"
FILE_TYPE_PARQUET = "parquet"
_VALID_FILE_TYPES = {FILE_TYPE_CSV, FILE_TYPE_PARQUET}
@dataclass
class ExportFilter:
"""Structured Keboola Storage API filter spec.
Mirrors the BQ materialized path's `source_query` SQL string conceptually
both let the admin scope an extracted table but Storage API takes a
structured filter object rather than free-form SQL. Empty fields all
map to "no filter" so a default-constructed ExportFilter exports the
full table.
Operators per Apiary docs: eq, ne, in, notIn, ge, gt, le, lt.
`file_type` controls the format Storage API materializes into File
Storage. `parquet` is the recommended path for the materialized sync:
Keboola serves the parquet directly (UNLOADed from Snowflake), the
extractor renames it into place no CSV intermediate, no DuckDB
COPY, no peak-memory load. Falls back to CSV when an admin pins
`{"file_type":"csv"}` in source_query (e.g. for projects whose
backend can't UNLOAD parquet, or legacy debugging).
"""
where_filters: List[dict] = field(default_factory=list)
columns: List[str] = field(default_factory=list)
changed_since: Optional[str] = None
changed_until: Optional[str] = None
limit: Optional[int] = None
file_type: str = FILE_TYPE_CSV
def __post_init__(self):
if self.file_type not in _VALID_FILE_TYPES:
raise ValueError(
f"file_type must be one of {sorted(_VALID_FILE_TYPES)}, "
f"got {self.file_type!r}"
)
@classmethod
def from_dict(cls, data: Optional[dict]) -> "ExportFilter":
"""Parse from `table_registry.source_query` JSON. Tolerates None /
empty / unknown keys (registry stores admin input that may be sparse)."""
if not data:
return cls()
if not isinstance(data, dict):
raise ValueError(
f"ExportFilter.from_dict expects a dict, got {type(data).__name__}"
)
# Accept both `file_type` (preferred, matches the rest of the
# snake_case API) and `fileType` (matches Storage API wire name)
# so an admin who copies an example from Apiary docs doesn't trip.
ft = data.get("file_type") or data.get("fileType") or FILE_TYPE_CSV
return cls(
where_filters=list(data.get("where_filters") or []),
columns=list(data.get("columns") or []),
changed_since=data.get("changed_since"),
changed_until=data.get("changed_until"),
limit=data.get("limit"),
file_type=ft,
)
def to_export_params(self) -> dict:
"""Serialize for POST body of `/tables/{id}/export-async`.
whereFilters arrives as a list of `{column, operator, values}` dicts;
Storage API also accepts a single `whereColumn`/`whereOperator`/
`whereValues` triple but the multi-filter form is more general.
"""
params: dict = {}
if self.where_filters:
# Validate shape lightly — surface admin typos as ValueError
# rather than letting them turn into a 400 from Keboola's API
# without context.
for i, f in enumerate(self.where_filters):
if not isinstance(f, dict):
raise ValueError(f"where_filters[{i}] must be a dict")
missing = {"column", "operator", "values"} - set(f.keys())
if missing:
raise ValueError(
f"where_filters[{i}] missing fields: {sorted(missing)}"
)
if not isinstance(f["values"], list):
raise ValueError(f"where_filters[{i}].values must be a list")
params["whereFilters"] = self.where_filters
if self.columns:
params["columns"] = ",".join(self.columns)
if self.changed_since:
params["changedSince"] = self.changed_since
if self.changed_until:
params["changedUntil"] = self.changed_until
if self.limit is not None:
params["limit"] = int(self.limit)
# Only emit fileType when non-default — keeps the request body
# quiet for legacy callers that never knew about parquet, and
# matches the wire-side default behaviour.
if self.file_type and self.file_type != FILE_TYPE_CSV:
params["fileType"] = self.file_type
return params
class StorageApiError(RuntimeError):
"""Wraps a non-2xx Storage API response with the parsed body for context."""
def __init__(self, message: str, status: Optional[int] = None, body: Any = None):
super().__init__(message)
self.status = status
self.body = body
class KeboolaStorageClient:
"""Thread-safe Storage API client for table export.
One instance can be reused across threads `requests.Session` is
thread-safe when the underlying `HTTPAdapter`'s pool size is sized for
concurrent calls. Default `pool_connections=20, pool_maxsize=20`
accommodates the typical AGNES_KEBOOLA_PARALLELISM=8 plus headroom.
"""
def __init__(self, *, url: str, token: str, session: Optional[requests.Session] = None):
if not url or not token:
raise ValueError("KeboolaStorageClient requires url and token")
# The DuckDB Keboola extension's ATTACH chokes on a trailing slash
# (`https://connection.<region>.keboola.com/`); the Storage API
# tolerates either form, but normalising here keeps URL composition
# below predictable.
self.base = url.rstrip("/") + "/v2/storage"
self.token = token
if session is None:
session = requests.Session()
adapter = requests.adapters.HTTPAdapter(
pool_connections=20, pool_maxsize=20
)
session.mount("http://", adapter)
session.mount("https://", adapter)
self.session = session
# ---- low-level HTTP helpers -------------------------------------------
def _headers(self) -> dict:
return {"X-StorageApi-Token": self.token, "Accept": "application/json"}
def _get(self, path: str, **kwargs) -> dict:
url = f"{self.base}{path}"
resp = self.session.get(url, headers=self._headers(), timeout=30, **kwargs)
return self._parse(resp, "GET", url)
def _post(self, path: str, *, data: Optional[dict] = None) -> dict:
url = f"{self.base}{path}"
resp = self.session.post(
url, headers=self._headers(), data=data, timeout=30
)
return self._parse(resp, "POST", url)
def _parse(self, resp: requests.Response, method: str, url: str) -> dict:
try:
body = resp.json()
except Exception:
body = resp.text
if resp.status_code >= 400:
# Redact the token if it accidentally surfaces in an error body.
# The Storage API doesn't echo it, but third-party proxies in
# front of customer instances sometimes do.
redacted = self._redact(body)
raise StorageApiError(
f"{method} {url} -> HTTP {resp.status_code}: {redacted}",
status=resp.status_code,
body=body,
)
if not isinstance(body, dict):
raise StorageApiError(
f"{method} {url} -> unexpected non-JSON response: {str(body)[:200]}",
status=resp.status_code,
body=body,
)
return body
def _redact(self, body: Any) -> str:
s = str(body)
if self.token and self.token in s:
s = s.replace(self.token, "<redacted-storage-token>")
return s[:500]
# ---- export-async + job polling ---------------------------------------
def export_table_async(self, table_id: str, params: dict) -> dict:
"""POST /v2/storage/tables/{table_id}/export-async — kicks off the
async export and returns the job resource. Caller polls `job.id`
via `wait_for_job` to find the file id when status='success'."""
return self._post(f"/tables/{table_id}/export-async", data=params)
def wait_for_job(
self,
job_id: int,
*,
timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
poll_interval: float = _DEFAULT_POLL_INTERVAL_SEC,
) -> dict:
"""Block until the async job reaches a terminal state. Returns the
job dict on success; raises `StorageApiError` on failure or timeout.
The poll interval starts small and backs off slightly so a chain of
~10 fast polls covers a sub-30 s job without flogging the API, while
a 30-min job ends up at a steady cadence after a few minutes.
"""
deadline = time.monotonic() + timeout
interval = poll_interval
while time.monotonic() < deadline:
job = self._get(f"/jobs/{job_id}")
status = job.get("status")
if status == "success":
return job
if status == "error":
raise StorageApiError(
f"Storage API job {job_id} reported error: "
f"{job.get('error') or job}",
body=job,
)
time.sleep(interval)
# Exponential backoff bounded at 10 s — a multi-minute Snowflake
# scan does not benefit from sub-second polls. 1.5 multiplier
# reaches 10 s after ~9 polls (~30 s wall-clock) and stays there.
interval = min(interval * 1.5, 10.0)
raise StorageApiError(
f"Storage API job {job_id} did not finish within {timeout}s"
)
# ---- file detail + signed-URL download --------------------------------
def file_detail(self, file_id: int) -> dict:
"""GET /v2/storage/files/{file_id}?federationToken=1 — returns the
file metadata plus a presigned URL (`url`) usable directly via HTTP
without any cloud SDK. For sliced exports the `url` resolves to a
manifest JSON listing the per-slice signed URLs."""
return self._get(f"/files/{file_id}", params={"federationToken": 1})
def download_file(self, file_info: dict, dest_path: Path) -> Path:
"""Download a Storage API file (single or sliced) to `dest_path`.
Backend variants:
- **AWS / Azure**: signed HTTPS URL in `file_info["url"]` (S3
presigned / SAS). Sliced manifest entries are signed HTTPS too.
Plain HTTP GET works.
- **GCP**: `file_info["url"]` is a signed HTTPS URL for the
single-file case. For sliced exports, the manifest at `url`
lists per-slice paths as `gs://<bucket>/<key>` (NOT signed)
requires GCS authentication. We use the OAuth access token from
`file_info["gcsCredentials"]["access_token"]` and hit the REST
endpoint
`https://storage.googleapis.com/storage/v1/b/<bucket>/o/<urlencoded_key>?alt=media`
with `Authorization: Bearer <token>`. No google-cloud-storage
SDK dependency.
Single-file: stream the signed URL directly, gunzipping if the
URL/name ends in `.gz`. Sliced: stream each slice into
`dest_path` in order (slice 0 has the CSV header per Storage
API contract, subsequent slices are header-less data).
"""
url = file_info.get("url")
if not url:
raise StorageApiError(
f"file detail missing 'url': {self._redact(file_info)}",
body=file_info,
)
is_sliced = bool(file_info.get("isSliced"))
# Gzip detection is name-based only. Snowflake UNLOAD adds the
# `.gz` suffix when compression is requested (CSV exports), and
# leaves it off otherwise (parquet has its own internal
# compression and is served as plain `.parquet`). The previous
# `isEncrypted is False` fallback gated on a property that's
# orthogonal to compression — it would have flagged parquet
# downloads as gzipped and corrupted them at gunzip time.
is_gzipped = file_info.get("name", "").endswith(".gz")
dest_path.parent.mkdir(parents=True, exist_ok=True)
if is_sliced:
# GCP sliced manifests carry `gs://` URIs that need an OAuth
# bearer; AWS / Azure carry signed HTTPS URLs that work
# without auth. The presence of `gcsCredentials` in the file
# detail signals a GCP backend.
gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
self._download_sliced(url, dest_path, gcs_token=gcs_token)
else:
self._download_single(url, dest_path, gunzip_on_read=is_gzipped)
return dest_path
def _download_single(
self,
url: str,
dest_path: Path,
*,
gunzip_on_read: bool,
extra_headers: Optional[dict] = None,
) -> None:
"""Stream a single signed URL (or GCS REST URL with bearer token
in `extra_headers`) into `dest_path`. Transparently gunzips if
the file name suggests it's a `.gz` — Storage API serves through
proxies that may rewrite Content-Encoding, so name-based
detection is more reliable than the header in practice."""
with self.session.get(
url,
stream=True,
timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC,
headers=extra_headers,
) as r:
r.raise_for_status()
tmp = dest_path.with_suffix(dest_path.suffix + ".part")
try:
with open(tmp, "wb") as fh:
for chunk in r.iter_content(chunk_size=64 * 1024):
if chunk:
fh.write(chunk)
if gunzip_on_read:
self._gunzip_in_place(tmp, dest_path)
tmp.unlink(missing_ok=True)
else:
tmp.replace(dest_path)
finally:
if tmp.exists():
tmp.unlink(missing_ok=True)
@staticmethod
def _gs_to_https(gs_url: str) -> str:
"""Rewrite `gs://<bucket>/<key>` to GCS JSON API media-download URL.
The JSON API requires the object name URL-encoded as a single
path segment (slashes inside the key are escaped). `alt=media`
switches the response from object metadata JSON to the actual
bytes matches what `bucket.blob(key).download_as_bytes()` does
in the google-cloud-storage SDK.
"""
from urllib.parse import quote
if not gs_url.startswith("gs://"):
raise ValueError(f"_gs_to_https expects gs://; got {gs_url!r}")
path = gs_url[5:] # strip "gs://"
bucket, _, key = path.partition("/")
if not bucket or not key:
raise ValueError(f"malformed gs:// URL: {gs_url!r}")
return (
f"https://storage.googleapis.com/storage/v1/b/{bucket}"
f"/o/{quote(key, safe='')}?alt=media"
)
def _download_sliced(
self, manifest_url: str, dest_path: Path, *, gcs_token: Optional[str] = None
) -> None:
"""Sliced exports: the file detail's `url` points at a JSON manifest
whose `entries[].url` lists per-slice locations. Download each slice
and concatenate into `dest_path`. The first slice contains the CSV
header (Storage API guarantees stable header positioning).
Per-slice URL forms:
- signed HTTPS (S3 presigned, Azure SAS) plain GET works.
- `gs://<bucket>/<key>` (GCP) requires `gcs_token` (OAuth bearer
shipped in the file_detail's `gcsCredentials.access_token`).
Mapped to `https://storage.googleapis.com/storage/v1/b/<bucket>/o/<encoded_key>?alt=media`.
"""
m = self.session.get(
manifest_url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC
)
m.raise_for_status()
manifest = m.json()
entries = manifest.get("entries") or []
if not entries:
raise StorageApiError(
f"sliced manifest had no entries: {str(manifest)[:200]}",
body=manifest,
)
with tempfile.TemporaryDirectory(
prefix="kbc-slice-", dir=get_temp_root(), ignore_cleanup_errors=True,
) as tmpdir:
slice_paths: List[Path] = []
for i, entry in enumerate(entries):
surl = entry.get("url")
if not surl:
raise StorageApiError(
f"slice {i} missing 'url': {str(entry)[:200]}",
body=entry,
)
sp = Path(tmpdir) / f"slice-{i:05d}"
# GCP backend: rewrite gs:// to GCS REST + bearer auth.
# The OAuth token comes from the file_detail's
# `gcsCredentials.access_token` (passed as `gcs_token`
# arg).
if surl.startswith("gs://"):
if not gcs_token:
raise StorageApiError(
f"slice {i} URL is gs:// but no gcs_token "
f"provided in file_detail.gcsCredentials"
)
surl = self._gs_to_https(surl)
extra_headers = {"Authorization": f"Bearer {gcs_token}"}
else:
extra_headers = None
# Slices may individually be gzipped — same heuristic as
# single-file: if the slice URL's path ends in `.gz`, gunzip
# after download.
gz = ".gz" in surl.split("?")[0].rsplit("/", 1)[-1]
self._download_single(
surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
)
slice_paths.append(sp)
# Concat. Sliced CSV exports include the header in slice 0 only
# (Storage API contract); subsequent slices are header-less.
with open(dest_path, "wb") as out:
for sp in slice_paths:
with open(sp, "rb") as fh:
shutil.copyfileobj(fh, out, length=64 * 1024)
@staticmethod
def _gunzip_in_place(src: Path, dest: Path) -> None:
with gzip.open(src, "rb") as gz, open(dest, "wb") as out:
shutil.copyfileobj(gz, out, length=64 * 1024)
# ---- high-level: export-async + poll, returning file metadata ---------
def prepare_export(
self,
table_id: str,
*,
export_filter: Optional[ExportFilter] = None,
export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
) -> dict:
"""Run export-async + wait_for_job + file_detail and return the
file metadata. Caller decides how to download (single vs
sliced) needed for the parquet path where sliced output must
be downloaded slice-by-slice and then DuckDB-merged (cat-style
concat would corrupt the per-slice parquet footers).
Returns:
{"job_id": int, "file_id": int, "rows": int|None,
"file_info": dict, "file_type": str}
"""
f = export_filter or ExportFilter()
params = f.to_export_params()
job_resp = self.export_table_async(table_id, params)
job_id = job_resp.get("id")
if not job_id:
raise StorageApiError(
f"export-async response missing job id: {self._redact(job_resp)}",
body=job_resp,
)
job = self.wait_for_job(job_id, timeout=export_timeout)
results = job.get("results") or {}
file_id = (results.get("file") or {}).get("id") or results.get("fileId")
if not file_id:
raise StorageApiError(
f"job {job_id} succeeded but had no result file: "
f"{self._redact(job)}",
body=job,
)
file_info = self.file_detail(file_id)
return {
"job_id": int(job_id),
"file_id": int(file_id),
"rows": (results.get("totalRowsCount")
or results.get("rowsCount")
or job.get("totalRowsCount")),
"file_info": file_info,
"file_type": f.file_type,
}
def download_file_slices(
self, file_info: dict, dest_dir: Path
) -> List[Path]:
"""Download a sliced Storage API export as separate per-slice
files into ``dest_dir``. Returns the slice paths in manifest
order. Use when the slices must be processed individually
(e.g. parquet each slice is a complete parquet file with its
own footer; concatenation would invalidate it). For CSV where
concat-with-header-only-on-first-slice is the right thing,
``download_file`` is the correct entry point.
"""
url = file_info.get("url")
if not url:
raise StorageApiError(
f"file detail missing 'url': {self._redact(file_info)}",
body=file_info,
)
if not file_info.get("isSliced"):
raise StorageApiError(
"download_file_slices called on a non-sliced file_info; "
"use download_file for the single-file case"
)
gcs_token = (file_info.get("gcsCredentials") or {}).get("access_token")
m = self.session.get(url, timeout=_DEFAULT_SLICE_DOWNLOAD_TIMEOUT_SEC)
m.raise_for_status()
manifest = m.json()
entries = manifest.get("entries") or []
if not entries:
raise StorageApiError(
f"sliced manifest had no entries: {str(manifest)[:200]}",
body=manifest,
)
dest_dir.mkdir(parents=True, exist_ok=True)
slice_paths: List[Path] = []
for i, entry in enumerate(entries):
surl = entry.get("url")
if not surl:
raise StorageApiError(
f"slice {i} missing 'url': {str(entry)[:200]}",
body=entry,
)
# Reuse the same gs:// rewrite + bearer + per-slice gz
# heuristics used by the concat path.
if surl.startswith("gs://"):
if not gcs_token:
raise StorageApiError(
f"slice {i} URL is gs:// but no gcs_token "
f"provided in file_detail.gcsCredentials"
)
surl = self._gs_to_https(surl)
extra_headers = {"Authorization": f"Bearer {gcs_token}"}
else:
extra_headers = None
gz = ".gz" in surl.split("?")[0].rsplit("/", 1)[-1]
sp = dest_dir / f"slice-{i:05d}"
self._download_single(
surl, sp, gunzip_on_read=gz, extra_headers=extra_headers,
)
slice_paths.append(sp)
return slice_paths
# ---- high-level: export to local file (csv or parquet) ----------------
def export_table(
self,
table_id: str,
dest_path: Path,
*,
export_filter: Optional[ExportFilter] = None,
export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
) -> dict:
"""End-to-end: export-async → poll → download to ``dest_path``.
``export_filter.file_type`` controls the format Storage API
materializes (``csv`` default, ``parquet`` when explicitly set).
``dest_path`` is the local file we write the bytes to; the caller
decides the extension. The downloader streams chunks to disk so
memory stays bounded regardless of file size.
For CSV the sliced case is handled transparently slices are
concatenated into ``dest_path`` (header in slice 0 only). For
**sliced parquet**, callers must use ``prepare_export`` +
``download_file_slices`` instead concatenating parquet slices
invalidates the per-slice footer. ``export_table`` will raise
StorageApiError if it sees a sliced parquet, to fail loud.
Returns a small stats dict so callers can log / record provenance:
{"job_id": int, "file_id": int, "rows": int|None, "bytes": int,
"file_type": str}
"""
prep = self.prepare_export(
table_id, export_filter=export_filter, export_timeout=export_timeout,
)
file_info = prep["file_info"]
if (
prep["file_type"] == FILE_TYPE_PARQUET
and file_info.get("isSliced")
):
raise StorageApiError(
f"sliced parquet export for {table_id}: use "
f"prepare_export + download_file_slices and merge with "
f"DuckDB COPY (concat would corrupt parquet footers)",
body=file_info,
)
self.download_file(file_info, dest_path)
size = dest_path.stat().st_size if dest_path.exists() else 0
return {
"job_id": prep["job_id"],
"file_id": prep["file_id"],
"rows": prep["rows"],
"bytes": size,
"file_type": prep["file_type"],
}
# Backwards-compat alias retained for external callers (e.g. ad-hoc
# scripts) that imported the old name. The behavior matches calling
# `export_table` with whatever `file_type` the export_filter carries
# — the *_to_csv suffix is now imprecise (Storage API can also serve
# parquet here), but renaming the import would force unrelated repos
# to coordinate. Prefer `export_table` in new code.
def export_table_to_csv(
self,
table_id: str,
dest_csv: Path,
*,
export_filter: Optional[ExportFilter] = None,
export_timeout: float = _DEFAULT_EXPORT_TIMEOUT_SEC,
) -> dict:
return self.export_table(
table_id,
dest_csv,
export_filter=export_filter,
export_timeout=export_timeout,
)

View file

@ -17,6 +17,13 @@ services:
env_file: .env
environment:
- DATA_DIR=/data
# Steer per-call tempdirs (Snowflake UNLOAD slice staging,
# CSV→parquet intermediates) onto the data volume. The container
# default ``/tmp`` lives on overlayfs (boot disk), which fills
# under multi-GiB sliced exports — see connectors/keboola/
# storage_api.py:get_temp_root. Operators can override per
# deployment via .env (or unset to fall back to system /tmp).
- AGNES_TEMP_DIR=${AGNES_TEMP_DIR:-/data/tmp}
healthcheck:
test: ["CMD", "curl", "-sf", "http://localhost:8000/api/health"]
interval: 30s
@ -39,6 +46,7 @@ services:
env_file: .env
environment:
- DATA_DIR=/data
- AGNES_TEMP_DIR=${AGNES_TEMP_DIR:-/data/tmp}
profiles:
- extract
@ -51,6 +59,7 @@ services:
env_file: .env
environment:
- DATA_DIR=/data
- AGNES_TEMP_DIR=${AGNES_TEMP_DIR:-/data/tmp}
- API_URL=http://app:8000
- SEED_ADMIN_EMAIL=${SEED_ADMIN_EMAIL:-}
depends_on:

View file

@ -67,19 +67,14 @@ dependencies = [
# ModuleNotFoundError boot loops on default Compose deploys.
"anthropic>=0.30.0",
"openai>=1.30.0",
# Legacy Keboola Storage API client. The primary Keboola path is the
# DuckDB community extension (`connectors/keboola/access.py`,
# `connectors/keboola/extractor.py`), but it routes scans through
# Keboola QueryService — and on projects whose Snowflake backend
# doesn't expose bucket schemas to the storage-token-derived role
# the extension fails with `Schema '..."in.c-..."' does not exist
# or not authorized` while the same token reads fine via the
# `/v2/storage/tables/{id}/data-preview` REST endpoint
# (keboola/duckdb-extension#17). The extractor has had a `kbcstorage`
# fallback path since the extension landed
# (`connectors/keboola/extractor.py:_extract_via_legacy`); making
# the dep core means that fallback is actually importable in the
# default install instead of crashing with `ModuleNotFoundError`.
# Keboola Storage API SDK — used by:
# - `connectors/keboola/client.py` for admin-side bucket / table list
# (consumed from `app/api/admin.py` discover-and-register, table
# metadata refresh).
# Extraction itself uses the lightweight `connectors/keboola/storage_api.py`
# module (export-async + signed-URL download) which talks to Storage API
# directly via `requests` — no SDK dependency on the data-path side. The
# SDK stays for the metadata reads.
"kbcstorage>=0.9.0",
]

View file

@ -134,12 +134,54 @@ fi
BEFORE=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
docker compose "${COMPOSE_FILES[@]}" pull >/dev/null 2>&1
AFTER=$(docker images --no-trunc --format '{{.Digest}}' "$IMAGE" | head -1)
if [ "$BEFORE" != "$AFTER" ] || [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ]; then
REASON=()
[ "$BEFORE" != "$AFTER" ] && REASON+=("image digest")
[ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ] && REASON+=("config files")
# Sync-in-flight defer guard. ``docker compose up -d`` recreates the
# uvicorn worker, which kills any in-flight extractor / materialized
# pass that was holding ``_sync_lock``. The next 5-min cron tick
# picks up the same change — we just delay the upgrade until the
# current sync finishes (typically minutes for small tables, longer
# for big Snowflake UNLOADs). curl with a 5s timeout: if the app is
# unreachable for any reason (already crashed, port not bound,
# older app version without /api/sync/status), we proceed with the
# upgrade — being stuck on a wedged previous version is worse than
# interrupting a hypothetical sync.
LOCK_JSON=$(curl -sf --max-time 5 http://localhost:8000/api/sync/status 2>/dev/null || true)
if echo "$LOCK_JSON" | grep -q '"locked"[[:space:]]*:[[:space:]]*true'; then
echo "$(date): sync in flight (${REASON[*]} pending) — deferring recreate to next tick"
logger -t agnes-auto-upgrade "deferred recreate: sync in flight (${REASON[*]})"
exit 0
fi
echo "$(date): change detected (${REASON[*]}) — recreating containers"
# Re-align ownership of mounted state to the image's runtime user
# before bringing containers up. Catches root → non-root UID
# transitions across upgrades — old root-owned files would otherwise
# cause PermissionError on .session_secret / DuckDB on the new
# image's first start. Idempotent (no-op when ownership already
# matches). The Dockerfile pins runtime to uid:gid 999:999 today
# (`useradd --system --uid 999 ... agnes`); read it back from the
# image config to stay honest if that ever changes. Only relevant
# when the image digest actually changed.
if [ "$BEFORE" != "$AFTER" ]; then
IMAGE_USER=$(docker image inspect -f '{{.Config.User}}' "$IMAGE" 2>/dev/null || true)
if [ -n "$IMAGE_USER" ] && [ "$IMAGE_USER" != "root" ] && [ "$IMAGE_USER" != "0" ]; then
# IMAGE_USER may be "agnes" (name) or "999" or "999:999".
# Resolve via /etc/passwd inside the image — works without
# requiring a shell in the runtime layer.
IMAGE_UIDGID=$(docker run --rm --entrypoint cat "$IMAGE" /etc/passwd 2>/dev/null \
| awk -F: -v u="${IMAGE_USER%%:*}" '$1==u || $3==u {print $3":"$4; exit}')
if [ -n "$IMAGE_UIDGID" ]; then
for d in "$STATE_DIR" /data/extracts /data/analytics; do
[ -d "$d" ] && chown -R "$IMAGE_UIDGID" "$d" 2>/dev/null || true
done
fi
fi
fi
# ${arr[@]+"${arr[@]}"} pattern: expands to nothing when array is
# empty (vs. plain "${arr[@]}" which trips `set -u` on bash <4.4).
docker compose "${COMPOSE_FILES[@]}" ${PROFILE_ARGS[@]+"${PROFILE_ARGS[@]}"} up -d

View file

@ -39,7 +39,7 @@ def _maybe_instrument(con, db_tag: str):
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
SCHEMA_VERSION = 25
SCHEMA_VERSION = 26
_SYSTEM_SCHEMA = """
CREATE TABLE IF NOT EXISTS schema_version (
@ -1803,6 +1803,33 @@ _V24_TO_V25_MIGRATIONS = [
]
# v26: unify Keboola query_mode='local' rows into 'materialized'.
#
# The old `local` flow ran the DuckDB Keboola extension's COPY through
# QueryService — which is unreliable on linked-bucket projects (and was
# wholly broken pre-v0.1.6 of the extension). The new `materialized`
# flow uses the Storage API export-async path directly:
# POST /v2/storage/tables/<id>/export-async
# GET /v2/storage/jobs/<id> (poll)
# GET /v2/storage/files/<id>?federationToken=1 (signed URL)
# download → CSV → parquet
# That works regardless of project flags, and a NULL `source_query`
# means "full table export" — same effective behavior the `local` mode
# previously gave.
#
# Existing Keboola rows registered as `query_mode='local'` are flipped
# to 'materialized'; their source_query stays NULL (full table). Jira
# and BigQuery 'local' rows are untouched (this connector still uses
# its own path).
_V25_TO_V26_MIGRATIONS = [
"""
UPDATE table_registry
SET query_mode = 'materialized'
WHERE source_type = 'keboola' AND query_mode = 'local'
""",
]
# v24: rewrite materialized BQ source_query from DuckDB-flavor
# (bq."<dataset>"."<table>") to BigQuery-native (`<project>.<dataset>.<table>`)
# so the new connectors.bigquery.extractor.materialize_query wrapping
@ -2056,6 +2083,9 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
if current < 25:
for sql in _V24_TO_V25_MIGRATIONS:
conn.execute(sql)
if current < 26:
for sql in _V25_TO_V26_MIGRATIONS:
conn.execute(sql)
conn.execute(
"UPDATE schema_version SET version = ?, applied_at = current_timestamp",
[SCHEMA_VERSION],

View file

@ -654,7 +654,23 @@ class SyncOrchestrator:
logger.error("Failed to attach remote source %s: %s", alias, e)
def _update_sync_state(self, meta_rows: list, source_name: str) -> None:
"""Update sync_state table in system.duckdb from _meta entries."""
"""Update sync_state table in system.duckdb from _meta entries.
The hash stored here MUST match what `agnes pull` computes
client-side via `cli/commands/sync.py:_md5_file` and what the
materialized SQL path stores via `app/api/sync.py:_file_hash`
otherwise the CLI's post-download integrity check fails for every
local-mode table with `hash mismatch: expected got `. That's
a full content MD5 (`hashlib.md5(bytes).hexdigest()`), no
truncation.
Pre-fix this method computed `md5(f"{mtime_ns}:{size}")[:12]`
a fingerprint, not a content hash, and 12-char truncated to boot
which the CLI's full-32-char content MD5 could never match.
Symptom: `agnes pull` failed with hash mismatch on every Keboola
local-mode table because their sync_state hashes came from this
path while their on-disk content was unrelated.
"""
try:
from src.db import get_system_db
from src.repositories.sync_state import SyncStateRepository
@ -664,15 +680,14 @@ class SyncOrchestrator:
try:
repo = SyncStateRepository(sys_conn)
for table_name, rows, size_bytes, query_mode in meta_rows:
# Compute hash from parquet file stats (fast, no file read)
pq_path = extracts_dir / source_name / "data" / f"{table_name}.parquet"
file_hash = ""
if pq_path.exists():
stat = pq_path.stat()
file_hash = hashlib.md5(
f"{stat.st_mtime_ns}:{stat.st_size}".encode()
).hexdigest()[:12]
else:
file_hash = ""
h = hashlib.md5()
with open(pq_path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
h.update(chunk)
file_hash = h.hexdigest()
repo.update_sync(
table_id=table_name,

View file

@ -197,7 +197,7 @@ class TableInfo:
self,
table_id: str,
name: str,
description: str,
description: str = "",
primary_key: str = "",
sync_strategy: str = "none",
foreign_keys: Optional[List[ForeignKeyInfo]] = None,

View file

@ -174,7 +174,10 @@ def is_valid_schedule(schedule: Optional[str]) -> bool:
"""Return True iff ``schedule`` parses as a documented schedule string.
Accepted forms (mirroring the rest of this module):
- ``"every Nm"`` / ``"every Nh"`` with N a positive integer
- ``"every Nm"`` / ``"every Nh"`` with N a non-negative integer
(``every 0m`` = always due, useful for force-resync of a row whose
previous attempt errored without recording last_sync bypasses
the rate limit on the next ``/api/sync/trigger`` tick)
- ``"daily HH:MM"`` (24-h, UTC) optionally comma-separated:
``"daily 07:00,13:00"``
@ -187,7 +190,7 @@ def is_valid_schedule(schedule: Optional[str]) -> bool:
return False
interval = parse_interval_minutes(schedule)
if interval is not None:
return interval > 0
return interval >= 0
match = DAILY_PATTERN.match(schedule)
if not match:
return False

View file

@ -2137,12 +2137,16 @@ def test_register_request_accepts_valid_sync_schedule(schedule):
@pytest.mark.parametrize("schedule", [
"hourly",
"every 0m",
"daily 25:00",
"every 5x",
" ",
])
def test_register_request_rejects_malformed_sync_schedule(schedule):
# `every 0m` was previously here as a rejected value; the runtime
# now accepts it as the "always due" force-resync override (used
# to bypass the default 1h interval on a row whose previous
# attempt errored without recording last_sync). See group A
# commit `fix: tempdir leak cleanup, every 0m schedule, ...`.
with pytest.raises(ValidationError) as exc_info:
RegisterTableRequest(name="orders", sync_schedule=schedule)
assert "sync_schedule" in str(exc_info.value)

View file

@ -0,0 +1,231 @@
"""Pure-function tests for the Keboola auto-discovery planner.
The planner walks the discovered table list and the live registry,
classifying each entry as ``new`` / ``existing_match`` /
``existing_drift`` / ``invalid`` so the caller can decide whether
to write. The endpoint and CLI compose on top of this; the planner
itself touches no external services.
Two real-world incidents drove this split:
- kbc_job arrived with the wrong bucket from a manual registration
(``in.c-keboola-storage`` instead of ``in.c-kbc_telemetry``); a
naive auto-discover re-run would have overwritten the admin's
correction. The planner now classifies that as ``existing_drift``
and the writer skips it, surfacing the divergence in the response.
- Earlier auto-discover bug stripped the stage prefix off bucket ids
(e.g. ``c-finance`` instead of ``in.c-finance``), inserting 137
rows whose Storage API export-async calls all 404'd. The planner
now uses the Keboola API's authoritative ``bucket_id`` field
directly, falling back to id-string parsing only when the field
isn't present.
"""
from __future__ import annotations
from unittest.mock import MagicMock
import pytest
from app.api.admin import _build_keboola_discovery_plan, _split_keboola_table_id
# ---- _split_keboola_table_id (id parser fallback) --------------------------
class TestSplitKeboolaTableId:
def test_three_segment_canonical(self):
# in.c-finance.orders → bucket=in.c-finance, table=orders
assert _split_keboola_table_id("in.c-finance.orders") == (
"in.c-finance", "orders",
)
def test_three_segment_with_dotted_table_name(self):
# Bucket-id always c-<word>; treat anything trailing as table.
# 4-segment id → bucket = first three joined, table = last.
assert _split_keboola_table_id("in.c-x.foo.bar") == (
"in.c-x.foo", "bar",
)
def test_two_segment_no_stage(self):
# Defensive: id missing the stage prefix → use what we have.
assert _split_keboola_table_id("c-foo.events") == (
"c-foo", "events",
)
def test_one_segment_falls_back_to_name(self):
bucket, table = _split_keboola_table_id("orphan", fallback_name="orphan_t")
assert bucket == ""
assert table == "orphan_t"
def test_empty_string_safe(self):
bucket, table = _split_keboola_table_id("", fallback_name="x")
assert bucket == ""
assert table == "x"
# ---- _build_keboola_discovery_plan -----------------------------------------
def _make_repo(rows: dict[str, dict]):
"""Build a stub repo whose `.get(table_id)` returns the row in `rows`
(or None). Mirror of TableRegistryRepository's lookup-by-id surface."""
repo = MagicMock()
repo.get.side_effect = lambda tid: rows.get(tid)
return repo
@pytest.fixture(autouse=True)
def stub_table_registry(monkeypatch):
"""The planner instantiates `TableRegistryRepository(conn)` itself.
Patch the class to return whatever fixture-provided repo we set up
via ``request.param``-style indirection but here a simpler
pattern: per-test setup attaches the rows dict to a module-level
cache that the fake reads."""
state = {"rows": {}}
class _FakeRepo:
def __init__(self, conn): pass
def get(self, tid): return state["rows"].get(tid)
def list_all(self):
# The planner uses list_all() once at the top to build the
# name→row index for collision detection. Stamp `source_type`
# on every row so the planner's keboola filter accepts them.
return [
{**v, "id": k, "source_type": v.get("source_type", "keboola")}
for k, v in state["rows"].items()
]
monkeypatch.setattr("app.api.admin.TableRegistryRepository", _FakeRepo)
return state
def test_plan_buckets_new_existing_match_drift_and_invalid(stub_table_registry):
"""Single test exercising the four buckets at once — easier to read
than four separate tests; failures here surface the exact bucket
that misclassified.
Drift here is the **same-id, different-coords** flavour: registry
has a row at id `kbc_organization` with stale bucket; discovery
would write a different bucket under the same id. (The
name-collision flavour gets its own test below.)"""
stub_table_registry["rows"] = {
# existing_match: registry agrees with discovery
"in_c-sales_orders": {
"name": "orders", "bucket": "in.c-sales", "source_table": "orders",
},
# existing_drift (same id): admin migrated bucket post-registration
"in_c-kbc_telemetry_kbc_organization": {
"name": "kbc_organization",
"bucket": "in.c-OLD-bucket", "source_table": "kbc_organization",
},
}
discovered = [
# new
{"id": "in.c-sales.invoices", "name": "invoices",
"bucket_id": "in.c-sales"},
# existing_match
{"id": "in.c-sales.orders", "name": "orders",
"bucket_id": "in.c-sales"},
# existing_drift (same id, different bucket)
{"id": "in.c-kbc_telemetry.kbc_organization", "name": "kbc_organization",
"bucket_id": "in.c-kbc_telemetry"},
# invalid — empty id
{"id": "", "name": "broken",
"bucket_id": ""},
]
plan = _build_keboola_discovery_plan(MagicMock(), discovered)
assert [e["table_id"] for e in plan["new"]] == ["in_c-sales_invoices"]
assert [e["table_id"] for e in plan["existing_match"]] == ["in_c-sales_orders"]
assert len(plan["existing_drift"]) == 1
drift = plan["existing_drift"][0]
assert drift["drift_kind"] == "same_id_diff_coords"
assert drift["table_id"] == "in_c-kbc_telemetry_kbc_organization"
assert drift["bucket"] == "in.c-kbc_telemetry"
assert drift["registry_bucket"] == "in.c-OLD-bucket"
assert len(plan["invalid"]) == 1
assert plan["invalid"][0]["full_id"] == ""
def test_plan_drift_via_name_collision_kbc_job_real_world(stub_table_registry):
"""Real-world incident: kbc_job was registered manually as
``id='kbc_job', name='kbc_job', bucket='in.c-kbc_telemetry'``;
Keboola's auto-discovery exposes the same logical table at
``id='in.c-keboola-storage.job', name='job'``. Without
name-collision detection, the planner would have classified the
discovered row as `new` and inserted a duplicate whose Storage
API export-async 404s.
With the fix, planner detects the **discovered.name == registry.name**
(case-insensitive) collision, classifies as drift, surfaces the
`registry_id` so an operator can reconcile."""
stub_table_registry["rows"] = {
"kbc_job": {
"name": "kbc_job",
"bucket": "in.c-kbc_telemetry", "source_table": "kbc_job",
},
}
discovered = [
{"id": "in.c-keboola-storage.kbc_job", "name": "kbc_job",
"bucket_id": "in.c-keboola-storage"},
]
plan = _build_keboola_discovery_plan(MagicMock(), discovered)
assert plan["new"] == [], (
"duplicate kbc_job must NOT be in new bucket — would 404 at "
"next sync and clobber operator alerting"
)
assert len(plan["existing_drift"]) == 1
drift = plan["existing_drift"][0]
assert drift["drift_kind"] == "name_collision"
assert drift["table_id"] == "in_c-keboola-storage_kbc_job"
assert drift["registry_id"] == "kbc_job"
assert drift["bucket"] == "in.c-keboola-storage"
assert drift["registry_bucket"] == "in.c-kbc_telemetry"
def test_plan_prefers_api_bucket_id_over_id_parse(stub_table_registry):
"""Authoritative source for bucket is the API's `bucket_id` field
(when present). Pre-fix, the parser stripped the stage prefix and
inserted 137 broken rows using `bucket_id` directly avoids that
class of bug entirely."""
discovered = [
# bucket_id explicit + present, full id agrees: trivial
{"id": "in.c-x.t1", "name": "t1", "bucket_id": "in.c-x"},
# bucket_id present, full id messy / unreliable — bucket_id wins
{"id": "weird-id-without-dots", "name": "t2", "bucket_id": "in.c-y"},
]
plan = _build_keboola_discovery_plan(MagicMock(), discovered)
by_id = {e["table_id"]: e for e in plan["new"]}
assert by_id["in_c-x_t1"]["bucket"] == "in.c-x"
assert by_id["weird-id-without-dots"]["bucket"] == "in.c-y"
assert by_id["weird-id-without-dots"]["source_table"] == "t2"
def test_plan_falls_back_to_parser_when_bucket_id_missing(stub_table_registry):
"""Older Keboola SDK fallback path doesn't return `bucket_id`.
Plan must still produce a usable bucket via the id parser."""
discovered = [
{"id": "in.c-z.events", "name": "events"}, # no bucket_id field
]
plan = _build_keboola_discovery_plan(MagicMock(), discovered)
assert plan["new"][0]["bucket"] == "in.c-z"
assert plan["new"][0]["source_table"] == "events"
def test_plan_drift_skips_overwrite(stub_table_registry):
"""The plan classifies drift; the writer's contract (separate test
in test_admin_configure_api.py against the endpoint) is to NOT
overwrite. Verify here that drifted rows are NOT in the `new`
bucket (which is the only bucket the writer iterates)."""
stub_table_registry["rows"] = {
"in_c-sales_orders": {
"bucket": "in.c-OLD", "source_table": "orders_renamed",
},
}
discovered = [
{"id": "in.c-sales.orders", "name": "orders",
"bucket_id": "in.c-sales"},
]
plan = _build_keboola_discovery_plan(MagicMock(), discovered)
assert plan["new"] == []
assert len(plan["existing_drift"]) == 1

View file

@ -46,7 +46,14 @@ def test_register_keboola_materialized_accepts_source_query(seeded_app):
assert r.status_code == 201, r.text
def test_register_keboola_materialized_rejects_missing_source_query(seeded_app):
def test_register_keboola_materialized_accepts_missing_source_query(seeded_app):
"""A NULL source_query on a keboola materialized row means
'full-table export via Storage API export-async' no SQL needed.
The admin path must accept it. (BigQuery materialized has the same
no-source-query semantic via the SELECT *-from-bucket auto-fill in
the BQ branch of register_table; for keboola the export-async API
takes a structured filter, not a SQL string, so we just persist
NULL and let the extractor pass an empty ExportFilter.)"""
c = seeded_app["client"]
token = seeded_app["admin_token"]
auth = {"Authorization": f"Bearer {token}"}
@ -57,11 +64,12 @@ def test_register_keboola_materialized_rejects_missing_source_query(seeded_app):
"name": "orders_recent",
"source_type": "keboola",
"query_mode": "materialized",
# source_query missing
"bucket": "in.c-sales",
"source_table": "orders",
# source_query intentionally omitted.
},
)
assert r.status_code == 422
assert "source_query" in r.text
assert r.status_code == 201, r.text
def test_register_keboola_materialized_skips_bucket_check(seeded_app):

View file

@ -13,10 +13,14 @@ import duckdb
from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version
def test_schema_version_is_25():
def test_schema_version_is_26():
# bumped 24→25 for the Store + opt-out tables backing /store + /my-ai-stack
# (24 was the materialized BQ source_query rewrite migration)
assert SCHEMA_VERSION == 25
# bumped 25→26 to migrate Keboola query_mode='local' rows to
# 'materialized' — the local mode for Keboola is gone now that the
# extractor talks Storage API directly via signed URLs (NULL
# source_query = full-table export, same effective behavior).
assert SCHEMA_VERSION == 26
def test_v20_adds_source_query(tmp_path):

View file

@ -131,7 +131,7 @@ class TestKeboolaExtractor:
# No parquet file should exist
assert not (Path(output_dir) / "data" / "big_table.parquet").exists()
def test_handles_extraction_failure(self, output_dir, sample_configs):
def test_handles_extraction_failure(self, output_dir, sample_configs, monkeypatch):
"""Test that a failed table doesn't stop other tables from extracting."""
from connectors.keboola.extractor import run
@ -144,8 +144,18 @@ class TestKeboolaExtractor:
# Second call succeeds
_write_parquet(pq_path, "SELECT 1 AS id")
# Mock the legacy fallback too — without it the real client
# attempts an HTTPS round-trip to the test URL and hangs ~minute.
# Force inline (PARALLELISM=1) so the mock survives — the parallel
# path would spawn a subprocess that doesn't see the patch.
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
def legacy_reraise(tc, pq_path, url, token):
raise Exception("Network error")
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
patch("connectors.keboola.extractor._extract_via_extension", side_effect=side_effect):
patch("connectors.keboola.extractor._extract_via_extension", side_effect=side_effect), \
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=legacy_reraise):
result = run(output_dir, sample_configs, "https://example.com", "test-token")
assert result["tables_extracted"] == 1
@ -299,7 +309,7 @@ class TestKeboolaExtractorFailureModes:
# The bad table's parquet file should NOT exist (failed before write)
assert not (Path(output_dir) / "data" / "bad_table.parquet").exists()
def test_network_timeout_during_extraction(self, output_dir):
def test_network_timeout_during_extraction(self, output_dir, monkeypatch):
"""Network timeout during extraction should return a meaningful error
in the stats, not crash the whole process."""
from connectors.keboola.extractor import run
@ -323,6 +333,11 @@ class TestKeboolaExtractorFailureModes:
# so we observe the final error surface; the contract under test is
# "extension failure doesn't crash, error makes it into stats, other
# tables continue", not which path produced the message.
# Force PARALLELISM=1 so the mock survives — the parallel path uses
# ProcessPoolExecutor which spawns subprocesses that don't see the
# mock.
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
def legacy_reraise(tc, pq_path, url, token):
raise socket.timeout("Connection timed out")
@ -389,7 +404,7 @@ class TestKeboolaExtractorFailureModes:
assert val[0] == 7
conn.close()
def test_all_tables_fail_returns_full_failure_stats(self, output_dir):
def test_all_tables_fail_returns_full_failure_stats(self, output_dir, monkeypatch):
"""When every table fails, the extractor returns all failures in stats
without crashing."""
from connectors.keboola.extractor import run
@ -403,7 +418,12 @@ class TestKeboolaExtractorFailureModes:
raise RuntimeError("Extraction failed")
# Mock legacy too — otherwise it would attempt a real HTTP call to
# the fake URL on each per-table fallback retry.
# the fake URL on each per-table fallback retry. Force inline mode
# (AGNES_KEBOOLA_PARALLELISM=1) so the mock survives — the parallel
# path uses ProcessPoolExecutor which spawns subprocesses that
# don't see the mock.
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
def legacy_also_fails(tc, pq_path, url, token):
raise RuntimeError("Extraction failed")
@ -416,6 +436,108 @@ class TestKeboolaExtractorFailureModes:
assert result["tables_failed"] == 2
assert len(result["errors"]) == 2
def test_legacy_parallelism_one_runs_inline(self, output_dir, monkeypatch):
"""AGNES_KEBOOLA_PARALLELISM=1 keeps the legacy fallback inline —
no ProcessPoolExecutor, so unittest.mock patches survive. Useful
as a debugging escape hatch and the path used by tests below.
Why processes (not threads) for the parallel path: the legacy
client's `export_table` does `os.chdir(temp_dir)` to direct
kbcstorage's slice-file downloads into a per-call temp directory.
`os.chdir` is process-global, so two threads racing on it land
slice files in the wrong directory and the merge step fails with
`[Errno 2] No such file or directory`. Process workers each have
their own CWD and don't interfere."""
from connectors.keboola.extractor import run
configs = [
{"name": f"u{i}", "query_mode": "local", "description": "",
"bucket": "in.c-test", "source_table": f"u{i}"}
for i in range(3)
]
call_count = 0
def mock_legacy(tc, pq_path, url, token):
nonlocal call_count
call_count += 1
_write_parquet(pq_path, "SELECT 1 AS x")
def extension_always_fails(conn, tc, pq_path):
raise RuntimeError("Schema not authorized")
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
patch("connectors.keboola.extractor._extract_via_extension", side_effect=extension_always_fails), \
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy):
result = run(output_dir, configs, "https://example.com", "test-token")
assert result["tables_extracted"] == 3
assert result["tables_failed"] == 0
assert call_count == 3, "all 3 tables should have called the patched legacy fn"
def test_legacy_uses_process_pool_when_parallel(self, output_dir, monkeypatch):
"""Structural check: when len(legacy_queue) > 1 and
AGNES_KEBOOLA_PARALLELISM > 1, the parallel path imports
`concurrent.futures.ProcessPoolExecutor` to drain the queue.
The mock can't ride into subprocesses (mocks aren't picklable),
so we patch ProcessPoolExecutor itself and verify it's invoked
with the expected worker count."""
from connectors.keboola.extractor import run
configs = [
{"name": f"t{i}", "query_mode": "local", "description": "",
"bucket": "in.c-test", "source_table": f"t{i}"}
for i in range(5)
]
def extension_always_fails(conn, tc, pq_path):
raise RuntimeError("Schema not authorized")
# Stand in for ProcessPoolExecutor — runs everything in-process
# so we can verify the call shape without dealing with pickling.
seen_max_workers = []
class _FakePool:
def __init__(self, max_workers):
seen_max_workers.append(max_workers)
def __enter__(self):
return self
def __exit__(self, *exc):
return False
def submit(self, fn, *args, **kwargs):
from concurrent.futures import Future
f: Future = Future()
try:
# Inline the legacy call so the parquet ends up on disk
# for the orchestrator's downstream stat + _meta logic.
f.set_result(fn(*args, **kwargs))
except Exception as e:
f.set_exception(e)
return f
def mock_legacy(tc, pq_path, url, token):
_write_parquet(pq_path, "SELECT 1 AS x")
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "4")
with patch("connectors.keboola.extractor._try_attach_extension", side_effect=_mock_attach), \
patch("connectors.keboola.extractor._extract_via_extension", side_effect=extension_always_fails), \
patch("connectors.keboola.extractor._extract_via_legacy", side_effect=mock_legacy), \
patch("concurrent.futures.ProcessPoolExecutor", _FakePool):
result = run(output_dir, configs, "https://example.com", "test-token")
assert result["tables_extracted"] == 5
assert result["tables_failed"] == 0
assert seen_max_workers == [4], (
f"Expected ProcessPoolExecutor(max_workers=4); got {seen_max_workers}"
)
def test_unsafe_identifier_skipped_not_crashed(self, output_dir):
"""Tables with unsafe identifiers are skipped with an error in stats,
not causing a crash."""

View file

@ -77,14 +77,22 @@ class TestSyncApiPartialFailureHandling:
from unittest.mock import MagicMock, patch
from app.api import sync as sync_mod
def fake_run(*args, **kwargs):
return MagicMock(
returncode=returncode, stdout="{}", stderr="",
)
# subprocess is imported locally inside _run_sync; patching the
# real module's run() works because Python's module cache means
# both call sites resolve to the same object.
monkeypatch.setattr(subprocess_real, "run", fake_run)
# _run_sync now uses subprocess.Popen (with start_new_session=True
# so the timeout path can SIGTERM the whole process group, including
# ProcessPoolExecutor workers spawned by the parallel legacy
# fallback). Patch Popen with a stand-in whose .communicate()
# returns immediately with the injected returncode — covers both
# the "happy path" (no timeout fired) and exit-code mapping.
class _FakePopen:
def __init__(self_inner, cmd, **kwargs):
self_inner.cmd = cmd
self_inner.returncode = returncode
self_inner.pid = 999
def communicate(self_inner, input=None, timeout=None):
return ("{}", "")
monkeypatch.setattr(subprocess_real, "Popen", _FakePopen)
# SyncOrchestrator is imported as `from src.orchestrator import
# SyncOrchestrator` inside _run_sync, so patching sync_mod

View file

@ -185,7 +185,7 @@ class TestKeboolaExtractorFull:
conn.close()
assert row[0] == "remote"
def test_partial_failure_continues(self, output_dir, sample_local_configs):
def test_partial_failure_continues(self, output_dir, sample_local_configs, monkeypatch):
"""A single table failure should not abort remaining tables."""
from connectors.keboola.extractor import run
@ -201,7 +201,11 @@ class TestKeboolaExtractorFull:
# _extract_via_legacy. Mock it to re-raise so we observe the final
# error surface; the contract under test is "single table failure
# doesn't abort remaining tables", not which path produced the
# message.
# message. Force inline mode (AGNES_KEBOOLA_PARALLELISM=1) so the
# mock survives — the parallel path uses ProcessPoolExecutor which
# spawns subprocesses that don't see the mock.
monkeypatch.setenv("AGNES_KEBOOLA_PARALLELISM", "1")
def legacy_reraise(tc, pq_path, url, token):
raise RuntimeError("Connection reset")

View file

@ -1,41 +1,105 @@
"""Tests for the Keboola materialize_query path."""
"""Tests for the Keboola materialize_query path.
Surface contract: takes ``bucket`` + ``source_table`` (+ optional
``source_query`` JSON filter spec), exports via Storage API, writes a
parquet, returns the same {table_id, path, rows, bytes, md5} shape the
BQ branch returns. We mock `KeboolaStorageClient` so tests don't hit
the network the real Storage API client is exercised in
tests/test_keboola_storage_api.py.
The default code path is now **parquet** (Storage API serves Snowflake
UNLOAD output directly; the extractor renames into place no CSV
intermediate, no DuckDB COPY of full file). Tests cover both the
default parquet path and the legacy CSV opt-in (via
``source_query='{"file_type":"csv"}'``).
"""
import hashlib
import pytest
import os
from pathlib import Path
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch
import duckdb
import pytest
from connectors.keboola import extractor as kbe
def test_materialize_query_writes_parquet_and_returns_metadata(tmp_path, monkeypatch):
"""Mock-mode: feed in a fake KeboolaAccess that yields a fake DuckDB
connection accepting `COPY ... TO '...' (FORMAT PARQUET)` and just
writes a small parquet via duckdb's own primitive on a tmp DB.
"""
import duckdb
real_conn = duckdb.connect(":memory:")
# Pre-create a small relation the fake materialize "copies".
real_conn.execute("CREATE TABLE t AS SELECT 1 AS x, 'hello' AS y UNION ALL SELECT 2, 'world'")
def _write_parquet(dest: Path, n_rows: int = 2) -> None:
"""Drop a tiny real parquet at ``dest`` so the materialize path can
read it back to compute row_count + MD5 same shape Snowflake
UNLOAD would produce."""
dest.parent.mkdir(parents=True, exist_ok=True)
safe = str(dest).replace("'", "''")
conn = duckdb.connect()
try:
conn.execute(
f"COPY (SELECT * FROM (VALUES {','.join('(' + str(i) + ')' for i in range(n_rows))}) AS t(id)) "
f"TO '{safe}' (FORMAT PARQUET)"
)
finally:
conn.close()
class FakeAccess:
def duckdb_session(self):
from contextlib import contextmanager
@contextmanager
def _cm():
yield real_conn
return _cm()
fake_access = FakeAccess()
def _seed_csv(dest: Path, header: str, rows: list[str]) -> None:
"""Write a tiny CSV the legacy CSV materialize path will convert to parquet."""
dest.parent.mkdir(parents=True, exist_ok=True)
dest.write_text("\n".join([header, *rows]) + "\n", encoding="utf-8")
@pytest.fixture
def fake_storage_client_parquet():
"""Mock for the **default** parquet path. ``prepare_export`` returns a
file_info marking a single (non-sliced) file. ``download_file``
writes a real 2-row parquet at the requested dest."""
def fake_prepare(table_id, *, export_filter=None, export_timeout=None):
return {
"job_id": 100,
"file_id": 200,
"rows": 2,
"file_info": {"id": 200, "url": "https://fake/x", "isSliced": False},
"file_type": "parquet",
}
def fake_download(file_info, dest_path):
_write_parquet(Path(dest_path), n_rows=2)
return Path(dest_path)
client = MagicMock()
client.prepare_export.side_effect = fake_prepare
client.download_file.side_effect = fake_download
return client
@pytest.fixture
def fake_storage_client_csv():
"""Mock for the legacy CSV opt-in path. ``export_table`` writes a
small CSV at dest. Used for tests that pin
``source_query='{"file_type":"csv"}'``."""
def fake_export(table_id, dest, *, export_filter=None, export_timeout=None):
_seed_csv(Path(dest), "id,name", ["1,alpha", "2,beta"])
return {"job_id": 100, "file_id": 200, "rows": 2,
"bytes": Path(dest).stat().st_size, "file_type": "csv"}
client = MagicMock()
client.export_table.side_effect = fake_export
return client
# ---- default parquet path --------------------------------------------------
def test_materialize_query_writes_parquet_and_returns_metadata(
tmp_path, fake_storage_client_parquet
):
"""Default path: no source_query → file_type=parquet, single file."""
output_dir = tmp_path / "out"
output_dir.mkdir()
# Submit a query that selects from the in-memory table (not a real
# Keboola bucket — the test verifies the COPY/parquet/hash path,
# not the extension behavior).
result = kbe.materialize_query(
table_id="example_subset",
sql="SELECT * FROM t",
keboola_access=fake_access,
bucket="in.c-sales",
source_table="orders",
source_query=None,
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
@ -45,23 +109,83 @@ def test_materialize_query_writes_parquet_and_returns_metadata(tmp_path, monkeyp
assert result["path"] == str(parquet_path)
assert result["rows"] == 2
assert result["bytes"] > 0
# MD5 of the bytes should match what we recompute.
expected_md5 = hashlib.md5(parquet_path.read_bytes()).hexdigest()
assert result["md5"] == expected_md5
# Default file_type should be parquet — verify by inspecting the
# ExportFilter passed to prepare_export.
call_args = fake_storage_client_parquet.prepare_export.call_args
assert call_args.args[0] == "in.c-sales.orders"
assert call_args.kwargs["export_filter"].file_type == "parquet"
def test_materialize_query_zero_rows_logs_warning(tmp_path, caplog):
import duckdb
real_conn = duckdb.connect(":memory:")
real_conn.execute("CREATE TABLE t AS SELECT 1 AS x WHERE FALSE")
class FakeAccess:
def duckdb_session(self):
from contextlib import contextmanager
@contextmanager
def _cm():
yield real_conn
return _cm()
def test_materialize_query_parquet_sliced_merges_via_duckdb(tmp_path):
"""Sliced parquet output: each slice is itself a complete parquet file
(Snowflake UNLOAD MAX_FILE_SIZE behavior). The extractor must use
``download_file_slices`` to keep them as separate files, then
DuckDB-COPY across ``read_parquet([slice1, slice2])`` to merge
naive concat would corrupt the per-slice footer."""
def fake_prepare(table_id, *, export_filter=None, export_timeout=None):
return {
"job_id": 100, "file_id": 200, "rows": 4,
"file_info": {"id": 200, "url": "https://fake/manifest", "isSliced": True},
"file_type": "parquet",
}
def fake_download_slices(file_info, dest_dir):
dest_dir = Path(dest_dir)
dest_dir.mkdir(parents=True, exist_ok=True)
s1, s2 = dest_dir / "slice-00000", dest_dir / "slice-00001"
_write_parquet(s1, n_rows=2)
_write_parquet(s2, n_rows=2)
return [s1, s2]
client = MagicMock()
client.prepare_export.side_effect = fake_prepare
client.download_file_slices.side_effect = fake_download_slices
output_dir = tmp_path / "out"
output_dir.mkdir()
result = kbe.materialize_query(
table_id="big_table",
bucket="in.c-x", source_table="t",
source_query=None,
storage_client=client,
output_dir=output_dir,
)
# Final parquet contains all 4 rows from both slices.
final = output_dir / "big_table.parquet"
assert final.exists()
n = duckdb.connect().execute(
f"SELECT COUNT(*) FROM read_parquet('{str(final).replace(chr(39), chr(39)*2)}')"
).fetchone()[0]
assert n == 4
assert result["rows"] == 4
# Slices were not concatenated raw (would leave 2 footers in one file
# and break DuckDB on read).
client.download_file_slices.assert_called_once()
def test_materialize_query_parquet_zero_rows_emits_empty_parquet(tmp_path, caplog):
"""Storage API parquet succeeded but the filter matched 0 rows (file
is empty/missing). We log a warning and emit an empty placeholder."""
def fake_prepare(table_id, *, export_filter=None, export_timeout=None):
return {
"job_id": 1, "file_id": 2, "rows": 0,
"file_info": {"id": 2, "url": "https://fake/x", "isSliced": False},
"file_type": "parquet",
}
def fake_download(file_info, dest_path):
# Don't create the file — simulates no-rows result.
return Path(dest_path)
client = MagicMock()
client.prepare_export.side_effect = fake_prepare
client.download_file.side_effect = fake_download
output_dir = tmp_path / "out"
output_dir.mkdir()
@ -69,135 +193,305 @@ def test_materialize_query_zero_rows_logs_warning(tmp_path, caplog):
with caplog.at_level("WARNING"):
result = kbe.materialize_query(
table_id="empty_subset",
sql="SELECT * FROM t",
keboola_access=FakeAccess(),
bucket="in.c-test", source_table="empty",
source_query=None,
storage_client=client,
output_dir=output_dir,
)
assert result["rows"] == 0
assert "0 rows" in caplog.text or "empty" in caplog.text.lower()
assert (output_dir / "empty_subset.parquet").exists()
assert "no data" in caplog.text.lower() or "0 rows" in caplog.text
def test_materialize_query_rejects_unsafe_table_id(tmp_path):
def test_materialize_query_admin_can_pin_file_type_csv(tmp_path, fake_storage_client_csv):
"""Admin can opt out of parquet via ``source_query='{"file_type":"csv"}'``
falls back to CSV DuckDB-COPY parquet."""
output_dir = tmp_path / "out"
output_dir.mkdir()
result = kbe.materialize_query(
table_id="legacy_csv",
bucket="in.c-x", source_table="t",
source_query='{"file_type": "csv"}',
storage_client=fake_storage_client_csv,
output_dir=output_dir,
)
assert (output_dir / "legacy_csv.parquet").exists()
assert result["rows"] == 2
# Storage client called with file_type=csv on the ExportFilter.
call = fake_storage_client_csv.export_table.call_args
assert call.args[0] == "in.c-x.t"
assert call.kwargs["export_filter"].file_type == "csv"
# ---- tempdir cleanup on failure --------------------------------------------
def test_materialize_query_sliced_parquet_tempdir_cleaned_on_exception(tmp_path):
"""When a sliced parquet download raises mid-flight (e.g. OSError 28
'No space left'), the per-call tempdir at /tmp/kbc-export-<id>-*
that was already populated with downloaded slices must not survive.
Regression: an earlier worker death mid-write left a 12 GiB stale
slice tree on the boot disk because TemporaryDirectory's default
cleanup path itself raised under disk-full state, masking the
original exception AND leaving the dir behind. The fix uses
``ignore_cleanup_errors=True`` so cleanup is best-effort but always
fires the dir is empty (or at least mostly) after the function
returns."""
captured_tmpdir: dict[str, Path] = {}
def fake_prepare(table_id, *, export_filter=None, export_timeout=None):
return {
"job_id": 1, "file_id": 2, "rows": 1,
"file_info": {"id": 2, "url": "https://fake/manifest", "isSliced": True},
"file_type": "parquet",
}
def boom_download_slices(file_info, dest_dir):
# Capture the tempdir the extractor created (parent of dest_dir).
captured_tmpdir["path"] = Path(dest_dir).parent
# Simulate a real download writing partial state, then disk full.
Path(dest_dir).mkdir(parents=True, exist_ok=True)
(Path(dest_dir) / "slice-00000").write_bytes(b"PAR1...partial")
raise OSError(28, "No space left on device")
client = MagicMock()
client.prepare_export.side_effect = fake_prepare
client.download_file_slices.side_effect = boom_download_slices
output_dir = tmp_path / "out"
output_dir.mkdir()
with pytest.raises(OSError, match="No space left"):
kbe.materialize_query(
table_id="will_fail_sliced",
bucket="in.c-test", source_table="t",
source_query=None,
storage_client=client,
output_dir=output_dir,
)
# The tempdir that held the partial slice must be gone (or at least
# not the half-populated state that leaked previously).
assert "path" in captured_tmpdir, "download_file_slices was not invoked"
leftover = captured_tmpdir["path"]
assert not leftover.exists(), (
f"tempdir {leftover} must be cleaned on exception "
f"(otherwise leaks under disk-full conditions)"
)
# Final parquet must NOT exist.
assert not (output_dir / "will_fail_sliced.parquet").exists()
# ---- AGNES_TEMP_DIR routing -------------------------------------------------
def test_materialize_query_uses_AGNES_TEMP_DIR_when_set(
monkeypatch, tmp_path, fake_storage_client_parquet,
):
"""The per-call tempdir lands under ``AGNES_TEMP_DIR`` when set —
routes Snowflake-UNLOAD slice staging off the container's overlayfs
/tmp onto the data disk. Capture the dir the storage_client receives
via download_file's dest_path and assert it's under the configured
root.
Regression context: agnes-dev's boot disk filled to 100% during a
180-day kbc_job sync because slices accumulated in /tmp; the data
disk had 15 GiB free at the time."""
custom_root = tmp_path / "agnes-tmp"
custom_root.mkdir()
monkeypatch.setenv("AGNES_TEMP_DIR", str(custom_root))
output_dir = tmp_path / "out"
output_dir.mkdir()
kbe.materialize_query(
table_id="anywhere",
bucket="in.c-x", source_table="t",
source_query=None,
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
# The tempdir created by `materialize_query` is anonymous, but
# `tempfile.TemporaryDirectory(dir=root, ...)` always places its
# dir as a direct child of `root`. After materialize_query returns
# the dir is cleaned, so check the root only contains paths that
# WOULD have been under it (post-cleanup it's empty — that's still
# the contract; the assertion is "AGNES_TEMP_DIR was honored as
# the parent"). We do this indirectly by calling get_temp_root
# ourselves under the same env and asserting the value flows.
from connectors.keboola.storage_api import get_temp_root
assert get_temp_root() == str(custom_root)
# And the dir is empty post-run (cleanup happened) but still exists
# — i.e. we didn't accidentally delete the operator's chosen root.
assert custom_root.is_dir()
def test_materialize_query_falls_back_to_system_tmp_when_unset(
monkeypatch, tmp_path, fake_storage_client_parquet,
):
"""No AGNES_TEMP_DIR → no behavioural change vs. pre-fix code.
The function still returns successfully; we don't peek inside
/tmp itself (CI-unfriendly), just assert the run completed and
the parquet exists at output_dir as expected."""
monkeypatch.delenv("AGNES_TEMP_DIR", raising=False)
output_dir = tmp_path / "out"
output_dir.mkdir()
result = kbe.materialize_query(
table_id="default_tmp",
bucket="in.c-x", source_table="t",
source_query=None,
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
assert (output_dir / "default_tmp.parquet").exists()
assert result["rows"] == 2
# ---- generic guards (file_type-agnostic) -----------------------------------
def test_materialize_query_rejects_unsafe_table_id(tmp_path, fake_storage_client_parquet):
"""Defense: table_id is interpolated into the parquet filename. SQL/
path-traversal-unsafe values must be rejected up-front (mirror of BQ
materialize_query's validation)."""
class FakeAccess:
def duckdb_session(self):
raise AssertionError("should not be called")
path-traversal-unsafe values must be rejected up-front."""
output_dir = tmp_path / "out"
output_dir.mkdir()
with pytest.raises(ValueError, match="table_id"):
kbe.materialize_query(
table_id="../../etc/passwd",
sql="SELECT 1",
keboola_access=FakeAccess(),
bucket="in.c-test", source_table="t",
source_query=None,
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
def test_keboola_materialize_atomic_write_on_failure(tmp_path, monkeypatch):
"""Devin finding 2026-05-01 (BUG_pr-review-job-3fbd31c9_0003):
if the COPY raises mid-stream, no partial file is left at the final
.parquet path AND the .parquet.tmp staging file is cleaned up. Pre-fix,
materialize_query wrote directly to the final path, so a network/disk
error mid-COPY would leave a corrupt parquet that the orchestrator
rebuild could pick up and serve to analysts."""
from connectors.keboola import extractor as kbe
def test_materialize_query_invalid_source_query_json_raises(tmp_path, fake_storage_client_parquet):
output_dir = tmp_path / "out"
output_dir.mkdir()
with pytest.raises(ValueError, match="not valid JSON"):
kbe.materialize_query(
table_id="bad_filter",
bucket="in.c-test", source_table="t",
source_query="this is not json",
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
def test_materialize_query_passes_filter_spec_to_export(tmp_path, fake_storage_client_parquet):
"""source_query JSON is parsed into ExportFilter and forwarded to the
Storage API client. Verifies the dispatch shape the actual
filterparams conversion is covered in test_keboola_storage_api.py."""
output_dir = tmp_path / "out"
output_dir.mkdir()
kbe.materialize_query(
table_id="filtered",
bucket="in.c-sales", source_table="orders",
source_query=(
'{"where_filters": [{"column": "status", "operator": "eq", '
'"values": ["open"]}], "columns": ["id"]}'
),
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
f = fake_storage_client_parquet.prepare_export.call_args.kwargs["export_filter"]
assert f.where_filters == [
{"column": "status", "operator": "eq", "values": ["open"]}
]
assert f.columns == ["id"]
# No explicit file_type → defaults to parquet.
assert f.file_type == "parquet"
# ---- atomic write contract -------------------------------------------------
def test_keboola_materialize_atomic_write_on_failure(tmp_path):
"""If the CSV→parquet conversion fails (legacy CSV opt-in), no
partial file is left at the final .parquet path AND the .parquet.tmp
staging file is cleaned up."""
def fake_export(table_id, dest, *, export_filter=None, export_timeout=None):
_seed_csv(Path(dest), "id,name", ["1,alpha"])
return {"job_id": 1, "file_id": 2, "rows": 1,
"bytes": Path(dest).stat().st_size, "file_type": "csv"}
client = MagicMock()
client.export_table.side_effect = fake_export
output_dir = tmp_path / "data"
output_dir.mkdir()
class FakeAccess:
def duckdb_session(self):
from contextlib import contextmanager
real_connect = duckdb.connect
class FailingConn:
def execute(self, sql, *a, **kw):
if "COPY" in sql:
raise RuntimeError("simulated mid-COPY failure")
raise AssertionError("unexpected execute: " + sql)
class FailingConn:
def __init__(self, inner):
self._inner = inner
def close(self):
pass
def execute(self, sql, *a, **kw):
if "FORMAT PARQUET" in sql:
raise RuntimeError("simulated mid-COPY failure")
return self._inner.execute(sql, *a, **kw)
@contextmanager
def _cm():
yield FailingConn()
return _cm()
def close(self):
self._inner.close()
with pytest.raises(RuntimeError, match="simulated mid-COPY failure"):
kbe.materialize_query(
table_id="atomic_test",
sql="SELECT 1",
keboola_access=FakeAccess(),
output_dir=output_dir,
)
def patched_connect(*args, **kwargs):
return FailingConn(real_connect(*args, **kwargs))
with patch("connectors.keboola.extractor.duckdb.connect", side_effect=patched_connect):
with pytest.raises(RuntimeError, match="simulated mid-COPY failure"):
kbe.materialize_query(
table_id="atomic_test",
bucket="in.c-test", source_table="t",
source_query='{"file_type": "csv"}',
storage_client=client,
output_dir=output_dir,
)
# Final parquet must NOT exist (we never reached os.replace).
final_path = output_dir / "atomic_test.parquet"
assert not final_path.exists(), (
f"Partial parquet left at final path {final_path} — orchestrator "
f"rebuild would pick this up and serve corrupt data."
)
# tmp file also cleaned up (the extractor unlinks it on COPY failure).
tmp_path_marker = output_dir / "atomic_test.parquet.tmp"
assert not tmp_path_marker.exists(), (
f"Stale .parquet.tmp left at {tmp_path_marker}"
)
tmp_marker = output_dir / "atomic_test.parquet.tmp"
assert not tmp_marker.exists(), f"Stale .parquet.tmp left at {tmp_marker}"
def test_keboola_materialize_uses_tmp_path_during_copy(tmp_path):
"""Atomic-write contract: COPY targets <id>.parquet.tmp first (verifiable
via the SQL string passed to conn.execute). After success, the file lands
at <id>.parquet (no .tmp suffix). This documents the contract that
BUG_pr-review-job-3fbd31c9_0003 closed."""
import duckdb
from connectors.keboola import extractor as kbe
real_conn = duckdb.connect(":memory:")
real_conn.execute("CREATE TABLE t AS SELECT 1 AS x, 'hello' AS y")
sqls_seen = []
class TracingConn:
"""Thin wrapper that records SQL strings. DuckDBPyConnection.execute
is read-only, so monkey-patching the method directly fails."""
def __init__(self, inner):
self._inner = inner
def execute(self, sql, *args, **kwargs):
sqls_seen.append(sql)
return self._inner.execute(sql, *args, **kwargs)
def close(self):
self._inner.close()
class FakeAccess:
def duckdb_session(self):
from contextlib import contextmanager
@contextmanager
def _cm():
yield TracingConn(real_conn)
return _cm()
def test_keboola_materialize_uses_tmp_path_during_copy(tmp_path, fake_storage_client_parquet):
"""Atomic-write contract: parquet first lands at <id>.parquet.tmp, then
is os.replaced into <id>.parquet on success. Verified by patching
os.replace to capture the (src, dst) pair."""
output_dir = tmp_path / "data"
output_dir.mkdir()
result = kbe.materialize_query(
table_id="tmp_path_test",
sql="SELECT * FROM t",
keboola_access=FakeAccess(),
output_dir=output_dir,
)
captured = {}
real_replace = os.replace
# COPY SQL targeted .parquet.tmp.
copy_sql = next((s for s in sqls_seen if "COPY" in s), None)
assert copy_sql is not None, sqls_seen
assert ".parquet.tmp" in copy_sql, copy_sql
def trace_replace(src, dst):
captured["src"] = str(src)
captured["dst"] = str(dst)
real_replace(src, dst)
with patch.object(kbe.os, "replace", side_effect=trace_replace):
result = kbe.materialize_query(
table_id="tmp_path_test",
bucket="in.c-test", source_table="t",
source_query=None,
storage_client=fake_storage_client_parquet,
output_dir=output_dir,
)
assert captured["src"].endswith(".parquet.tmp"), captured
assert captured["dst"].endswith(".parquet") and not captured["dst"].endswith(".tmp")
# Final file landed without .tmp suffix.
assert (output_dir / "tmp_path_test.parquet").exists()
assert not (output_dir / "tmp_path_test.parquet.tmp").exists()
assert result["path"].endswith(".parquet")

View file

@ -0,0 +1,520 @@
"""KeboolaStorageClient — direct Storage API export-async path.
Replaces the previous DuckDB-extension materialize path (extension scan
broken on linked-bucket projects, see keboola/duckdb-extension#17). Tests
mock the requests.Session at the adapter level so we exercise the real
HTTP shapes (status codes, JSON bodies) without touching the network.
"""
from __future__ import annotations
import gzip
import json
from io import BytesIO
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
import requests
from connectors.keboola.storage_api import (
FILE_TYPE_CSV,
FILE_TYPE_PARQUET,
ExportFilter,
KeboolaStorageClient,
StorageApiError,
get_temp_root,
)
# ---- ExportFilter ----------------------------------------------------------
class TestExportFilter:
def test_empty_dict_means_full_table(self):
f = ExportFilter.from_dict({})
assert f.to_export_params() == {}
def test_none_means_full_table(self):
f = ExportFilter.from_dict(None)
assert f.to_export_params() == {}
def test_where_filters_columns_changed_since(self):
f = ExportFilter.from_dict({
"where_filters": [
{"column": "status", "operator": "eq", "values": ["open"]},
],
"columns": ["id", "status"],
"changed_since": "2026-04-01",
})
params = f.to_export_params()
assert params["whereFilters"] == [
{"column": "status", "operator": "eq", "values": ["open"]}
]
# Storage API takes columns as comma-joined string, not array — the
# `kbcstorage` SDK does the same join, so match its wire format.
assert params["columns"] == "id,status"
assert params["changedSince"] == "2026-04-01"
def test_where_filter_missing_keys_raises_with_context(self):
f = ExportFilter.from_dict({
"where_filters": [{"column": "x", "operator": "eq"}], # no values
})
with pytest.raises(ValueError, match=r"missing fields.*\['values'\]"):
f.to_export_params()
def test_where_filter_values_must_be_list(self):
f = ExportFilter.from_dict({
"where_filters": [{"column": "x", "operator": "eq", "values": "open"}],
})
with pytest.raises(ValueError, match="values must be a list"):
f.to_export_params()
def test_default_file_type_is_csv_and_omits_param(self):
# Wire-side default is csv — preserve old behavior for callers
# that never set file_type.
assert ExportFilter().file_type == FILE_TYPE_CSV
assert "fileType" not in ExportFilter().to_export_params()
def test_file_type_parquet_emits_fileType_param(self):
f = ExportFilter(file_type=FILE_TYPE_PARQUET)
assert f.to_export_params()["fileType"] == "parquet"
def test_from_dict_reads_file_type_snake_case(self):
f = ExportFilter.from_dict({"file_type": "parquet"})
assert f.file_type == "parquet"
assert f.to_export_params()["fileType"] == "parquet"
def test_from_dict_reads_fileType_camel_case_alias(self):
# Operators copying examples from Apiary docs ship the wire name.
f = ExportFilter.from_dict({"fileType": "parquet"})
assert f.file_type == "parquet"
def test_from_dict_invalid_file_type_raises(self):
with pytest.raises(ValueError, match="file_type"):
ExportFilter.from_dict({"file_type": "orc"})
# ---- HTTP client low-level -------------------------------------------------
def _mock_response(status, body):
"""Build a fake `requests.Response`-like object."""
resp = MagicMock(spec=requests.Response)
resp.status_code = status
resp.json.return_value = body
resp.text = json.dumps(body)
return resp
class TestStorageClient:
def test_init_normalises_trailing_slash(self):
c = KeboolaStorageClient(url="https://kbc/", token="t")
assert c.base.endswith("/v2/storage")
assert "/" * 2 not in c.base.replace("https://", "")
def test_init_rejects_missing_url_or_token(self):
with pytest.raises(ValueError):
KeboolaStorageClient(url="", token="t")
with pytest.raises(ValueError):
KeboolaStorageClient(url="https://kbc", token="")
def test_post_sends_storage_api_token_header(self):
sess = MagicMock()
sess.post.return_value = _mock_response(200, {"id": 42})
c = KeboolaStorageClient(url="https://kbc", token="abc", session=sess)
c.export_table_async("in.c-x.t", {"columns": "a"})
sess.post.assert_called_once()
kwargs = sess.post.call_args.kwargs
assert kwargs["headers"]["X-StorageApi-Token"] == "abc"
def test_post_4xx_redacts_token_in_error_message(self):
# If the API echoes the token (or a proxy injects it), we must not
# leak it into raised exceptions.
sess = MagicMock()
sess.post.return_value = _mock_response(
403, {"detail": "rejected token=secrettoken123"}
)
c = KeboolaStorageClient(url="https://kbc", token="secrettoken123", session=sess)
with pytest.raises(StorageApiError) as e:
c.export_table_async("in.c-x.t", {})
assert "secrettoken123" not in str(e.value)
assert "<redacted-storage-token>" in str(e.value)
# ---- wait_for_job ----------------------------------------------------------
class TestWaitForJob:
def test_returns_on_success(self):
sess = MagicMock()
sess.get.return_value = _mock_response(200, {
"id": 1, "status": "success", "results": {"file": {"id": 99}},
})
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
job = c.wait_for_job(1, timeout=5, poll_interval=0.01)
assert job["status"] == "success"
def test_raises_on_error_status(self):
sess = MagicMock()
sess.get.return_value = _mock_response(200, {
"id": 1, "status": "error", "error": {"message": "bad table"},
})
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
with pytest.raises(StorageApiError, match="reported error"):
c.wait_for_job(1, timeout=5, poll_interval=0.01)
def test_polls_until_terminal(self):
# First two responses 'waiting', third 'success'. The client must
# keep polling instead of giving up.
sess = MagicMock()
sess.get.side_effect = [
_mock_response(200, {"id": 1, "status": "waiting"}),
_mock_response(200, {"id": 1, "status": "processing"}),
_mock_response(200, {"id": 1, "status": "success", "results": {"file": {"id": 7}}}),
]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
job = c.wait_for_job(1, timeout=5, poll_interval=0.01)
assert job["status"] == "success"
assert sess.get.call_count == 3
def test_timeout_raises_with_job_id(self):
sess = MagicMock()
sess.get.return_value = _mock_response(200, {"id": 1, "status": "waiting"})
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
with pytest.raises(StorageApiError, match="did not finish"):
c.wait_for_job(1, timeout=0.1, poll_interval=0.05)
# ---- download_file ---------------------------------------------------------
class TestDownloadFile:
def test_single_file_csv_passthrough(self, tmp_path):
sess = MagicMock()
# File detail returns a signed URL for a non-sliced .csv; download
# streams it directly.
single_resp = MagicMock()
single_resp.__enter__ = MagicMock(return_value=single_resp)
single_resp.__exit__ = MagicMock(return_value=False)
single_resp.iter_content.return_value = [b"col1,col2\n", b"a,1\n", b"b,2\n"]
single_resp.raise_for_status = MagicMock()
sess.get.return_value = single_resp
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
dest = tmp_path / "out.csv"
c.download_file({
"url": "https://signed/single.csv",
"name": "single.csv",
"isSliced": False,
}, dest)
assert dest.exists()
assert dest.read_bytes() == b"col1,col2\na,1\nb,2\n"
def test_single_file_gz_is_gunzipped(self, tmp_path):
gzipped = BytesIO()
with gzip.GzipFile(fileobj=gzipped, mode="wb") as gz:
gz.write(b"col1,col2\nx,42\n")
payload = gzipped.getvalue()
sess = MagicMock()
single_resp = MagicMock()
single_resp.__enter__ = MagicMock(return_value=single_resp)
single_resp.__exit__ = MagicMock(return_value=False)
single_resp.iter_content.return_value = [payload]
single_resp.raise_for_status = MagicMock()
sess.get.return_value = single_resp
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
dest = tmp_path / "out.csv"
c.download_file({
"url": "https://signed/single.csv.gz",
"name": "single.csv.gz",
"isSliced": False,
}, dest)
assert dest.read_bytes() == b"col1,col2\nx,42\n"
def test_sliced_concat_in_order(self, tmp_path):
# isSliced=True: detail.url points at a JSON manifest of slice URLs.
# Simulate two slices: slice 0 (header + rows), slice 1 (more rows,
# NO header per Storage API contract). We just concatenate bytes —
# the contract test is "every slice's bytes appear in dest, in order".
sess = MagicMock()
manifest_resp = MagicMock()
manifest_resp.json.return_value = {
"entries": [
{"url": "https://signed/slice-0"},
{"url": "https://signed/slice-1"},
]
}
manifest_resp.raise_for_status = MagicMock()
slice0 = MagicMock()
slice0.__enter__ = MagicMock(return_value=slice0)
slice0.__exit__ = MagicMock(return_value=False)
slice0.iter_content.return_value = [b"col\n", b"a\n"]
slice0.raise_for_status = MagicMock()
slice1 = MagicMock()
slice1.__enter__ = MagicMock(return_value=slice1)
slice1.__exit__ = MagicMock(return_value=False)
slice1.iter_content.return_value = [b"b\n"]
slice1.raise_for_status = MagicMock()
sess.get.side_effect = [manifest_resp, slice0, slice1]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
dest = tmp_path / "out.csv"
c.download_file({
"url": "https://signed/manifest.json",
"name": "sliced",
"isSliced": True,
}, dest)
assert dest.read_bytes() == b"col\na\nb\n"
# ---- end-to-end export_table_to_csv ---------------------------------------
class TestExportTableToCsv:
def test_full_pipeline_calls_post_poll_detail_download(self, tmp_path):
"""Smoke: export-async → wait_for_job → file_detail → download.
Mock the session at the boundary; assert the URL composition and
order of operations match the contract. The actual bytes-written
path is covered by TestDownloadFile."""
sess = MagicMock()
# 1) POST /tables/X/export-async → {id: 100}
export_resp = _mock_response(200, {"id": 100})
# 2) GET /jobs/100 → success with file id 200
job_resp = _mock_response(200, {
"id": 100,
"status": "success",
"results": {"file": {"id": 200}, "totalRowsCount": 5},
})
# 3) GET /files/200?federationToken=1 → single non-sliced URL
file_resp = _mock_response(200, {
"url": "https://signed/file.csv",
"name": "file.csv",
"isSliced": False,
})
# 4) GET https://signed/file.csv (download)
download_resp = MagicMock()
download_resp.__enter__ = MagicMock(return_value=download_resp)
download_resp.__exit__ = MagicMock(return_value=False)
download_resp.iter_content.return_value = [b"col\n1\n"]
download_resp.raise_for_status = MagicMock()
# session.get is called for: jobs (poll), file detail, download.
# session.post for the export-async kickoff.
sess.post.return_value = export_resp
sess.get.side_effect = [job_resp, file_resp, download_resp]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
dest = tmp_path / "out.csv"
stats = c.export_table_to_csv(
"in.c-x.t", dest,
export_filter=ExportFilter(columns=["col"]),
)
assert dest.read_bytes() == b"col\n1\n"
assert stats["job_id"] == 100
assert stats["file_id"] == 200
assert stats["rows"] == 5
assert stats["bytes"] == len(b"col\n1\n")
# Assert export-async POST URL composition + body shape
post_url = sess.post.call_args.args[0]
assert post_url == "https://kbc/v2/storage/tables/in.c-x.t/export-async"
post_body = sess.post.call_args.kwargs["data"]
assert post_body["columns"] == "col"
def test_missing_job_id_in_response_is_typed_error(self):
sess = MagicMock()
sess.post.return_value = _mock_response(200, {}) # no `id`
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
with pytest.raises(StorageApiError, match="missing job id"):
c.export_table_to_csv("in.c-x.t", Path("/tmp/x"))
def test_missing_file_in_job_results_is_typed_error(self, tmp_path):
sess = MagicMock()
sess.post.return_value = _mock_response(200, {"id": 1})
sess.get.return_value = _mock_response(200, {
"id": 1, "status": "success", "results": {}, # no `file`
})
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
with pytest.raises(StorageApiError, match="no result file"):
c.export_table_to_csv("in.c-x.t", tmp_path / "x")
# ---- prepare_export + download_file_slices (parquet path) ------------------
class TestParquetPath:
def test_parquet_request_emits_fileType_in_post_body(self, tmp_path):
sess = MagicMock()
sess.post.return_value = _mock_response(200, {"id": 100})
sess.get.side_effect = [
_mock_response(200, {
"id": 100, "status": "success",
"results": {"file": {"id": 200}, "totalRowsCount": 3},
}),
_mock_response(200, {
"id": 200, "url": "https://signed/x.parquet",
"name": "x.parquet", "isSliced": False,
}),
]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
prep = c.prepare_export(
"in.c-x.t",
export_filter=ExportFilter(file_type=FILE_TYPE_PARQUET),
)
assert prep["file_type"] == "parquet"
assert prep["file_info"]["isSliced"] is False
assert sess.post.call_args.kwargs["data"]["fileType"] == "parquet"
def test_export_table_rejects_sliced_parquet(self, tmp_path):
"""Concatenating sliced parquet would corrupt per-slice footers.
``export_table`` must fail loud and direct callers at
``download_file_slices``."""
sess = MagicMock()
sess.post.return_value = _mock_response(200, {"id": 1})
sess.get.side_effect = [
_mock_response(200, {
"id": 1, "status": "success",
"results": {"file": {"id": 2}},
}),
_mock_response(200, {
"id": 2, "url": "https://signed/manifest.json",
"name": "x.parquet", "isSliced": True,
}),
]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
with pytest.raises(StorageApiError, match="sliced parquet"):
c.export_table(
"in.c-x.t", tmp_path / "x.parquet",
export_filter=ExportFilter(file_type=FILE_TYPE_PARQUET),
)
def test_download_file_slices_returns_per_slice_paths(self, tmp_path):
sess = MagicMock()
manifest_resp = MagicMock()
manifest_resp.json.return_value = {
"entries": [
{"url": "https://signed/slice-0"},
{"url": "https://signed/slice-1"},
],
}
manifest_resp.raise_for_status = MagicMock()
def mk_chunk_resp(payload: bytes):
r = MagicMock()
r.__enter__ = MagicMock(return_value=r)
r.__exit__ = MagicMock(return_value=False)
r.iter_content.return_value = [payload]
r.raise_for_status = MagicMock()
return r
slice0 = mk_chunk_resp(b"PAR1...slice0...")
slice1 = mk_chunk_resp(b"PAR1...slice1...")
sess.get.side_effect = [manifest_resp, slice0, slice1]
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
paths = c.download_file_slices(
{"url": "https://signed/manifest.json", "isSliced": True,
"name": "x.parquet"},
tmp_path / "slices",
)
assert len(paths) == 2
assert paths[0].read_bytes() == b"PAR1...slice0..."
assert paths[1].read_bytes() == b"PAR1...slice1..."
# Naming preserves manifest order — required for deterministic
# downstream merge.
assert paths[0].name < paths[1].name
def test_download_file_slices_refuses_non_sliced(self):
c = KeboolaStorageClient(url="https://kbc", token="t",
session=MagicMock())
with pytest.raises(StorageApiError, match="non-sliced"):
c.download_file_slices(
{"url": "https://x", "isSliced": False}, Path("/tmp/x"),
)
def test_get_temp_root_unset_returns_none(self, monkeypatch):
"""No env var → None → tempfile falls back to system default
(typically /tmp). Preserves OSS-pre-fix behaviour for users
who haven't set AGNES_TEMP_DIR."""
monkeypatch.delenv("AGNES_TEMP_DIR", raising=False)
assert get_temp_root() is None
def test_get_temp_root_creates_dir_when_missing(self, monkeypatch, tmp_path):
"""First-time use: target dir doesn't yet exist; helper mkdirs
it (non-recursive parents handled by exist_ok). Returns the
absolute path so tempfile uses it as the parent for staging."""
target = tmp_path / "agnes-tmp-fresh"
assert not target.exists()
monkeypatch.setenv("AGNES_TEMP_DIR", str(target))
assert get_temp_root() == str(target)
assert target.is_dir()
def test_get_temp_root_existing_dir_reused(self, monkeypatch, tmp_path):
target = tmp_path / "agnes-tmp-existing"
target.mkdir()
monkeypatch.setenv("AGNES_TEMP_DIR", str(target))
assert get_temp_root() == str(target)
def test_get_temp_root_unwritable_falls_back(self, monkeypatch, tmp_path, caplog):
"""Sandboxes / read-only mounts make the target uncreatable; the
helper logs a warning and returns None so tempfile falls back
to the system default rather than blowing up the sync run."""
# Point at a path under a read-only parent that doesn't exist.
unwritable = "/nonexistent/forbidden/agnes-tmp"
monkeypatch.setenv("AGNES_TEMP_DIR", unwritable)
with caplog.at_level("WARNING"):
assert get_temp_root() is None
assert any("AGNES_TEMP_DIR" in r.message for r in caplog.records)
def test_get_temp_root_empty_string_treated_as_unset(self, monkeypatch):
# Operator who left ``AGNES_TEMP_DIR=`` (empty) in .env doesn't
# get an mkdir of "" — same as unset.
monkeypatch.setenv("AGNES_TEMP_DIR", "")
assert get_temp_root() is None
def test_parquet_download_does_not_gunzip_plain_parquet(self, tmp_path):
"""Regression: previous heuristic flagged any unencrypted file as
gzipped, which would corrupt parquet downloads at gunzip time.
Verify a `.parquet` file is written through unmodified."""
sess = MagicMock()
single_resp = MagicMock()
single_resp.__enter__ = MagicMock(return_value=single_resp)
single_resp.__exit__ = MagicMock(return_value=False)
# Real parquet magic bytes — not valid gzip, would crash gunzip.
single_resp.iter_content.return_value = [b"PAR1\x00\x00\x00binary"]
single_resp.raise_for_status = MagicMock()
sess.get.return_value = single_resp
c = KeboolaStorageClient(url="https://kbc", token="t", session=sess)
dest = tmp_path / "out.parquet"
c.download_file({
"url": "https://signed/x.parquet",
"name": "x.parquet",
"isSliced": False,
"isEncrypted": False,
}, dest)
assert dest.read_bytes() == b"PAR1\x00\x00\x00binary"

View file

@ -0,0 +1,106 @@
"""SyncOrchestrator._update_sync_state must store the content MD5.
`agnes pull` re-hashes the downloaded parquet bytes and compares against
the manifest's hash for that table. If the orchestrator stores a
fingerprint (mtime+size) or a truncated MD5, every `agnes pull` of a
Keboola local-mode table fails with `hash mismatch: expected got `.
"""
import hashlib
from unittest.mock import patch
import duckdb
import pytest
from src.db import _ensure_schema
from src.orchestrator import SyncOrchestrator
from src.repositories.sync_state import SyncStateRepository
from src.repositories.table_registry import TableRegistryRepository
@pytest.fixture
def system_db_path(tmp_path):
"""Path to a system.duckdb the orchestrator opens via get_system_db."""
db_path = tmp_path / "system.duckdb"
conn = duckdb.connect(str(db_path))
try:
_ensure_schema(conn)
TableRegistryRepository(conn).register(
id="orders", name="orders", source_type="keboola",
bucket="in.c-crm", source_table="orders", query_mode="local",
description="",
)
finally:
conn.close()
return db_path
@pytest.fixture
def parquet_with_known_md5(tmp_path):
"""Lay down /tmp/data/extracts/keboola/data/orders.parquet with bytes
whose MD5 the test knows up front."""
extracts = tmp_path / "extracts" / "keboola" / "data"
extracts.mkdir(parents=True)
pq = extracts / "orders.parquet"
bytes_payload = b"PAR1" + b"x" * 1024 + b"PAR1"
pq.write_bytes(bytes_payload)
return pq, hashlib.md5(bytes_payload).hexdigest()
def _run_update(system_db_path, meta_rows, data_dir):
"""Helper: invoke `_update_sync_state` with `get_system_db` redirected
at our test DB and `_get_extracts_dir` redirected at our temp tree."""
def fake_get_system_db():
return duckdb.connect(str(system_db_path))
with patch("src.db.get_system_db", side_effect=fake_get_system_db), \
patch("src.orchestrator._get_extracts_dir", return_value=data_dir / "extracts"):
orch = SyncOrchestrator.__new__(SyncOrchestrator)
orch._update_sync_state(meta_rows=meta_rows, source_name="keboola")
def test_update_sync_state_stores_content_md5(
system_db_path, parquet_with_known_md5, tmp_path
):
"""The hash written into sync_state must equal MD5 of the parquet's
raw bytes, full 32 hex chars same shape as the CLI's `_md5_file`."""
pq_path, expected_md5 = parquet_with_known_md5
_run_update(
system_db_path,
meta_rows=[("orders", 100, pq_path.stat().st_size, "local")],
data_dir=tmp_path,
)
conn = duckdb.connect(str(system_db_path))
try:
state = SyncStateRepository(conn).get_table_state("orders")
finally:
conn.close()
assert state is not None, "sync_state row should exist"
stored = state["hash"]
assert stored == expected_md5, (
f"sync_state.hash must be the content MD5 ({expected_md5}) "
f"so `agnes pull` post-download integrity check passes; got {stored!r}"
)
assert len(stored) == 32, "full hex MD5, not truncated"
def test_update_sync_state_empty_hash_when_parquet_missing(
system_db_path, tmp_path
):
"""If the parquet isn't on disk (race / failed extract), store empty
string rather than crashing or writing a stale hash."""
(tmp_path / "extracts" / "keboola" / "data").mkdir(parents=True)
_run_update(
system_db_path,
meta_rows=[("orders", 0, 0, "local")],
data_dir=tmp_path,
)
conn = duckdb.connect(str(system_db_path))
try:
state = SyncStateRepository(conn).get_table_state("orders")
finally:
conn.close()
assert state is not None
assert state["hash"] == ""

View file

@ -64,3 +64,85 @@ def test_due_check_skipped_uses_due_check_reason(fake_registry_with_one_material
summary = _run_materialized_pass(MagicMock(), MagicMock())
assert summary["skipped"] == [{"table": "in_flight_t", "reason": "due_check"}]
# ---- targeted-trigger filter -----------------------------------------------
@pytest.fixture
def fake_registry_with_three_materialized(monkeypatch, tmp_path):
"""Three materialized rows so we can verify ``tables=['orders']`` only
touches `orders` and skips the other two with ``reason='not_in_target'``."""
monkeypatch.setenv("DATA_DIR", str(tmp_path))
rows = [
{"id": "orders", "name": "orders", "query_mode": "materialized",
"source_type": "bigquery", "source_query": "SELECT 1", "sync_schedule": None},
{"id": "customers", "name": "customers", "query_mode": "materialized",
"source_type": "bigquery", "source_query": "SELECT 1", "sync_schedule": None},
{"id": "events", "name": "events", "query_mode": "materialized",
"source_type": "bigquery", "source_query": "SELECT 1", "sync_schedule": None},
]
class _Repo:
def __init__(self, conn): pass
def list_all(self): return rows
class _State:
def __init__(self, conn): pass
def get_last_sync(self, _id): return None
def set_error(self, *a, **kw): pass
def update_sync(self, **kw): pass
monkeypatch.setattr("app.api.sync.TableRegistryRepository", _Repo)
monkeypatch.setattr("app.api.sync.SyncStateRepository", _State)
return rows
def test_targeted_trigger_only_processes_listed_tables(
fake_registry_with_three_materialized,
):
"""Targeted ``tables=['orders']`` must skip 'customers' and 'events'
even though all three are due. Pre-fix bug: targeted trigger of
`kbc_job` re-ran every other due materialized row too because the
pass ignored the `tables` arg entirely."""
materialized_calls = []
def fake_mat(table_id, sql, bq, output_dir, max_bytes):
materialized_calls.append(table_id)
return {"rows": 1, "size_bytes": 100, "hash": "abc"}
with patch("app.api.sync._materialize_table", side_effect=fake_mat):
summary = _run_materialized_pass(MagicMock(), MagicMock(), tables=["orders"])
assert materialized_calls == ["orders"]
assert summary["materialized"] == ["orders"]
skipped_pairs = [(s["table"], s["reason"]) for s in summary["skipped"]]
assert ("customers", "not_in_target") in skipped_pairs
assert ("events", "not_in_target") in skipped_pairs
def test_targeted_trigger_matches_id_or_name(
fake_registry_with_three_materialized, monkeypatch
):
"""Operators may pass either the registry id or the human-friendly
name. Both forms should select the same row."""
monkeypatch.setattr("app.api.sync._materialize_table",
lambda **kw: {"rows": 0, "size_bytes": 0, "hash": "x"})
# By id
s1 = _run_materialized_pass(MagicMock(), MagicMock(), tables=["orders"])
assert s1["materialized"] == ["orders"]
# By name (same value here, but the lookup logic checks both columns)
s2 = _run_materialized_pass(MagicMock(), MagicMock(), tables=["events"])
assert s2["materialized"] == ["events"]
def test_no_target_processes_all_due_rows(fake_registry_with_three_materialized):
"""Backward compat: ``tables=None`` (no filter) keeps the original
behavior process every due materialized row."""
with patch("app.api.sync._materialize_table",
return_value={"rows": 0, "size_bytes": 0, "hash": "x"}):
summary = _run_materialized_pass(MagicMock(), MagicMock(), tables=None)
assert sorted(summary["materialized"]) == ["customers", "events", "orders"]
assert summary["skipped"] == []

View file

@ -88,6 +88,14 @@ class TestIsTableDueInterval:
def test_empty_last_sync_is_due(self) -> None:
assert is_table_due("every 15m", last_sync_iso="", now=NOW) is True
def test_every_0m_is_always_due(self) -> None:
# ``every 0m`` opts out of rate limiting — used to force-resync
# a row whose previous attempt errored without recording
# last_sync. Even a sync seconds ago must come back as due.
last_sync = (NOW - timedelta(seconds=5)).isoformat()
assert is_table_due("every 0m", last_sync_iso=last_sync, now=NOW) is True
assert is_table_due("every 0m", last_sync_iso=None, now=NOW) is True
def test_synced_10min_ago_every_15m_not_due(self) -> None:
last_sync = (NOW - timedelta(minutes=10)).isoformat()
assert is_table_due("every 15m", last_sync_iso=last_sync, now=NOW) is False

View file

@ -10,6 +10,7 @@ from src.scheduler import filter_due_tables, is_valid_schedule
# ---------------- is_valid_schedule -----------------------------------------
@pytest.mark.parametrize("schedule", [
"every 0m", # always-due (force-resync of an errored row)
"every 15m",
"every 1h",
"every 6h",
@ -23,7 +24,6 @@ def test_is_valid_schedule_accepts_documented_formats(schedule):
@pytest.mark.parametrize("schedule", [
"",
"every",
"every 0m", # zero is not a positive interval
"every 15s", # seconds not supported
"daily",
"daily 25:00", # invalid hour
@ -172,6 +172,9 @@ def test_run_sync_filters_local_tables_by_schedule(monkeypatch, tmp_path):
class _StubRegistry:
def __init__(self, conn): pass
def list_local(self, source_type=None): return list(fake_configs)
# `_run_sync` calls `list_all()` purely for an emptiness check on
# the auto-discover gate — content does not matter, only truthiness.
def list_all(self): return list(fake_configs)
def get(self, table_id):
return next((c for c in fake_configs if c["id"] == table_id), None)
@ -207,19 +210,26 @@ def test_run_sync_filters_local_tables_by_schedule(monkeypatch, tmp_path):
),
)
# Capture the configs that subprocess.run sees (via stdin payload).
# Capture the configs the extractor subprocess sees (via stdin
# payload). Hooks subprocess.Popen because _run_sync now uses Popen
# (with start_new_session=True so a timeout can SIGTERM the whole
# process group, including ProcessPoolExecutor workers spawned by
# the parallel legacy fallback).
captured = {}
def _fake_run(cmd, input, capture_output, text, timeout, env, cwd):
import json as _json
captured["configs"] = _json.loads(input)
class _R:
returncode = 0
stdout = "{}"
stderr = ""
return _R()
class _FakePopen:
def __init__(self, cmd, **kwargs):
self.cmd = cmd
self.returncode = 0
self.pid = 999
monkeypatch.setattr(sync_module.subprocess, "run", _fake_run)
def communicate(self, input=None, timeout=None):
import json as _json
if input is not None:
captured["configs"] = _json.loads(input)
return ("{}", "")
monkeypatch.setattr(sync_module.subprocess, "Popen", _FakePopen)
# Stub orchestrator + profiler imports inside the function so we don't
# require a real DuckDB analytics file.
@ -263,6 +273,9 @@ def test_run_sync_does_not_auto_discover_when_filter_returns_empty(monkeypatch,
class _StubRegistry:
def __init__(self, conn): pass
def list_local(self, source_type=None): return list(fake_configs)
# `_run_sync` calls `list_all()` purely for an emptiness check on
# the auto-discover gate — content does not matter, only truthiness.
def list_all(self): return list(fake_configs)
def get(self, table_id):
return next((c for c in fake_configs if c["id"] == table_id), None)

View file

@ -52,7 +52,11 @@ def test_run_materialized_pass_dispatches_keboola_to_keboola_extractor(
repo.register(
id="orders_recent", name="orders_recent",
source_type="keboola", query_mode="materialized",
source_query='SELECT * FROM kbc."in.c-sales"."orders" WHERE 1=1',
bucket="in.c-sales", source_table="orders",
# Storage API filter spec (replaces the old SQL string after the
# extension → Storage API rewrite). Empty filter = full table; a
# whereFilters array narrows down on the Keboola side.
source_query='{"where_filters": [{"column": "status", "operator": "eq", "values": ["open"]}]}',
sync_schedule="every 1m", # always due
)
@ -149,7 +153,9 @@ def test_run_sync_runs_materialized_pass_on_keboola_only_instance(
name="kb_aggregated",
source_type="keboola",
query_mode="materialized",
source_query="SELECT 1 AS x",
bucket="in.c-test", source_table="kb_aggregated",
# Empty source_query = full-table export via Storage API.
source_query=None,
registered_by="admin@test",
)

View file

@ -232,7 +232,7 @@ def test_run_sync_keboola_timeout_does_not_skip_materialized(tmp_path, monkeypat
materialized_called = {"count": 0}
orchestrator_called = {"count": 0}
def _spy_materialized(_conn, _bq):
def _spy_materialized(_conn, _bq, *, tables=None):
materialized_called["count"] += 1
return {"materialized": ["m1"], "skipped": [], "errors": []}
@ -302,7 +302,7 @@ def test_run_sync_runs_materialized_pass_on_bq_only_deployment(
materialized_called = {"count": 0}
orchestrator_called = {"count": 0}
def _spy_materialized_pass(_conn, _bq):
def _spy_materialized_pass(_conn, _bq, *, tables=None):
materialized_called["count"] += 1
return {"materialized": ["m1"], "skipped": [], "errors": []}

View file

@ -0,0 +1,224 @@
"""Process-wide singleton guard around POST /api/sync/trigger.
Without it, two near-simultaneous trigger calls each launch their own
extractor subprocess, both write `extract.duckdb`, fight for the file
lock, starve uvicorn, and Docker flips the container to `unhealthy`.
These tests cover the trigger handler's 409 fast-fail (the
operator-visible behavior) and the in-`_run_sync` defense-in-depth
(if something bypasses the handler).
"""
from unittest.mock import patch
import pytest
from app.api import sync as sync_module
@pytest.fixture(autouse=True)
def reset_sync_lock():
"""Make sure each test starts with a free lock — and never leaves one
held even if an assertion fires mid-test."""
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
yield
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
def test_run_sync_skips_when_lock_held(capsys):
"""When `_sync_lock` is already held, `_run_sync` no-ops with a log
line instead of starting a second extractor subprocess."""
sync_module._sync_lock.acquire()
# Patch the heavy parts so a successful path would otherwise execute.
# Reaching any of them while the lock is held would be the bug.
with patch("app.api.sync.subprocess.run") as run_mock, \
patch("app.instance_config.get_data_source_type") as src_mock:
sync_module._run_sync(tables=None)
assert not run_mock.called, "extractor subprocess must not run when lock is held"
assert not src_mock.called, "_run_sync must short-circuit before reading config"
captured = capsys.readouterr()
assert "another sync is already in flight" in captured.err
def test_run_sync_releases_lock_on_exception():
"""Even if the body throws, the lock must release so the next sync can
run. Asserts the `finally:` covers all exit paths.
`_run_sync` imports `get_data_source_type` lazily inside the body, so
we patch the source module rather than the re-export in `app.api.sync`.
"""
with patch(
"app.instance_config.get_data_source_type",
side_effect=RuntimeError("boom"),
):
# Should not raise — `_run_sync` catches and logs
sync_module._run_sync(tables=None)
assert not sync_module._sync_lock.locked()
def test_trigger_endpoint_returns_409_when_locked():
"""Handler-level fast-fail: when a sync is already running, the
trigger endpoint returns 409 without scheduling a second background
task."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
# Stand up a minimal app exposing only the sync router. Bypass auth
# by overriding the require_admin dependency.
from app.auth.access import require_admin
app = FastAPI()
app.include_router(sync_module.router)
app.dependency_overrides[require_admin] = lambda: {"id": "test", "email": "t@e"}
client = TestClient(app)
sync_module._sync_lock.acquire()
try:
resp = client.post("/api/sync/trigger")
assert resp.status_code == 409
assert resp.json()["detail"] == "sync_already_in_progress"
finally:
sync_module._sync_lock.release()
def test_trigger_endpoint_succeeds_when_lock_free():
"""When the lock is free, the trigger endpoint schedules the
background task and returns 200. The background task itself doesn't
execute synchronously in TestClient that's how FastAPI background
tasks work so we patch `_run_sync` to a no-op and only assert the
handler shape."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
from app.auth.access import require_admin
app = FastAPI()
app.include_router(sync_module.router)
app.dependency_overrides[require_admin] = lambda: {"id": "test", "email": "t@e"}
client = TestClient(app)
with patch("app.api.sync._run_sync") as run_mock:
resp = client.post("/api/sync/trigger")
assert resp.status_code == 200
body = resp.json()
assert body["status"] == "triggered"
# BackgroundTasks runs after response; TestClient awaits them
run_mock.assert_called_once()
# ---- body shape acceptance -------------------------------------------------
def _make_client():
"""Stand up a minimal FastAPI app exposing the sync router with auth
bypassed and `_run_sync` mocked. Returns (client, run_mock)."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
from app.auth.access import require_admin
app = FastAPI()
app.include_router(sync_module.router)
app.dependency_overrides[require_admin] = lambda: {"id": "test", "email": "t@e"}
return TestClient(app)
@pytest.mark.parametrize("body,expected_tables", [
(None, None), # no body
([], []), # empty array
(["kbc_job"], ["kbc_job"]), # bare array
(["a", "b", "c"], ["a", "b", "c"]),
({"tables": None}, None), # explicit null
({"tables": []}, []),
({"tables": ["kbc_job"]}, ["kbc_job"]), # object form
({"tables": ["a", "b"], "extra": "x"}, ["a", "b"]), # extra keys ignored
])
def test_trigger_accepts_both_body_shapes(body, expected_tables):
"""Both ``["x", "y"]`` and ``{"tables": ["x", "y"]}`` (and `null` /
no body) reach `_run_sync` with the same `tables` arg. Lets older
clients (raw array) and newer ones (object matching the response
payload shape) both work."""
client = _make_client()
with patch("app.api.sync._run_sync") as run_mock:
if body is None:
resp = client.post("/api/sync/trigger")
else:
resp = client.post("/api/sync/trigger", json=body)
assert resp.status_code == 200, resp.text
run_mock.assert_called_once_with(expected_tables)
@pytest.mark.parametrize("bad_body", [
"kbc_job", # bare string
42, # number
{"tables": "kbc_job"}, # tables as string, not array
{"tables": [1, 2, 3]}, # tables entries not strings
[1, 2, 3], # array of ints
[{"id": "x"}], # array of objects
])
def test_trigger_rejects_malformed_bodies(bad_body):
"""Anything that isn't a list-of-strings, an object with a
list-of-strings under `tables`, or null/missing returns 422 with a
structured detail never silently treated as 'sync everything'."""
client = _make_client()
with patch("app.api.sync._run_sync") as run_mock:
resp = client.post("/api/sync/trigger", json=bad_body)
assert resp.status_code == 422, resp.text
assert not run_mock.called
# ---- /api/sync/status (auto-upgrade defer probe) --------------------------
def test_sync_status_unlocked_returns_locked_false():
"""Default state: no sync running → ``{"locked": false}``. No auth
required (host-side cron probes from outside the auth boundary)."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
app = FastAPI()
app.include_router(sync_module.router)
client = TestClient(app)
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
resp = client.get("/api/sync/status")
assert resp.status_code == 200
assert resp.json() == {"locked": False}
def test_sync_status_locked_returns_locked_true():
"""Held lock → ``{"locked": true}``. agnes-auto-upgrade.sh greps for
`"locked":true` to defer the recreate, so the wire format must be
exactly this shape."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
app = FastAPI()
app.include_router(sync_module.router)
client = TestClient(app)
sync_module._sync_lock.acquire()
try:
resp = client.get("/api/sync/status")
assert resp.status_code == 200
body = resp.json()
assert body == {"locked": True}
finally:
sync_module._sync_lock.release()
def test_sync_status_does_not_require_auth():
"""No `require_admin` / `get_current_user` dependency — the host's
cron has no PAT and shouldn't need one for a status check."""
from fastapi.testclient import TestClient
from fastapi import FastAPI
app = FastAPI()
app.include_router(sync_module.router)
client = TestClient(app)
# No auth headers at all.
resp = client.get("/api/sync/status")
assert resp.status_code == 200