agnes-the-ai-analyst/tests/test_sync_trigger_materialized.py
ZdenekSrotyr 28430ced09
Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190)
* fix: cutover regressions + parallel Keboola legacy fallback

Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.

CUTOVER REGRESSIONS

- agnes pull hash mismatch on every Keboola local-mode table —
  src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
  while the CLI compares against full 32-char content MD5. Now stores
  the same content MD5 the materialized SQL path already used.

- Trailing-slash sanitization in connectors/keboola/access.py and
  extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
  ends in / (canonical form).

- src/profiler.py:TableInfo.description becomes optional — two call
  sites instantiated without it, crashing the profiler pass.

- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
  ran as root, current runs as agnes (uid 999). Reads target uid:gid
  from /etc/passwd inside the new image and chowns ${STATE_DIR},
  /data/extracts, /data/analytics when the digest moves.

- POST /api/sync/trigger is now singleton per process — two
  near-simultaneous trigger calls each forked an extractor subprocess,
  fought for extract.duckdb's file lock, starved uvicorn, flipped the
  container to unhealthy. Trigger now returns 409
  (sync_already_in_progress) when held; _run_sync acquires non-blocking.

PARALLEL LEGACY FALLBACK

- Process pool fan-out for the _extract_via_legacy queue (default 8
  workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
  thread pool, because connectors/keboola/client.py:export_table does
  os.chdir(temp_dir) — process-global, so threads raced and slice files
  landed in the wrong directory ("[Errno 2] No such file or directory:
  '<job_id>.csv_X_Y_Z.csv'").

- Extractor subprocess timeout 1800s -> 3600s (configurable via
  AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
  jobs need the headroom on telemetry-class projects.

- Process group cleanup on timeout — Popen(start_new_session=True) puts
  the extractor in its own group. On timeout the parent SIGTERMs the
  group (10s grace) then SIGKILLs stragglers. Without this, the pool
  workers were reparented to PID 1 and continued holding open Keboola
  Storage export jobs. Inline extractor script also installs a SIGTERM
  -> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
  __exit__ runs cleanly.

Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).

Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).

* feat(keboola): Storage API direct extract path; drop extension data path

The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.

connectors/keboola/storage_api.py (NEW)
  Lightweight client built on requests.Session. Three endpoints:
  - POST /v2/storage/tables/{id}/export-async        (kicks off job)
  - GET  /v2/storage/jobs/{id}                        (poll until done)
  - GET  /v2/storage/files/{id}?federationToken=1     (signed URL detail)
  - GET  <signed_url>                                 (download bytes)
  Supports sliced exports (manifest + per-slice signed URLs) and gzipped
  payloads. ExportFilter dataclass mirrors the Keboola filter spec
  (whereFilters / columns / changedSince / limit) and handles JSON
  round-trip with the registry's source_query column. Token redaction
  in error messages. Bounded exponential backoff on job polling.
  No cloud-SDK dependency on the data path; thread-safe.

connectors/keboola/extractor.py
  - materialize_query() rewritten: takes bucket/source_table/source_query
    (JSON filter spec), exports via KeboolaStorageClient, converts CSV
    to parquet via DuckDB, atomic os.replace. Same return shape so
    sync.py downstream code stays uniform with the BQ branch.
  - _extract_via_legacy() also moved to Storage API direct (kept the
    name for caller compatibility with _legacy_worker / the parallel
    batch extractor). Per-call temp directories — no os.chdir, threads
    don't race.

app/api/sync.py
  _run_materialized_pass for source_type='keboola' rows now constructs a
  KeboolaStorageClient (replaces KeboolaAccess) and passes
  bucket/source_table/source_query to materialize_query. Reuses one
  client across rows for HTTP keep-alive. Sources keboola URL from env
  too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
  configured.

cli/commands/admin.py
  discover-and-register defaults Keboola rows to query_mode='materialized'
  (NULL source_query = full table), matching the v26 migration's
  unification of the local/materialized split for Keboola. BigQuery and
  Jira keep their per-source defaults.

src/db.py
  Schema bump 25 → 26. Migration: UPDATE table_registry SET
  query_mode='materialized' WHERE source_type='keboola' AND
  query_mode='local'. NULL source_query on those rows means "full table
  export" — same effective behavior the local mode provided, but now
  via Storage API instead of the extension.

pyproject.toml
  kbcstorage dep stays (admin-side bucket/table list still uses the
  SDK in app/api/admin.py / connectors/keboola/client.py); only the
  data path is migrated off the SDK. Comment updated to reflect the
  new boundary.

tests
  - test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
    HTTP client (token redaction, retry logic, polling), download_file
    (single, gzipped, sliced), end-to-end export_table_to_csv.
  - test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
    instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
    contracts.
  - test_sync_trigger_keboola_materialized.py: registry rows now carry
    bucket+source_table+JSON-shape source_query.

114+ Keboola-impacted tests green locally.

* test: schema version assertion bumped to 26 alongside the keboola query_mode migration

* fix(keboola): cutover hot-patches surfaced on agnes-dev

Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.

- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
  source, any mode), not just rows where source matches and query_mode
  is local. After the v25→v26 keboola materialized migration an
  instance can have 30 materialized rows and zero local rows; the
  previous gate kept re-firing _discover_and_register_tables every
  scheduler tick, creating duplicate auto-discovered rows with the
  wrong bucket prefix every time.

- app/api/admin.py: _discover_and_register_tables reassembles the
  bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
  dropping the stage prefix; default query_mode for keboola is now
  materialized (the v26 contract); validator allows NULL source_query
  for keboola materialized rows (full-table export via Storage API
  export-async, no SQL needed).

- cli/commands/admin.py: register-table mirrors the server validator
  (NULL source_query allowed for source_type=keboola); --bucket help
  text generalized to cover both BQ dataset and Keboola bucket id.

- connectors/keboola/extractor.py: max_line_size=64 MiB on
  read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
  in particular) do not trip the default 2 MiB ceiling.

- connectors/keboola/storage_api.py: GCP backend support — when the
  Storage API returns a manifest whose slice URLs are gs://
  references with a gcsCredentials block, rewrite to the JSON REST
  download endpoint and authenticate with the issued OAuth bearer
  token; redact tokens in any surfaced error string.

* test: align with new keboola materialized + auto-discover-gate contracts

- test_admin_keboola_materialized: rename
  test_register_keboola_materialized_rejects_missing_source_query →
  test_register_keboola_materialized_accepts_missing_source_query.
  v25→v26 introduced 'keboola materialized with NULL source_query
  means full-table export via Storage API export-async' as the
  default registration shape; the rejection case is no longer the
  contract.

- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
  gate in _run_sync now keys off the WHOLE registry (not just local
  rows) so materialized-only Keboola instances do not re-trigger
  discovery on every tick.

* feat(keboola): native parquet export — skip CSV roundtrip

Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.

Two cases:

- **Single-file** export (small table): file_info.url points at one
  signed URL; download_file streams chunks straight to .parquet.tmp
  and we're done. No DuckDB.

- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
  default — so anything larger arrives as N parquet slices): each
  slice is a complete parquet file with its own footer; naive concat
  would corrupt them. download_file_slices keeps the slices as
  separate files in a tempdir, then DuckDB COPY (SELECT * FROM
  read_parquet([slice0, slice1, ...])) merges them into one
  consolidated parquet. DuckDB streams row groups during this — peak
  memory bounded to one row group (~1 MiB) regardless of source size.

The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.

Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.

* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes

Three small self-contained fixes uncovered during agnes-dev cutover.

- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
  ignore_cleanup_errors=True so a worker death mid-write doesn't leave
  multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
  disk-full crash where TemporaryDirectory's own cleanup also raised
  and got swallowed.)

- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
  = always due). Force-resync of an errored row no longer requires
  waiting out the default 'every 1h' interval — admin can flip the
  schedule, trigger, then flip back.

- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
  (legacy bare-array body) and {'tables': ['table_id']} (matches the
  response payload shape, more discoverable for clients building
  requests by hand). Malformed bodies return 422 with a structured
  detail; null/missing means 'sync everything' as before.

Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).

* fix: targeted-trigger filter in materialized pass + auto-upgrade defer

Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.

- _run_materialized_pass now takes a 'tables' arg and skips rows not in
  the target set with reason='not_in_target'. POST /api/sync/trigger
  with a body of tables previously only scoped the legacy extractor
  subprocess — the materialized pass kept iterating every due
  materialized row, so an admin asking to re-sync kbc_job re-ran
  every other due materialized row alongside it. Match on registry id
  OR name (admins commonly pass either form). tables=None preserves
  the no-filter behavior.

- New GET /api/sync/status (public, no auth) returns {locked: bool}
  off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
  docker compose up -d and exits 0 with a 'deferred recreate' log
  line if a sync is in flight — the next 5-min cron tick retries.
  Pre-fix, an auto-upgrade triggered mid-sync would recreate the
  uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
  download (observed when kbc_job's first 7-day retry got SIGKILLed).
  Connection failures in the probe fall through to the upgrade —
  being stuck on a wedged image is worse than interrupting a
  hypothetical sync.

* fix: auto-discover protects admin overrides + surfaces drift

Two real-world incidents on agnes-dev drove this:

1. kbc_job was registered manually with the correct
   (in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
   re-run would have inserted a SECOND kbc_job row at the slugified
   id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
   places it) — and that row's Storage API export-async 404s.

2. An earlier auto-discover bug stripped the stage prefix from
   bucket ids ('c-finance' instead of 'in.c-finance'), inserting
   137 rows whose syncs all failed.

Fix:

- _discover_and_register_tables now builds a plan first
  (_build_keboola_discovery_plan) classifying each discovered table
  into one of new / existing_match / existing_drift / invalid, then
  executes only the 'new' bucket. Drift rows are reported with both
  sides of the disagreement plus drift_kind:
  - same_id_diff_coords: registry has the same id but different
    bucket / source_table (admin migrated coords inline).
  - name_collision: discovery's slugified id differs from any
    registry id, but the discovered .name matches an existing row's
    .name (case-insensitive). Catches the kbc_job case.

- Bucket detection now prefers the API's authoritative bucket_id
  field (separate field on the Keboola tables.list response,
  normalised by KeboolaClient.discover_all_tables). Falls back to
  id-string parsing only when bucket_id is missing (older fallback
  path inside discover_all_tables).

- Endpoint POST /api/admin/discover-and-register?dry_run=true
  returns the plan without writing — would_register, drift,
  invalid lists. Lets an operator audit before merging discovery
  with a registry that has admin overrides.

Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.

* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp

The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.

connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.

docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
2026-05-07 12:12:14 +02:00

485 lines
17 KiB
Python

"""_run_materialized_pass walks table_registry for materialized BQ rows
and runs each that is due via _materialize_table.
Tests inject a stub BqAccess (factories never called by these tests since
_materialize_table is patched) and assert that scheduling, error
aggregation, sync_state hash, and the disable-sentinel all behave
correctly.
"""
import duckdb
import pytest
from contextlib import contextmanager
from pathlib import Path
from unittest.mock import patch, MagicMock
from src.db import _ensure_schema
from src.repositories.table_registry import TableRegistryRepository
from src.repositories.sync_state import SyncStateRepository
from connectors.bigquery.access import BqAccess, BqProjects
@pytest.fixture
def system_db(tmp_path, monkeypatch):
db_path = tmp_path / "system.duckdb"
conn = duckdb.connect(str(db_path))
_ensure_schema(conn)
monkeypatch.setenv("DATA_DIR", str(tmp_path / "data"))
yield conn
conn.close()
@pytest.fixture
def stub_bq():
"""A BqAccess instance that the tests don't actually exercise (the test
patches `_materialize_table`); just needs to be a valid BqAccess so the
type contract doesn't break."""
@contextmanager
def _session(_p):
conn = duckdb.connect(":memory:")
try:
yield conn
finally:
conn.close()
return BqAccess(
BqProjects(billing="t", data="t"),
client_factory=lambda _p: MagicMock(),
duckdb_session_factory=_session,
)
def test_materialized_pass_calls_materialize_for_due_rows(system_db, stub_bq, tmp_path):
repo = TableRegistryRepository(system_db)
repo.register(
id="orders_90d", name="orders_90d",
source_type="bigquery", query_mode="materialized",
source_query="SELECT 1 AS n",
sync_schedule="every 1m", # always due in tests (no prior sync)
)
# Pre-create the parquet so _file_hash returns non-empty
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
(parquet_dir / "orders_90d.parquet").write_bytes(
b"PAR1" + b"\x00" * 16 + b"PAR1"
)
from app.api import sync as sync_mod
with patch("app.api.sync._materialize_table") as mock_mat:
mock_mat.return_value = {
"rows": 1, "size_bytes": 100, "query_mode": "materialized",
}
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
mock_mat.assert_called_once()
call_kwargs = mock_mat.call_args.kwargs
assert call_kwargs["table_id"] == "orders_90d"
assert "SELECT 1 AS n" in call_kwargs["sql"]
assert call_kwargs["bq"] is stub_bq
# Default cap (10 GiB) flows through when no instance.yaml override
assert call_kwargs["max_bytes"] == 10 * 2**30
assert "orders_90d" in summary["materialized"]
assert not summary["errors"]
def test_materialized_pass_skips_undue_rows(system_db, stub_bq):
repo = TableRegistryRepository(system_db)
repo.register(
id="orders_daily", name="orders_daily",
source_type="bigquery", query_mode="materialized",
source_query="SELECT 1",
sync_schedule="daily 03:00",
)
state = SyncStateRepository(system_db)
state.update_sync(
table_id="orders_daily", rows=1, file_size_bytes=10, hash="x",
)
from app.api import sync as sync_mod
with patch("app.api.sync._materialize_table") as mock_mat:
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
mock_mat.assert_not_called()
# summary["skipped"] is now list[dict] — see PR zs/materialize-sync-fix
assert {"table": "orders_daily", "reason": "due_check"} in summary["skipped"]
def test_materialized_pass_skips_non_materialized_rows(system_db, stub_bq):
repo = TableRegistryRepository(system_db)
repo.register(id="t1", name="t1", source_type="keboola", query_mode="local")
repo.register(id="t2", name="t2", source_type="bigquery", query_mode="remote")
from app.api import sync as sync_mod
with patch("app.api.sync._materialize_table") as mock_mat:
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
mock_mat.assert_not_called()
assert summary == {"materialized": [], "skipped": [], "errors": []}
def test_materialized_pass_collects_errors_per_row(system_db, stub_bq, tmp_path):
"""One row failing must not stop a healthy sibling."""
repo = TableRegistryRepository(system_db)
repo.register(
id="ok", name="ok", source_type="bigquery",
query_mode="materialized", source_query="SELECT 1",
sync_schedule="every 1m",
)
repo.register(
id="bad", name="bad", source_type="bigquery",
query_mode="materialized", source_query="SELECT broken",
sync_schedule="every 1m",
)
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
(parquet_dir / "ok.parquet").write_bytes(b"PAR1" + b"\x00" * 16 + b"PAR1")
from app.api import sync as sync_mod
def _fake(table_id, sql, bq, output_dir, max_bytes):
if table_id == "bad":
raise RuntimeError("simulated COPY failure")
return {"rows": 1, "size_bytes": 100, "query_mode": "materialized"}
with patch("app.api.sync._materialize_table", side_effect=_fake):
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
assert summary["materialized"] == ["ok"]
assert len(summary["errors"]) == 1
assert summary["errors"][0]["table"] == "bad"
assert "simulated" in summary["errors"][0]["error"]
def test_materialized_pass_records_parquet_hash(system_db, stub_bq, tmp_path):
"""sync_state.hash must be the MD5 of the parquet file — otherwise the
manifest reports an empty hash and every agnes pull re-downloads."""
repo = TableRegistryRepository(system_db)
repo.register(
id="hashed", name="hashed",
source_type="bigquery", query_mode="materialized",
source_query="SELECT 1",
sync_schedule="every 1m",
)
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
parquet_path = parquet_dir / "hashed.parquet"
def _fake(**kwargs):
parquet_path.write_bytes(b"PAR1" + b"\x00" * 16 + b"PAR1")
return {"rows": 1, "size_bytes": 24, "query_mode": "materialized"}
from app.api import sync as sync_mod
with patch("app.api.sync._materialize_table", side_effect=_fake):
sync_mod._run_materialized_pass(system_db, stub_bq)
state = SyncStateRepository(system_db)
row = state.get_table_state("hashed")
assert row is not None
import hashlib
expected = hashlib.md5(b"PAR1" + b"\x00" * 16 + b"PAR1").hexdigest()
assert row["hash"] == expected
def test_run_sync_keboola_timeout_does_not_skip_materialized(tmp_path, monkeypatch):
"""REGRESSION (Devin BUG_0001 on 2219255): when the Keboola extractor
subprocess raises TimeoutExpired, the materialized BQ pass and the
orchestrator rebuild must still fire. Pre-fix the timeout propagated
to the outer except handler and skipped the rest of _run_sync — on
a dual-source deployment a slow Keboola extractor would silently
block all materialized parquets and the master-view rebuild until
the next trigger."""
import duckdb
import subprocess as _sp
from src.db import _ensure_schema
db_path = tmp_path / "system.duckdb"
conn = duckdb.connect(str(db_path))
_ensure_schema(conn)
repo = TableRegistryRepository(conn)
# One Keboola row (drives the extractor subprocess) + one materialized
# BQ row (must still run after the Keboola timeout).
repo.register(
id="kbc_a", name="kbc_a", source_type="keboola",
query_mode="local", bucket="in.c-foo", source_table="a",
)
repo.register(
id="m1", name="m1", source_type="bigquery",
query_mode="materialized",
source_query="SELECT 1",
sync_schedule="every 1m",
)
conn.close()
monkeypatch.setenv("DATA_DIR", str(tmp_path / "data"))
monkeypatch.setenv("KEBOOLA_STACK_URL", "https://example.invalid")
monkeypatch.setenv("KEBOOLA_STORAGE_TOKEN", "fake")
from app.api import sync as sync_mod
# Subprocess raises TimeoutExpired the moment _run_sync calls it.
def _timeout(*a, **kw):
raise _sp.TimeoutExpired(cmd=["fake"], timeout=1)
monkeypatch.setattr(sync_mod.subprocess, "run", _timeout)
materialized_called = {"count": 0}
orchestrator_called = {"count": 0}
def _spy_materialized(_conn, _bq, *, tables=None):
materialized_called["count"] += 1
return {"materialized": ["m1"], "skipped": [], "errors": []}
class _OrchStub:
def rebuild(self):
orchestrator_called["count"] += 1
return {}
monkeypatch.setattr("app.api.sync._run_materialized_pass", _spy_materialized)
monkeypatch.setattr(
"src.orchestrator.SyncOrchestrator", lambda *a, **kw: _OrchStub(),
)
monkeypatch.setattr(
"app.instance_config.get_data_source_type", lambda: "keboola",
)
monkeypatch.setattr(
"app.instance_config.get_value",
lambda *args, **kw: (
"my-bq-proj" if (args and args[-1] == "project")
else kw.get("default", "")
),
)
sync_mod._run_sync()
assert materialized_called["count"] == 1, (
"materialized pass must run even when Keboola subprocess timed out"
)
assert orchestrator_called["count"] == 1, (
"orchestrator rebuild must run after Keboola timeout to publish "
"any partial / materialized parquets that did land"
)
def test_run_sync_runs_materialized_pass_on_bq_only_deployment(
tmp_path, monkeypatch,
):
"""REGRESSION (Devin BUG_0002 on 2fa44f2): on BigQuery-only deployments
`list_local('bigquery')` is always empty (BQ rows are remote or
materialized, never local). The pre-fix _run_sync early-returned in
that case → materialized pass + orchestrator rebuild were dead code.
Post-fix: run_extractor_subprocess flag skips just the Keboola
subprocess, and the materialized pass still fires."""
import duckdb
from src.db import _ensure_schema
db_path = tmp_path / "system.duckdb"
conn = duckdb.connect(str(db_path))
_ensure_schema(conn)
repo = TableRegistryRepository(conn)
# Materialized BQ row — would be invisible to list_local('bigquery').
repo.register(
id="m1", name="m1", source_type="bigquery",
query_mode="materialized",
source_query="SELECT 1",
sync_schedule="every 1m",
)
conn.close()
monkeypatch.setenv("DATA_DIR", str(tmp_path / "data"))
# Patch the heavy collaborators so we observe what _run_sync invoked
# without actually running BQ / orchestrator.
from app.api import sync as sync_mod
materialized_called = {"count": 0}
orchestrator_called = {"count": 0}
def _spy_materialized_pass(_conn, _bq, *, tables=None):
materialized_called["count"] += 1
return {"materialized": ["m1"], "skipped": [], "errors": []}
class _OrchStub:
def rebuild(self):
orchestrator_called["count"] += 1
return {}
monkeypatch.setattr(
"app.api.sync._run_materialized_pass",
_spy_materialized_pass,
)
monkeypatch.setattr(
"src.orchestrator.SyncOrchestrator",
lambda *a, **kw: _OrchStub(),
)
# Pretend instance.yaml says data_source.type=bigquery
monkeypatch.setattr(
"app.instance_config.get_data_source_type",
lambda: "bigquery",
)
# bq_project must be truthy so the materialized pass branch fires.
real_get_value = sync_mod.__dict__.get("get_value")
monkeypatch.setattr(
"app.instance_config.get_value",
lambda *args, **kw: (
"my-bq-proj" if (args and args[-1] == "project")
else kw.get("default", "")
),
)
sync_mod._run_sync()
assert materialized_called["count"] == 1, (
"materialized pass must run on BQ-only deployment (no local rows)"
)
assert orchestrator_called["count"] == 1, (
"orchestrator rebuild must run so materialized parquets are picked up"
)
@pytest.mark.parametrize("yaml_value, expected_max", [
(10737418240, 10737418240), # int — canonical
(10737418240.0, 10737418240), # float — YAML often parses as float
(1e10, 10000000000), # scientific notation
("10737418240", 10737418240), # string — coerced
(0, None), # explicit disable sentinel
(None, None), # missing key
("not-a-number", None), # malformed → fail-open + warn
])
def test_materialized_pass_max_bytes_yaml_coercion(
system_db, stub_bq, tmp_path, monkeypatch, yaml_value, expected_max,
):
"""`max_bytes_per_materialize` YAML value is coerced to int regardless of
the YAML scalar type (int / float / scientific / string). Devin found
that an `isinstance(raw, int)` guard silently disabled the guardrail
on float values."""
repo = TableRegistryRepository(system_db)
repo.register(
id="t", name="t", source_type="bigquery",
query_mode="materialized", source_query="SELECT 1",
sync_schedule="every 1m",
)
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
(parquet_dir / "t.parquet").write_bytes(b"PAR1" + b"\x00" * 16 + b"PAR1")
captured = {}
def _spy(table_id, sql, bq, output_dir, max_bytes):
captured["max_bytes"] = max_bytes
return {"rows": 1, "size_bytes": 100, "query_mode": "materialized"}
from app.api import sync as sync_mod
with patch(
"app.instance_config.get_value",
side_effect=lambda *a, **kw: (
yaml_value if a[-1] == "max_bytes_per_materialize"
else kw.get("default", "")
),
), patch("app.api.sync._materialize_table", side_effect=_spy):
sync_mod._run_materialized_pass(system_db, stub_bq)
assert captured["max_bytes"] == expected_max
def test_materialized_pass_keys_sync_state_by_name_not_id(
system_db, stub_bq, tmp_path,
):
"""Devin review: when admin registers a name with mixed case (e.g.
"Orders_90d") the slug-derived id ("orders_90d") differs from name.
sync_state must be keyed by `name` so the manifest's `registry_by_name`
lookup resolves and `query_mode='materialized'` flows through to the
client. Otherwise CLI sees `query_mode='local'` and downloads the
wrong file or skips the row."""
repo = TableRegistryRepository(system_db)
# Mixed-case name — id will be slugified to lowercase by the API path,
# but at the repo level we control both directly.
repo.register(
id="orders_90d", name="Orders_90d",
source_type="bigquery", query_mode="materialized",
source_query="SELECT 1",
sync_schedule="every 1m",
)
# Pre-create the parquet at the NAME-keyed path.
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
(parquet_dir / "Orders_90d.parquet").write_bytes(
b"PAR1" + b"\x00" * 16 + b"PAR1"
)
from app.api import sync as sync_mod
captured = {}
def _spy(table_id, sql, bq, output_dir, max_bytes):
captured["table_id"] = table_id
return {"rows": 1, "size_bytes": 100, "query_mode": "materialized"}
with patch("app.api.sync._materialize_table", side_effect=_spy):
sync_mod._run_materialized_pass(system_db, stub_bq)
# materialize_query was called with the NAME, not the id.
assert captured["table_id"] == "Orders_90d"
# sync_state row keyed by name.
state = SyncStateRepository(system_db)
name_row = state.get_table_state("Orders_90d")
id_row = state.get_table_state("orders_90d")
assert name_row is not None, "sync_state should be keyed by name"
assert id_row is None, "sync_state should NOT be keyed by id"
def test_materialized_pass_zero_max_bytes_disables_guardrail(
system_db, stub_bq, tmp_path, monkeypatch
):
"""`max_bytes_per_materialize: 0` in instance.yaml → None passed downstream
so materialize_query skips the dry-run entirely."""
repo = TableRegistryRepository(system_db)
repo.register(
id="big", name="big", source_type="bigquery",
query_mode="materialized", source_query="SELECT 1",
sync_schedule="every 1m",
)
parquet_dir = tmp_path / "data" / "extracts" / "bigquery" / "data"
parquet_dir.mkdir(parents=True, exist_ok=True)
(parquet_dir / "big.parquet").write_bytes(
b"PAR1" + b"\x00" * 16 + b"PAR1"
)
monkeypatch.setattr(
"app.api.sync.get_value",
lambda *args, **kwargs: 0 if args[-1] == "max_bytes_per_materialize" else "",
raising=False,
)
from app.api import sync as sync_mod
captured = {}
def _spy(**kwargs):
captured.update(kwargs)
return {"rows": 1, "size_bytes": 100, "query_mode": "materialized"}
# The function reads `get_value` via a local import in the body — patch
# the import target instead.
with patch(
"app.instance_config.get_value",
side_effect=lambda *args, **kw: (
0 if args[-1] == "max_bytes_per_materialize"
else kw.get("default", "")
),
), patch("app.api.sync._materialize_table", side_effect=_spy):
sync_mod._run_materialized_pass(system_db, stub_bq)
assert captured["max_bytes"] is None