* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
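The process-group cleanup above can be sketched roughly as follows (POSIX only). The function name `run_with_group_timeout` and its parameters are assumptions for illustration; only `Popen(start_new_session=True)`, the SIGTERM-then-SIGKILL order, and the 10s grace come from the description:

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout, grace=10.0):
    """Run cmd in its own session; on timeout, signal the whole group."""
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # start_new_session=True makes the child a group leader,
        # so its process group id equals its pid.
        pgid = proc.pid
        os.killpg(pgid, signal.SIGTERM)
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            pass
        try:
            # Kill anything in the group that ignored SIGTERM.
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already gone
        return proc.wait()
```

Signalling the group rather than the single pid is the point: pool workers forked by the extractor share its group, so they are not left behind for PID 1 to adopt.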
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three Storage API endpoints plus the signed-URL download:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
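A sketch of bounded exponential backoff for job polling, under stated assumptions: `poll_job`, its parameters, and the terminal status strings `"success"` / `"error"` are illustrative and not taken from the actual client. `fetch` stands in for the GET on the job endpoint:

```python
import time
from typing import Callable, Dict


def poll_job(
    fetch: Callable[[], Dict],
    *,
    initial: float = 1.0,
    factor: float = 2.0,
    cap: float = 30.0,
    max_wait: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal status or max_wait elapses."""
    deadline = time.monotonic() + max_wait
    delay = initial
    while True:
        job = fetch()
        if job.get("status") in ("success", "error"):  # assumed terminal states
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within max_wait")
        sleep(delay)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)
```

Injecting `sleep` keeps the loop testable without real waiting; the cap is what makes the backoff "bounded".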
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
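The bucket reassembly in the admin.py bullet can be illustrated with a small id-parsing sketch. It assumes table ids of the form `<stage>.<bucket-id>.<table>` with no dots inside any part; `split_table_id` is a hypothetical helper, not the function in the codebase:

```python
from typing import Tuple


def split_table_id(table_id: str) -> Tuple[str, str]:
    """Split 'in.c-finance.orders' into ('in.c-finance', 'orders')."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected <stage>.<bucket-id>.<table>, got {table_id!r}")
    stage, bucket, table = parts
    # Keep the stage prefix: the bucket is '<stage>.<bucket-id>'.
    return f"{stage}.{bucket}", table
```

Dropping the stage (returning `c-finance` instead of `in.c-finance`) is exactly the shape of the bug the bullet describes.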
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
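A sketch of name-suffix-only gzip detection, assuming the downloaded file's name is available alongside its bytes. `maybe_gunzip` is an illustrative helper name, not the one used in download_file:

```python
import gzip


def maybe_gunzip(file_name: str, payload: bytes) -> bytes:
    """Decompress only when the file name says the payload is gzipped."""
    if file_name.lower().endswith(".gz"):
        return gzip.decompress(payload)
    # Parquet (and plain CSV) bytes pass through untouched.
    return payload
```

Keying off the name means a parquet slice is never handed to gunzip, whatever its encryption flag says.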
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
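The two accepted body shapes can be normalised along these lines. This is a sketch only: `normalize_trigger_body` is a hypothetical name, the `ValueError` stands in for the 422 response, and treating an object without a `tables` key as "sync everything" is an assumption:

```python
from typing import Any, List, Optional


def normalize_trigger_body(body: Any) -> Optional[List[str]]:
    """Return the list of target tables, or None for 'sync everything'."""
    if body is None:
        return None
    if isinstance(body, dict):
        # Assumed: a missing or null 'tables' key means no filter.
        body = body.get("tables")
        if body is None:
            return None
    if isinstance(body, list) and all(isinstance(t, str) and t for t in body):
        return body
    raise ValueError("body must be a list of table ids or {'tables': [...]}")
```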
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
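The id-or-name target filter from the first bullet, as a minimal sketch. Rows are modelled as plain dicts with `id` and `name` keys, which is an assumption about the registry shape; `filter_targets` is an illustrative name:

```python
from typing import Dict, Iterable, List, Optional, Tuple


def filter_targets(
    rows: Iterable[Dict], tables: Optional[List[str]]
) -> Tuple[List[Dict], List[Dict]]:
    """Split rows into (selected, skipped) for a targeted trigger."""
    if tables is None:
        return list(rows), []  # no filter: every row is a candidate
    wanted = set(tables)
    selected, skipped = [], []
    for row in rows:
        # Admins pass either the registry id or the table name.
        if row["id"] in wanted or row["name"] in wanted:
            selected.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return selected, skipped
```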
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
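The plan-then-execute classification can be sketched as below. Everything here is illustrative: the dict-based row shape, the `build_plan` name, and the rule that a table without bucket or source_table is "invalid" are assumptions; only the four classes and the two drift_kind values come from the description above:

```python
from typing import Dict, List


def build_plan(discovered: List[Dict], registry: List[Dict]) -> Dict[str, List]:
    """Classify discovered tables without writing anything."""
    by_id = {r["id"]: r for r in registry}
    by_name = {r["name"].lower(): r for r in registry}
    plan = {"new": [], "existing_match": [], "existing_drift": [], "invalid": []}
    for t in discovered:
        if not t.get("bucket") or not t.get("source_table"):
            plan["invalid"].append(t)  # assumed validity rule
            continue
        existing = by_id.get(t["id"])
        if existing is not None:
            coords = (t["bucket"], t["source_table"])
            if (existing["bucket"], existing["source_table"]) == coords:
                plan["existing_match"].append(t)
            else:
                plan["existing_drift"].append(
                    {"drift_kind": "same_id_diff_coords",
                     "discovered": t, "registered": existing}
                )
            continue
        collided = by_name.get(t["name"].lower())
        if collided is not None:
            # Different id, same name (case-insensitive): admin override.
            plan["existing_drift"].append(
                {"drift_kind": "name_collision",
                 "discovered": t, "registered": collided}
            )
            continue
        plan["new"].append(t)
    return plan
```

Only the `new` list would be registered; a dry run returns the whole plan unchanged.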
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
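For context on why 'every 0m' is now a valid schedule, here is a sketch of strict interval parsing plus the due check. The regex and the function names `parse_interval` / `interval_due` are assumptions chosen to be consistent with the scheduler tests, not a copy of src/scheduler.py:

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_INTERVAL_RE = re.compile(r"every (\d+)([mh])")


def parse_interval(schedule: str) -> Optional[int]:
    """Return the interval in minutes, or None if the string is not 'every Nm|Nh'."""
    match = _INTERVAL_RE.fullmatch(schedule)
    if match is None:
        return None
    value = int(match.group(1))
    return value * 60 if match.group(2) == "h" else value


def interval_due(minutes: int, last_sync: datetime, now: datetime) -> bool:
    """Due once at least `minutes` have elapsed since last_sync."""
    return now - last_sync >= timedelta(minutes=minutes)
```

With an interval of 0 the elapsed-time comparison is true for any past sync, which is the "always due" behaviour the force-resync workflow relies on.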
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call tempdirs under /tmp, so multi-GiB / many-slice
exports fill the boot disk long before the dedicated data disk fills.
agnes-dev hit 100% boot-disk usage while the 20 GiB data disk had
15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path), _extract_via_legacy (CSV
fallback), and the sliced-CSV concat path in storage_api all use the
helper now.
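A sketch of the env-driven temp root with the fallbacks described above. `temp_root_from_env` is a stand-in name and the writability probe via `os.access` is an assumption; the real get_temp_root() may differ:

```python
import os
import tempfile
from typing import Optional


def temp_root_from_env(var: str = "AGNES_TEMP_DIR") -> Optional[str]:
    """Return a usable temp root, or None to fall back to the system tempdir."""
    target = os.environ.get(var, "").strip()
    if not target:
        return None  # unset or empty
    try:
        os.makedirs(target, exist_ok=True)  # create on first use
    except OSError:
        return None  # cannot create: fall back
    if not os.access(target, os.W_OK | os.X_OK):
        return None  # exists but not writable
    return target
```

Because `tempfile.TemporaryDirectory(dir=None)` already means "system default", callers can pass the result straight through as `dir=temp_root_from_env()`.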
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
494 lines
22 KiB
Python
"""Tests for src.scheduler - schedule parsing and sync-due evaluation."""

from datetime import datetime, timedelta, timezone
from typing import Optional

import pytest

from src.scheduler import (
    _is_daily_due,
    _parse_daily_times,
    _parse_timestamp,
    is_table_due,
    parse_interval_minutes,
)

# Fixed reference time: 2026-03-15 12:00:00 UTC
NOW = datetime(2026, 3, 15, 12, 0, 0, tzinfo=timezone.utc)


# ---------------------------------------------------------------------------
# parse_interval_minutes
# ---------------------------------------------------------------------------


class TestParseIntervalMinutes:
    """Tests for parse_interval_minutes()."""

    def test_minutes_basic(self) -> None:
        assert parse_interval_minutes("every 15m") == 15

    def test_minutes_single_digit(self) -> None:
        assert parse_interval_minutes("every 5m") == 5

    def test_minutes_large(self) -> None:
        assert parse_interval_minutes("every 120m") == 120

    def test_hours_basic(self) -> None:
        assert parse_interval_minutes("every 2h") == 120

    def test_hours_single(self) -> None:
        assert parse_interval_minutes("every 1h") == 60

    def test_hours_large(self) -> None:
        assert parse_interval_minutes("every 24h") == 1440

    def test_daily_returns_none(self) -> None:
        assert parse_interval_minutes("daily 05:00") is None

    def test_invalid_format_returns_none(self) -> None:
        assert parse_interval_minutes("not a schedule") is None

    def test_empty_string_returns_none(self) -> None:
        assert parse_interval_minutes("") is None

    def test_missing_unit_returns_none(self) -> None:
        assert parse_interval_minutes("every 15") is None

    def test_wrong_unit_returns_none(self) -> None:
        assert parse_interval_minutes("every 15s") is None

    def test_no_space_returns_none(self) -> None:
        assert parse_interval_minutes("every15m") is None

    def test_extra_whitespace_returns_none(self) -> None:
        # Strict parsing: extra whitespace is rejected
        assert parse_interval_minutes("every  15m") is None

    def test_negative_not_matched(self) -> None:
        # Regex uses \d+ so negative sign won't match
        assert parse_interval_minutes("every -5m") is None

    def test_zero_minutes(self) -> None:
        # "every 0m" matches the pattern, returns 0
        assert parse_interval_minutes("every 0m") == 0


# ---------------------------------------------------------------------------
# is_table_due - interval schedules
# ---------------------------------------------------------------------------


class TestIsTableDueInterval:
    """Tests for is_table_due() with interval-based schedules."""

    def test_never_synced_is_due(self) -> None:
        assert is_table_due("every 15m", last_sync_iso=None, now=NOW) is True

    def test_empty_last_sync_is_due(self) -> None:
        assert is_table_due("every 15m", last_sync_iso="", now=NOW) is True

    def test_every_0m_is_always_due(self) -> None:
        # ``every 0m`` opts out of rate limiting; used to force-resync
        # a row whose previous attempt errored without recording
        # last_sync. Even a sync seconds ago must come back as due.
        last_sync = (NOW - timedelta(seconds=5)).isoformat()
        assert is_table_due("every 0m", last_sync_iso=last_sync, now=NOW) is True
        assert is_table_due("every 0m", last_sync_iso=None, now=NOW) is True

    def test_synced_10min_ago_every_15m_not_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=10)).isoformat()
        assert is_table_due("every 15m", last_sync_iso=last_sync, now=NOW) is False

    def test_synced_20min_ago_every_15m_is_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=20)).isoformat()
        assert is_table_due("every 15m", last_sync_iso=last_sync, now=NOW) is True

    def test_synced_exactly_15min_ago_every_15m_is_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=15)).isoformat()
        assert is_table_due("every 15m", last_sync_iso=last_sync, now=NOW) is True

    def test_synced_30min_ago_every_1h_not_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=30)).isoformat()
        assert is_table_due("every 1h", last_sync_iso=last_sync, now=NOW) is False

    def test_synced_90min_ago_every_1h_is_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=90)).isoformat()
        assert is_table_due("every 1h", last_sync_iso=last_sync, now=NOW) is True

    def test_synced_exactly_1h_ago_every_1h_is_due(self) -> None:
        last_sync = (NOW - timedelta(hours=1)).isoformat()
        assert is_table_due("every 1h", last_sync_iso=last_sync, now=NOW) is True

    def test_synced_59min_ago_every_1h_not_due(self) -> None:
        last_sync = (NOW - timedelta(minutes=59)).isoformat()
        assert is_table_due("every 1h", last_sync_iso=last_sync, now=NOW) is False

    def test_synced_3h_ago_every_2h_is_due(self) -> None:
        last_sync = (NOW - timedelta(hours=3)).isoformat()
        assert is_table_due("every 2h", last_sync_iso=last_sync, now=NOW) is True


# ---------------------------------------------------------------------------
# is_table_due - daily schedules
# ---------------------------------------------------------------------------


class TestIsTableDueDaily:
    """Tests for is_table_due() with daily schedules."""

    def test_before_target_time_not_due(self) -> None:
        now = datetime(2026, 3, 15, 4, 30, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 6, 0, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 05:00", last_sync_iso=last_sync, now=now) is False

    def test_past_target_not_synced_today_is_due(self) -> None:
        now = datetime(2026, 3, 15, 5, 30, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 4, 0, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 05:00", last_sync_iso=last_sync, now=now) is True

    def test_past_target_already_synced_after_target_not_due(self) -> None:
        now = datetime(2026, 3, 15, 5, 30, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 5, 15, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 05:00", last_sync_iso=last_sync, now=now) is False

    def test_evening_schedule_past_target_last_sync_yesterday_is_due(self) -> None:
        now = datetime(2026, 3, 15, 18, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 17, 30, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 17:00", last_sync_iso=last_sync, now=now) is True

    def test_daily_never_synced_is_due(self) -> None:
        now = datetime(2026, 3, 15, 6, 0, 0, tzinfo=timezone.utc)
        assert is_table_due("daily 05:00", last_sync_iso=None, now=now) is True

    def test_daily_never_synced_before_target_still_due(self) -> None:
        # Never synced always returns True regardless of target time
        now = datetime(2026, 3, 15, 3, 0, 0, tzinfo=timezone.utc)
        assert is_table_due("daily 05:00", last_sync_iso=None, now=now) is True

    def test_daily_exactly_at_target_time_is_due(self) -> None:
        now = datetime(2026, 3, 15, 5, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 5, 0, 0, tzinfo=timezone.utc).isoformat()
        # now == today_target, so now < today_target is False
        # last_sync (yesterday) < today_target => due
        assert is_table_due("daily 05:00", last_sync_iso=last_sync, now=now) is True

    def test_daily_synced_at_exactly_target_not_due_again(self) -> None:
        now = datetime(2026, 3, 15, 5, 30, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 5, 0, 0, tzinfo=timezone.utc).isoformat()
        # last_sync == today_target => last_sync >= today_target => not due
        assert is_table_due("daily 05:00", last_sync_iso=last_sync, now=now) is False

    def test_midnight_schedule(self) -> None:
        now = datetime(2026, 3, 15, 0, 30, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 0, 15, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 00:00", last_sync_iso=last_sync, now=now) is True

    def test_end_of_day_schedule(self) -> None:
        now = datetime(2026, 3, 15, 23, 59, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 23, 50, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 23:30", last_sync_iso=last_sync, now=now) is True


# ---------------------------------------------------------------------------
# is_table_due - edge cases
# ---------------------------------------------------------------------------


class TestIsTableDueEdgeCases:
    """Edge case tests for is_table_due()."""

    def test_unparseable_last_sync_returns_true(self) -> None:
        # Fail-safe: if we can't parse last_sync, assume sync is needed
        assert is_table_due("every 15m", last_sync_iso="garbage", now=NOW) is True

    def test_unknown_schedule_format_returns_false(self) -> None:
        last_sync = (NOW - timedelta(hours=2)).isoformat()
        assert is_table_due("weekly", last_sync_iso=last_sync, now=NOW) is False

    def test_unknown_schedule_never_synced_returns_true(self) -> None:
        # Never synced takes priority over unknown schedule
        assert is_table_due("weekly", last_sync_iso=None, now=NOW) is True

    def test_now_defaults_to_current_time(self) -> None:
        # When now is not provided, it defaults to current UTC time
        # A table that was never synced should be due regardless
        assert is_table_due("every 15m", last_sync_iso=None) is True

    def test_naive_last_sync_treated_as_utc(self) -> None:
        # Naive timestamp (no timezone) should be treated as UTC
        naive_ts = "2026-03-15T11:50:00"
        # 10 minutes ago from NOW (12:00), with 15m interval -> not due
        assert is_table_due("every 15m", last_sync_iso=naive_ts, now=NOW) is False

    def test_last_sync_in_future_not_due(self) -> None:
        # Edge case: last_sync in the future (clock skew, etc.)
        future = (NOW + timedelta(hours=1)).isoformat()
        assert is_table_due("every 15m", last_sync_iso=future, now=NOW) is False


# ---------------------------------------------------------------------------
# _is_daily_due (internal function, direct tests)
# ---------------------------------------------------------------------------


class TestIsDailyDue:
    """Direct tests for _is_daily_due() internal function."""

    def test_before_target_not_due(self) -> None:
        now = datetime(2026, 3, 15, 4, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 5, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(5, 0)]) is False

    def test_after_target_last_sync_before_target_is_due(self) -> None:
        now = datetime(2026, 3, 15, 6, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 4, 0, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(5, 0)]) is True

    def test_after_target_last_sync_after_target_not_due(self) -> None:
        now = datetime(2026, 3, 15, 6, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 5, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(5, 0)]) is False

    def test_target_with_minutes(self) -> None:
        now = datetime(2026, 3, 15, 17, 45, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 10, 0, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(17, 30)]) is True

    def test_target_with_minutes_not_yet(self) -> None:
        now = datetime(2026, 3, 15, 17, 15, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 10, 0, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(17, 30)]) is False


class TestMultipleDailyTimes:
    """Tests for multiple daily schedule times."""

    def test_multi_time_first_due(self) -> None:
        now = datetime(2026, 3, 15, 8, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 14, 19, 0, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(7, 0), (13, 0), (18, 0)]) is True

    def test_multi_time_second_due(self) -> None:
        now = datetime(2026, 3, 15, 14, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 7, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(7, 0), (13, 0), (18, 0)]) is True

    def test_multi_time_third_due(self) -> None:
        now = datetime(2026, 3, 15, 19, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 13, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(7, 0), (13, 0), (18, 0)]) is True

    def test_multi_time_between_slots_not_due(self) -> None:
        now = datetime(2026, 3, 15, 10, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 7, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(7, 0), (13, 0), (18, 0)]) is False

    def test_multi_time_all_done_not_due(self) -> None:
        now = datetime(2026, 3, 15, 20, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 18, 30, 0, tzinfo=timezone.utc)
        assert _is_daily_due(last_sync, now, [(7, 0), (13, 0), (18, 0)]) is False

    def test_is_table_due_multi_time_format(self) -> None:
        now = datetime(2026, 3, 15, 14, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 7, 30, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 07:00,13:00,18:00", last_sync_iso=last_sync, now=now) is True

    def test_is_table_due_multi_time_not_due(self) -> None:
        now = datetime(2026, 3, 15, 10, 0, 0, tzinfo=timezone.utc)
        last_sync = datetime(2026, 3, 15, 7, 30, 0, tzinfo=timezone.utc).isoformat()
        assert is_table_due("daily 07:00,13:00,18:00", last_sync_iso=last_sync, now=now) is False


class TestParseDailyTimes:
    """Tests for _parse_daily_times()."""

    def test_single_time(self) -> None:
        assert _parse_daily_times("05:00") == [(5, 0)]

    def test_multiple_times(self) -> None:
        assert _parse_daily_times("07:00,13:00,18:00") == [(7, 0), (13, 0), (18, 0)]

    def test_invalid_format(self) -> None:
        assert _parse_daily_times("7:00") == []

    def test_invalid_hour(self) -> None:
        assert _parse_daily_times("25:00") == []

    def test_invalid_minute(self) -> None:
        assert _parse_daily_times("12:60") == []


# ---------------------------------------------------------------------------
# _parse_timestamp
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
class TestParseTimestamp:
|
|
"""Tests for _parse_timestamp() internal function."""
|
|
|
|
def test_iso_with_timezone(self) -> None:
|
|
result = _parse_timestamp("2026-03-15T12:00:00+00:00")
|
|
assert result is not None
|
|
assert result.year == 2026
|
|
assert result.month == 3
|
|
assert result.day == 15
|
|
assert result.hour == 12
|
|
|
|
def test_iso_with_z_suffix(self) -> None:
|
|
# Python 3.11+ fromisoformat handles Z
|
|
result = _parse_timestamp("2026-03-15T12:00:00Z")
|
|
assert result is not None
|
|
assert result.hour == 12
|
|
|
|
def test_iso_without_timezone(self) -> None:
|
|
result = _parse_timestamp("2026-03-15T12:00:00")
|
|
assert result is not None
|
|
assert result.hour == 12
|
|
assert result.tzinfo is None
|
|
|
|
def test_iso_with_microseconds(self) -> None:
|
|
result = _parse_timestamp("2026-03-15T12:00:00.123456")
|
|
assert result is not None
|
|
assert result.microsecond == 123456
|
|
|
|
def test_space_separated(self) -> None:
|
|
result = _parse_timestamp("2026-03-15 12:00:00")
|
|
assert result is not None
|
|
assert result.hour == 12
|
|
|
|
def test_invalid_string_returns_none(self) -> None:
|
|
assert _parse_timestamp("not-a-date") is None
|
|
|
|
def test_empty_string_returns_none(self) -> None:
|
|
assert _parse_timestamp("") is None
|
|
|
|
def test_partial_date_returns_none(self) -> None:
|
|
# "2026-03-15" alone - fromisoformat handles date-only in 3.11+
|
|
result = _parse_timestamp("2026-03-15")
|
|
# Should parse as a date (with hour=0, minute=0)
|
|
assert result is not None
|
|
assert result.hour == 0
|
|
|
|
def test_iso_with_positive_offset(self) -> None:
|
|
result = _parse_timestamp("2026-03-15T12:00:00+05:30")
|
|
assert result is not None
|
|
assert result.hour == 12
|
|
assert result.utcoffset() is not None
|
|
|
|
def test_iso_with_negative_offset(self) -> None:
|
|
result = _parse_timestamp("2026-03-15T12:00:00-07:00")
|
|
assert result is not None
|
|
assert result.utcoffset() is not None
|
|
|
|
def test_numeric_garbage_returns_none(self) -> None:
|
|
assert _parse_timestamp("12345") is None
|
|
|
|
def test_none_like_string_returns_none(self) -> None:
|
|
assert _parse_timestamp("None") is None
|
|
|
|
|
|
# ---------------------------------------------------------------------------
# LLM pipeline cadence env vars (#179 review)
#
# Three jobs (session-collector, verification-detector, corporate-memory) and
# the health-check staleness grace must all derive from a single SCHEDULER_*
# env var per job, so an operator changing the cadence in one place moves
# both the schedule string and the grace window. The env var name was already
# read in app/api/health.py before this change but didn't actually drive the
# scheduler; this test pins the wired-up behavior.
# ---------------------------------------------------------------------------


_LLM_PIPELINE_ENV = (
    "SCHEDULER_DATA_REFRESH_INTERVAL",
    "SCHEDULER_HEALTH_CHECK_INTERVAL",
    "SCHEDULER_TICK_SECONDS",
    "SCHEDULER_SCRIPT_RUN_INTERVAL",
    "SCHEDULER_SESSION_COLLECTOR_INTERVAL",
    "SCHEDULER_VERIFICATION_DETECTOR_INTERVAL",
    "SCHEDULER_CORPORATE_MEMORY_INTERVAL",
)


def _clear_scheduler_env(monkeypatch) -> None:
    for v in _LLM_PIPELINE_ENV:
        monkeypatch.delenv(v, raising=False)


class TestLLMPipelineCadenceEnvVars:
    """Three new env vars drive both the scheduler and the health grace window."""

    def test_default_cadences_preserve_coprime_offset(self, monkeypatch) -> None:
        """Defaults are 10m / 15m / 17m so the three jobs don't fire on the same tick."""
        _clear_scheduler_env(monkeypatch)
        from services.scheduler.__main__ import build_jobs
        jobs = {name: schedule for name, schedule, *_ in build_jobs()}
        assert jobs["session-collector"] == "every 10m"
        assert jobs["verification-detector"] == "every 15m"
        assert jobs["corporate-memory"] == "every 17m"

    def test_session_collector_env_override_changes_cadence(self, monkeypatch) -> None:
        _clear_scheduler_env(monkeypatch)
        monkeypatch.setenv("SCHEDULER_SESSION_COLLECTOR_INTERVAL", "300")  # 5m
        from services.scheduler.__main__ import build_jobs
        jobs = {name: schedule for name, schedule, *_ in build_jobs()}
        assert jobs["session-collector"] == "every 5m"
        # Other LLM jobs must be unaffected.
        assert jobs["verification-detector"] == "every 15m"
        assert jobs["corporate-memory"] == "every 17m"

    def test_verification_detector_env_override_changes_cadence(self, monkeypatch) -> None:
        _clear_scheduler_env(monkeypatch)
        monkeypatch.setenv("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL", "600")  # 10m
        from services.scheduler.__main__ import build_jobs
        jobs = {name: schedule for name, schedule, *_ in build_jobs()}
        assert jobs["verification-detector"] == "every 10m"
        assert jobs["session-collector"] == "every 10m"
        assert jobs["corporate-memory"] == "every 17m"

    def test_corporate_memory_env_override_changes_cadence(self, monkeypatch) -> None:
        _clear_scheduler_env(monkeypatch)
        monkeypatch.setenv("SCHEDULER_CORPORATE_MEMORY_INTERVAL", "1800")  # 30m
        from services.scheduler.__main__ import build_jobs
        jobs = {name: schedule for name, schedule, *_ in build_jobs()}
        assert jobs["corporate-memory"] == "every 30m"
        assert jobs["session-collector"] == "every 10m"
        assert jobs["verification-detector"] == "every 15m"

    @pytest.mark.parametrize("var", [
        "SCHEDULER_SESSION_COLLECTOR_INTERVAL",
        "SCHEDULER_VERIFICATION_DETECTOR_INTERVAL",
        "SCHEDULER_CORPORATE_MEMORY_INTERVAL",
    ])
    @pytest.mark.parametrize("bad", ["0", "-5", "abc", ""])
    def test_invalid_llm_env_rejected(self, monkeypatch, var, bad) -> None:
        _clear_scheduler_env(monkeypatch)
        monkeypatch.setenv(var, bad)
        from services.scheduler.__main__ import build_jobs
        with pytest.raises(ValueError):
            build_jobs()


class TestVerificationDetectorGraceFollowsCadence:
    """The health-check grace window is 2x the cadence; the same env var drives both."""

    def test_grace_doubles_when_env_overrides_cadence(self, monkeypatch) -> None:
        _clear_scheduler_env(monkeypatch)
        monkeypatch.setenv("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL", "600")  # 10m
        from app.api.health import _verification_detector_grace_seconds
        from services.scheduler.__main__ import build_jobs

        jobs = {name: schedule for name, schedule, *_ in build_jobs()}
        # Cadence and grace MUST be derived from the same env var, so an
        # operator who throttles the detector for any reason (rate-limit,
        # cost, debugging) gets a proportionally wider staleness window
        # automatically, with no second knob to forget.
        assert jobs["verification-detector"] == "every 10m"
        assert _verification_detector_grace_seconds() == 2 * 600

    def test_grace_uses_default_cadence_when_env_unset(self, monkeypatch) -> None:
        _clear_scheduler_env(monkeypatch)
        from app.api.health import _verification_detector_grace_seconds
        # Default cadence 900s -> grace 1800s.
        assert _verification_detector_grace_seconds() == 2 * 900
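
The contract pinned by the cadence and grace tests (one SCHEDULER_*_INTERVAL value, in seconds, feeding both the schedule string and the 2x staleness grace) can be sketched roughly as follows. This is an illustrative sketch only, not the code in services/scheduler/__main__.py or app/api/health.py. The helper names interval_seconds, schedule_string and grace_seconds, the DEFAULT_INTERVALS table, and the optional env parameter are all assumptions introduced here. Only the env var names, the 10m / 15m / 17m defaults, the rejected values and the 2x factor come from the tests, and the tests confirm the 2x factor only for the verification detector.

```python
import os

# Hypothetical defaults in seconds, mirroring the cadences the tests assert.
DEFAULT_INTERVALS = {
    "SCHEDULER_SESSION_COLLECTOR_INTERVAL": 600,       # 10m
    "SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 900,   # 15m
    "SCHEDULER_CORPORATE_MEMORY_INTERVAL": 1020,       # 17m
}


def interval_seconds(var, env=None):
    """Return the cadence for `var` in seconds.

    An unset variable falls back to the default. A variable that is set but
    is not a positive integer ("0", "-5", "abc", "") raises ValueError.
    """
    env = os.environ if env is None else env
    raw = env.get(var)
    if raw is None:
        return DEFAULT_INTERVALS[var]
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{var} must be a positive integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{var} must be a positive integer, got {raw!r}")
    return value


def schedule_string(seconds):
    """Render a cadence as 'every Nm' (this sketch assumes whole minutes)."""
    return f"every {seconds // 60}m"


def grace_seconds(var, env=None):
    """Staleness grace is twice the cadence, read from the same variable."""
    return 2 * interval_seconds(var, env)
```

Because both the schedule string and the grace go through interval_seconds, changing the one variable moves both values together, which is the property the tests above check against the real build_jobs and _verification_detector_grace_seconds.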