* fix(sync+ops): defer-probe race, AGNES_TEMP_DIR chown, default-schedule env knob
Three sync-ops fixes surfaced during agnes-dev steady-state operation
after the v0.46→v0.54 cutover settled. None of them depend on each
other; bundled because they all live in the sync trigger / agnes-auto-
upgrade flow and are diagnosed from the same observation window.
1. (fix) /api/sync/status race window. The trigger handler returned 200
BEFORE the background task acquired _sync_lock. In that few-hundred-ms
gap, an honest /api/sync/status call returned locked=false — and the
host-side agnes-auto-upgrade.sh defer probe fired right in that
window proceeded with 'docker compose up -d' and SIGKILLed the
just-spawning extractor / materialized worker.
Observed on agnes-dev: 3 mid-sync container kills in 30 min, each
followed by a few-min outage and a partial sync. The WAL replay
auto-recovery (PR #217) kept the system DB consistent through each
kill, but the actual sync work was lost.
Fix: handler stamps _recent_trigger_at; status endpoint returns
locked=true for _TRIGGER_HOLD_SEC (=30s) after the most recent
trigger, even if the background task hasn't yet acquired the lock.
30s covers the schedule → spawn latency with margin; short enough
not to indefinitely block auto-upgrade after a one-off trigger.
Defense in depth: the real lock still gates the extractor subprocess.
2. (fix) scripts/ops/agnes-auto-upgrade.sh: post-upgrade chown loop
now mkdir -p's /data/tmp before chown'ing, and includes it in the
list of dirs that get the runtime UID:GID. /data/tmp is the default
AGNES_TEMP_DIR set in docker-compose.yml — Snowflake-UNLOAD slice
staging and CSV intermediates land here. Pre-fix the runtime user
(uid 999) couldn't create /data/tmp under a root-owned data-disk
root, so tempfiles silently fell back to the boot disk's overlayfs
/tmp — defeating the whole point of routing slice staging onto the
dedicated data volume.
3. (feat) AGNES_DEFAULT_SYNC_SCHEDULE env var sets the platform-wide
fallback sync_schedule. Lets a deployment dial cadence down to
'daily 03:00' (data freshness budget once-per-day) without having
to PUT every registry row. Per-table sync_schedule still wins;
literal 'every 1h' is the floor if neither is set — OSS-historical
default unchanged.
Tests:
- test_sync_status_trigger_hold_window_reports_locked_after_trigger
- test_sync_status_trigger_hold_window_expires
- test_default_schedule_falls_through_env_then_every_1h (3 branches)
* release: 0.54.3 — sync defer-probe race + AGNES_TEMP_DIR chown + default-schedule env knob
Last commit on the PR per CLAUDE.md hard rule. Patch bump (0.54.2 →
0.54.3) bundling three sync-ops fixes from agnes-dev steady-state
observation.
No DB migration; trigger-hold window is additive (anything that already
saw locked=true still does — the window EXTENDS the true period);
/data/tmp chown is no-op when already correct; AGNES_DEFAULT_SYNC_SCHEDULE
unset = every-1h default unchanged.