fix(sync+ops): defer-probe race, AGNES_TEMP_DIR chown, default-schedule env knob (#283)

* fix(sync+ops): defer-probe race, AGNES_TEMP_DIR chown, default-schedule env knob

Three sync-ops fixes surfaced during agnes-dev steady-state operation
after the v0.46→v0.54 cutover settled. None of them depend on each
other; bundled because they all live in the sync trigger / agnes-auto-
upgrade flow and are diagnosed from the same observation window.

1. (fix) /api/sync/status race window. The trigger handler returned 200
   BEFORE the background task acquired _sync_lock. In that few-hundred-ms
   gap, an honest /api/sync/status call returned locked=false — and the
   host-side agnes-auto-upgrade.sh defer probe fired right in that
   window proceeded with 'docker compose up -d' and SIGKILLed the
   just-spawning extractor / materialized worker.

   Observed on agnes-dev: 3 mid-sync container kills in 30 min, each
   followed by a few-min outage and a partial sync. The WAL replay
   auto-recovery (PR #217) kept the system DB consistent through each
   kill, but the actual sync work was lost.

   Fix: handler stamps _recent_trigger_at; status endpoint returns
   locked=true for _TRIGGER_HOLD_SEC (=30s) after the most recent
   trigger, even if the background task hasn't yet acquired the lock.
   30s covers the schedule → spawn latency with margin; short enough
   not to indefinitely block auto-upgrade after a one-off trigger.
   Defense in depth: the real lock still gates the extractor subprocess.

2. (fix) scripts/ops/agnes-auto-upgrade.sh: post-upgrade chown loop
   now mkdir -p's /data/tmp before chown'ing, and includes it in the
   list of dirs that get the runtime UID:GID. /data/tmp is the default
   AGNES_TEMP_DIR set in docker-compose.yml — Snowflake-UNLOAD slice
   staging and CSV intermediates land here. Pre-fix the runtime user
   (uid 999) couldn't create /data/tmp under a root-owned data-disk
   root, so tempfiles silently fell back to the boot disk's overlayfs
   /tmp — defeating the whole point of routing slice staging onto the
   dedicated data volume.

3. (feat) AGNES_DEFAULT_SYNC_SCHEDULE env var sets the platform-wide
   fallback sync_schedule. Lets a deployment dial cadence down to
   'daily 03:00' (data freshness budget once-per-day) without having
   to PUT every registry row. Per-table sync_schedule still wins;
   literal 'every 1h' is the floor if neither is set — OSS-historical
   default unchanged.

Tests:
- test_sync_status_trigger_hold_window_reports_locked_after_trigger
- test_sync_status_trigger_hold_window_expires
- test_default_schedule_falls_through_env_then_every_1h (3 branches)

* release: 0.54.3 — sync defer-probe race + AGNES_TEMP_DIR chown + default-schedule env knob

Last commit on the PR per CLAUDE.md hard rule. Patch bump (0.54.2 →
0.54.3) bundling three sync-ops fixes from agnes-dev steady-state
observation.

No DB migration; trigger-hold window is additive (anything that already
saw locked=true still does — the window EXTENDS the true period);
/data/tmp chown is no-op when already correct; AGNES_DEFAULT_SYNC_SCHEDULE
unset = every-1h default unchanged.
This commit is contained in:
ZdenekSrotyr 2026-05-13 11:44:20 +02:00 committed by GitHub
parent 50a974f196
commit 117b6784ea
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
6 changed files with 180 additions and 7 deletions

View file

@ -10,6 +10,15 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
## [0.54.3] — 2026-05-13
### Added
- `AGNES_DEFAULT_SYNC_SCHEDULE` env var (consumed by `app/api/sync.py:_run_materialized_pass`) sets the platform-wide fallback `sync_schedule` for registry rows that don't pin their own value. Lets a deployment dial cadence down to `daily 03:00` without having to PUT every row. Per-table `sync_schedule` still wins; literal `every 1h` is the floor if neither is set (matches OSS-historical behaviour).
### Fixed
- `GET /api/sync/status` no longer reports `locked=false` during the ~few-hundred-ms window between the trigger handler's 200 response and the background task's `_sync_lock.acquire()`. The handler now stamps `_recent_trigger_at`, and the status endpoint returns `locked=true` for `_TRIGGER_HOLD_SEC` (=30s) after the most recent trigger. Pre-fix, host-side `agnes-auto-upgrade.sh` defer probe firing in that window saw an honest `locked=false` and proceeded with `docker compose up -d`, SIGKILLing the just-spawning extractor / materialized worker. Observed on agnes-dev: 3 mid-sync container kills in 30 min until the trigger-hold window closed the gap.
- `scripts/ops/agnes-auto-upgrade.sh`: the post-upgrade chown loop now includes `/data/tmp` (the default `AGNES_TEMP_DIR` set in `docker-compose.yml`) and `mkdir -p`'s it first. Pre-fix the runtime user (`uid 999`) couldn't create `/data/tmp` under a root-owned data-disk root, so tempfiles silently fell back to the boot disk's overlayfs `/tmp` — defeating the whole point of routing slice staging onto the dedicated data volume.
## [0.54.2] — 2026-05-13
### Added

View file

@ -41,6 +41,26 @@ router = APIRouter(prefix="/api/sync", tags=["sync"])
# something bypasses the handler).
_sync_lock = threading.Lock()
# Race-protection: the trigger handler returns 200 BEFORE the background task
# acquires ``_sync_lock``. In that ~few-hundred-ms gap, ``/api/sync/status``
# would honestly report ``locked=False`` — and the host-side
# ``agnes-auto-upgrade.sh`` defer probe (which polls this endpoint) would
# proceed with ``docker compose up -d`` and SIGKILL the still-spawning
# extractor / materialized worker. Mid-sync container kill is the exact
# class of corruption the WAL replay auto-recovery is meant to be a
# safety net for, not a routine occurrence.
#
# Fix: stamp the trigger time alongside the lock. ``/api/sync/status`` also
# returns ``locked=True`` for ``_TRIGGER_HOLD_SEC`` seconds after the most
# recent trigger, even if the background task hasn't yet acquired the lock.
# The window is short enough that an operator-issued ``/api/sync/trigger``
# followed by an immediate ``GET /api/sync/status`` is consistent
# (locked=True), but long enough to cover the schedule → background-task
# spawn latency. Defense in depth: the real lock still gates the
# extractor subprocess.
_TRIGGER_HOLD_SEC = 30
_recent_trigger_at: float = 0.0 # monotonic clock; 0 = never triggered
def _file_hash(path: Path) -> str:
if not path.exists():
@ -164,7 +184,17 @@ def _run_materialized_pass(
last = state.get_last_sync(ref_name)
last_iso = last.isoformat() if last else None
schedule = row.get("sync_schedule") or "every 1h"
# Per-table schedule wins; fall through to AGNES_DEFAULT_SYNC_SCHEDULE
# (operator override), then to ``every 1h`` (OSS-historical default).
# The env knob lets a deployment dial down the platform-wide refresh
# cadence without having to PUT every registry row — useful when
# data freshness budget is "once per day" and the hourly default
# over-fetches.
schedule = (
row.get("sync_schedule")
or os.environ.get("AGNES_DEFAULT_SYNC_SCHEDULE", "").strip()
or "every 1h"
)
if not is_table_due(schedule, last_iso):
summary["skipped"].append({"table": ref_name, "reason": "due_check"})
continue
@ -785,10 +815,19 @@ async def sync_status():
Returns:
``{"locked": bool}`` True if `_sync_lock` is currently held by
a `_run_sync` invocation. The host script defers the upgrade
when this is True and retries on the next 5-min cron tick.
a `_run_sync` invocation, OR a sync was triggered within the
last ``_TRIGGER_HOLD_SEC`` seconds (so the FastAPI background
task hasn't yet acquired the lock). Without the trigger-hold
window, an auto-upgrade probe firing in the gap between the
trigger handler's 200 response and the background task's
``_sync_lock.acquire()`` would see ``locked=False`` and proceed
with ``up -d`` killing the just-spawning extractor.
"""
return {"locked": _sync_lock.locked()}
locked = _sync_lock.locked()
if not locked and _recent_trigger_at:
# Monotonic deadline; clock skew / DST jumps don't matter.
locked = (time.monotonic() - _recent_trigger_at) < _TRIGGER_HOLD_SEC
return {"locked": locked}
# ---- Trigger ----
@ -870,6 +909,12 @@ async def trigger_sync(
detail="sync_already_in_progress",
)
_t0 = time.monotonic()
# Stamp the trigger time so `/api/sync/status` reports locked=True
# for the next ``_TRIGGER_HOLD_SEC`` even though the background
# task hasn't yet acquired ``_sync_lock``. Closes the race window
# the host-side ``agnes-auto-upgrade.sh`` defer probe was hitting.
global _recent_trigger_at
_recent_trigger_at = _t0
background_tasks.add_task(_run_sync, tables)
try:
from src.db import get_system_db

View file

@ -1,6 +1,6 @@
[project]
name = "agnes-the-ai-analyst"
version = "0.54.2"
version = "0.54.3"
description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14"
license = "MIT"

View file

@ -176,7 +176,15 @@ if [ "$BEFORE" != "$AFTER" ] || [ "$CONFIG_BEFORE" != "$CONFIG_AFTER" ]; then
IMAGE_UIDGID=$(docker run --rm --entrypoint cat "$IMAGE" /etc/passwd 2>/dev/null \
| awk -F: -v u="${IMAGE_USER%%:*}" '$1==u || $3==u {print $3":"$4; exit}')
if [ -n "$IMAGE_UIDGID" ]; then
for d in "$STATE_DIR" /data/extracts /data/analytics; do
# /data/tmp is the default ``AGNES_TEMP_DIR`` from
# docker-compose.yml — Snowflake-UNLOAD slice staging
# and CSV intermediates land here. Without an explicit
# mkdir + chown, the runtime user can't create it
# under a root-owned ``/data`` (the data disk's root
# comes up root-owned on first mount). ``mkdir -p``
# is idempotent so existing dirs survive.
mkdir -p /data/tmp 2>/dev/null || true
for d in "$STATE_DIR" /data/extracts /data/analytics /data/tmp; do
[ -d "$d" ] && chown -R "$IMAGE_UIDGID" "$d" 2>/dev/null || true
done
fi

View file

@ -41,6 +41,60 @@ def fake_registry_with_one_materialized(monkeypatch, tmp_path):
return state
def test_default_schedule_falls_through_env_then_every_1h(
monkeypatch, fake_registry_with_one_materialized,
):
"""Per-table ``sync_schedule=None`` → fall through to
``AGNES_DEFAULT_SYNC_SCHEDULE`` env (operator deployment override)
fall through to literal ``every 1h`` (OSS-historical default).
Test the THREE branches:
1. Per-table schedule wins over env.
2. Env wins when per-table is None.
3. ``every 1h`` is the floor env unset + per-table None.
Branch (2) is the operator knob for ``daily 03:00`` deployments
(data freshness budget once-per-day; the hourly default
over-fetches Snowflake on every Keboola export-async cycle)."""
captured = {}
def fake_is_due(schedule, last_iso, now=None):
captured["schedule"] = schedule
return False # short-circuit the dispatcher
monkeypatch.setattr("app.api.sync.is_table_due", fake_is_due)
# Case 3: env unset, per-table None → "every 1h"
monkeypatch.delenv("AGNES_DEFAULT_SYNC_SCHEDULE", raising=False)
_run_materialized_pass(MagicMock(), MagicMock())
assert captured["schedule"] == "every 1h", captured
# Case 2: env set, per-table None → env value
monkeypatch.setenv("AGNES_DEFAULT_SYNC_SCHEDULE", "daily 03:00")
_run_materialized_pass(MagicMock(), MagicMock())
assert captured["schedule"] == "daily 03:00", captured
# Case 1: per-table schedule wins over env. (Mutate fixture's row.)
fake_registry_with_one_materialized # ensure fixture is loaded
import app.api.sync as _sm
# The fixture's _Repo.list_all returns a captured list; reach into
# its closure isn't easy. Easier: monkeypatch list_all directly.
pinned_rows = [{
"id": "in_flight_t", "name": "in_flight_t",
"query_mode": "materialized", "source_type": "bigquery",
"source_query": "SELECT 1",
"sync_schedule": "every 30m", # explicit per-table
}]
class _RepoWithSched:
def __init__(self, conn): pass
def list_all(self): return pinned_rows
monkeypatch.setattr(_sm, "TableRegistryRepository", _RepoWithSched)
_run_materialized_pass(MagicMock(), MagicMock())
assert captured["schedule"] == "every 30m", captured
def test_in_flight_recorded_as_skipped_not_error(fake_registry_with_one_materialized):
state = fake_registry_with_one_materialized

View file

@ -18,12 +18,16 @@ from app.api import sync as sync_module
@pytest.fixture(autouse=True)
def reset_sync_lock():
"""Make sure each test starts with a free lock — and never leaves one
held even if an assertion fires mid-test."""
held even if an assertion fires mid-test. Also wipe the trigger-hold
timestamp so an earlier test's ``_recent_trigger_at`` doesn't
silently pin the next test's ``/api/sync/status`` at locked=True."""
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
sync_module._recent_trigger_at = 0.0
yield
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
sync_module._recent_trigger_at = 0.0
def test_run_sync_skips_when_lock_held(capsys):
@ -222,3 +226,56 @@ def test_sync_status_does_not_require_auth():
# No auth headers at all.
resp = client.get("/api/sync/status")
assert resp.status_code == 200
def test_sync_status_trigger_hold_window_reports_locked_after_trigger():
"""Race-protection: even when ``_sync_lock`` is NOT yet held (the
background task hasn't acquired it), a recent ``_recent_trigger_at``
timestamp within ``_TRIGGER_HOLD_SEC`` must report
``{"locked": true}``. Without this, an auto-upgrade defer probe
firing in the few-hundred-ms window between the trigger handler's
200 response and ``_run_sync.acquire()`` would see locked=False
and SIGKILL the spawning extractor."""
import time as _time
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
# Stamp a fresh trigger time, then immediately probe.
sync_module._recent_trigger_at = _time.monotonic()
try:
from fastapi.testclient import TestClient
from fastapi import FastAPI
app = FastAPI()
app.include_router(sync_module.router)
client = TestClient(app)
resp = client.get("/api/sync/status")
assert resp.status_code == 200
assert resp.json() == {"locked": True}, (
"trigger-hold window not honored — auto-upgrade defer probe "
"would race the background task acquisition"
)
finally:
sync_module._recent_trigger_at = 0.0
def test_sync_status_trigger_hold_window_expires():
"""Trigger-hold reports locked only for ``_TRIGGER_HOLD_SEC`` — a
stale timestamp past the window must NOT pin the probe at True
forever. Without expiry, a single trigger would block all
auto-upgrades indefinitely."""
if sync_module._sync_lock.locked():
sync_module._sync_lock.release()
# Stamp a timestamp ``_TRIGGER_HOLD_SEC + 5`` seconds in the past.
import time as _time
sync_module._recent_trigger_at = (
_time.monotonic() - sync_module._TRIGGER_HOLD_SEC - 5
)
try:
from fastapi.testclient import TestClient
from fastapi import FastAPI
app = FastAPI()
app.include_router(sync_module.router)
client = TestClient(app)
resp = client.get("/api/sync/status")
assert resp.json() == {"locked": False}
finally:
sync_module._recent_trigger_at = 0.0