agnes-the-ai-analyst/app/auth/scheduler_token.py
ZdenekSrotyr 995e4cd366
fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127)
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret

Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:

1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
   from the scheduler container, racing the app's long-lived system.duckdb
   handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
   "Could not set lock on file ... PID 0".

2. The data-refresh + new marketplaces jobs both 401-ed on the API because
   SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
   The scheduler had no credential to authenticate with.

Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
  through the app process so it inherits the existing DB connection.
- Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and
  the scheduler is reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
  startup script generates a 256-bit secret on first boot, persists it
  across reboots, and writes it to /opt/agnes/.env. Both containers source
  the same .env. The app validates incoming Bearer tokens against the env
  var (constant-time, length-floored) and resolves matches to a synthetic
  scheduler@system.local user that's a member of the Admin system group.
  Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
  has a valid actor; lazy seed in get_scheduler_user covers token rotation
  before the next app restart.
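
The scheduler side of the shared-secret path can be sketched as follows.
This is a minimal illustration, not the scheduler's actual code: the base
URL and the helper name are assumptions, and it uses the standard library's
urllib rather than the httpx client the scheduler uses. It only builds the
request and does not send it.

```python
import os
import urllib.request

# Hypothetical base URL; the scheduler's real target is not shown in this commit.
APP_BASE_URL = "http://app:8000"


def build_sync_all_request(secret: str) -> urllib.request.Request:
    """Build (but do not send) the POST that triggers the bulk marketplace sync."""
    return urllib.request.Request(
        f"{APP_BASE_URL}/api/marketplaces/sync-all",
        method="POST",
        # The app compares this bearer value against SCHEDULER_API_TOKEN.
        headers={"Authorization": f"Bearer {secret}"},
    )


if __name__ == "__main__":
    req = build_sync_all_request(os.environ.get("SCHEDULER_API_TOKEN", ""))
    print(req.get_method(), req.full_url)
```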

Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.

Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.
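
For operators who prefer Python to openssl, a replacement secret could be
generated roughly like this. It is a sketch under assumptions: the hex
encoding and the helper name are illustrative choices, not necessarily what
the startup script does.

```python
import secrets

# Mirrors the floor enforced in app/auth/scheduler_token.py.
SCHEDULER_TOKEN_MIN_LENGTH = 32


def generate_scheduler_secret() -> str:
    """Return 32 random bytes (256 bits) encoded as 64 hex characters."""
    return secrets.token_hex(32)


if __name__ == "__main__":
    # Paste the printed line into .env, then restart app and scheduler.
    print(f"SCHEDULER_API_TOKEN={generate_scheduler_secret()}")
```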

* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127

Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.
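
The failure mode is ordinary f-string formatting of None. A minimal sketch,
assuming the resource column is built as 'marketplace:<id>' (the helper name
and the exact format are hypothetical):

```python
from typing import Optional

BULK_SENTINEL = "_all"


def audit_resource(marketplace_id: Optional[str]) -> str:
    """Build the audit_log resource value, substituting the sentinel for None."""
    target = marketplace_id if marketplace_id is not None else BULK_SENTINEL
    return f"marketplace:{target}"
```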

* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127

Two findings from Devin re-review on commit 5fbad15:

1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
   event loop. sync_marketplaces() does blocking I/O (subprocess git
   clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
   and would freeze every concurrent request for the duration of a bulk
   sync. Switched to plain def so FastAPI auto-routes to the thread pool.

2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
   Bulk marketplace sync iterates the registry under a single lock with
   up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
   scheduler then sees a timeout, doesn't update last_run, and re-fires
   on the next 30s tick, queueing redundant work. Per-job timeout
   override added to the JOBS tuple; marketplaces gets 900s (15 min),
   data-refresh keeps 120s, health-check 30s.
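
The per-job override could look roughly like the sketch below. The Job type,
its field names, and the lookup helper are assumptions for illustration; only
the job names and timeout values come from this commit. Note that 900s
matches three repos each hitting the 300s git timeout.

```python
from typing import NamedTuple, Optional

DEFAULT_TIMEOUT_SEC = 120.0


class Job(NamedTuple):
    name: str
    # None means "use DEFAULT_TIMEOUT_SEC".
    timeout_sec: Optional[float] = None


JOBS = (
    Job("data-refresh"),         # keeps the 120s default
    Job("marketplaces", 900.0),  # bulk sync: up to 300s per repo
    Job("health-check", 30.0),
)


def timeout_for(job_name: str) -> float:
    """Return the HTTP timeout for a job, falling back to the default."""
    for job in JOBS:
        if job.name == job_name:
            if job.timeout_sec is not None:
                return job.timeout_sec
            return DEFAULT_TIMEOUT_SEC
    raise KeyError(job_name)
```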

* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127

require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.

Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.
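
The gap and the fix can be reproduced in miniature. This is a simplified
sketch, not the real require_session_token: verify_token is replaced by a
lookup table of hypothetical payloads, and the secret is a placeholder.

```python
import hmac
from typing import Optional

# Hypothetical stand-ins; the real verifier decodes signed JWTs.
SCHEDULER_SECRET = "s" * 64
FAKE_JWT_PAYLOADS = {
    "session-jwt": {"typ": "session"},
    "pat-jwt": {"typ": "pat"},
}


def verify_token(token: str) -> Optional[dict]:
    """Return the decoded payload, or None for anything that is not a JWT."""
    return FAKE_JWT_PAYLOADS.get(token)


def is_scheduler_token(token: str) -> bool:
    return hmac.compare_digest(token.encode(), SCHEDULER_SECRET.encode())


def passes_pre_fix(token: str) -> bool:
    # An opaque secret yields None, so payload is {} and the check passes.
    payload = verify_token(token) or {}
    return payload.get("typ") != "pat"


def passes_post_fix(token: str) -> bool:
    # Reject the shared secret before looking at JWT claims.
    if is_scheduler_token(token):
        return False
    payload = verify_token(token) or {}
    return payload.get("typ") != "pat"
```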

Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392
also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR
and I agree; tracking separately.

* release(0.17.0): cut + clean up CHANGELOG duplicates

Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).

Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).

Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0] — the canonical short summaries (with GitHub-release
links) already exist below 0.16.0, the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
2026-04-29 11:44:00 +02:00


"""Shared-secret auth path for the in-cluster scheduler service.

The scheduler container ships every cron tick to the FastAPI app over HTTP
(see ``services.scheduler.__main__``). It needs a long-lived credential to
authenticate itself, but minting a real PAT for it requires a logged-in
session: chicken-and-egg at first boot.

The pragmatic solution: both the ``app`` and ``scheduler`` containers source
the same ``.env`` (via Docker Compose ``env_file: .env``). The
``infra/modules/customer-instance/startup-script.sh.tpl`` generates a random
``SCHEDULER_API_TOKEN`` once at VM provisioning and writes it there. When a
caller presents that exact secret as ``Authorization: Bearer <secret>``, the
app loads (or seeds on demand) a synthetic ``scheduler@system.local`` user
that is a member of the ``Admin`` system group, so existing RBAC paths
continue to work without special-casing.

Constraints on the secret (enforced here, not parsed):

- Empty / unset → this auth path is **disabled**. Production deploys should
  set it; dev / LOCAL_DEV_MODE typically doesn't, since the scheduler
  rides the dev-bypass instead.
- Length < 32 → treated as misconfiguration and disabled. Prevents an
  operator typo that sets ``SCHEDULER_API_TOKEN=todo`` from accidentally
  granting admin to a 4-character bearer.
- Comparison uses :func:`hmac.compare_digest`, which is constant-time in the
  compared contents, so a remote caller cannot recover the secret character
  by character from response timing. (The secret's length may still leak.)

Audit: every action by this user is attributed to ``scheduler@system.local``,
visible in ``audit_log`` as a normal admin actor. Rotating the secret is
``edit .env → docker compose restart app scheduler``; no DB write needed.
"""

from __future__ import annotations

import hmac
import logging
import os
import uuid
from typing import Optional

import duckdb

logger = logging.getLogger(__name__)

# Identity of the synthetic user that backs the shared-secret auth path.
# Kept stable so audit-log entries from the scheduler are easy to filter.
SCHEDULER_USER_EMAIL = "scheduler@system.local"
SCHEDULER_USER_NAME = "Scheduler"

# Floor on the secret length, measured in characters. A 256-bit secret is
# 64 characters when hex-encoded, so it clears the floor comfortably; even
# 32 hex characters carry 128 bits. Well above the brute-force frontier and
# well above any typo a human is plausibly going to make.
SCHEDULER_TOKEN_MIN_LENGTH = 32


def get_scheduler_secret() -> str:
    """Return the configured shared secret, stripped. Empty when disabled."""
    return os.environ.get("SCHEDULER_API_TOKEN", "").strip()


def is_scheduler_token(token: str) -> bool:
    """True iff ``token`` exactly matches the configured shared secret.

    Returns False when the env var is empty or shorter than the minimum
    length (auth path disabled). Uses constant-time comparison.
    """
    if not token:
        return False
    secret = get_scheduler_secret()
    if not secret or len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
        return False
    # Compare as bytes: compare_digest raises TypeError on str arguments
    # containing non-ASCII characters, which a caller could send on purpose.
    return hmac.compare_digest(token.encode("utf-8"), secret.encode("utf-8"))


def ensure_scheduler_user(conn: duckdb.DuckDBPyConnection) -> dict:
    """Idempotently provision the scheduler user + Admin group membership.

    Called both from the app's startup hook (so the user exists from the
    very first boot) and lazily from :func:`get_scheduler_user` so a token
    presented before the next restart of the app still resolves.

    Returns the user dict in the same shape ``UserRepository.get_by_email``
    yields elsewhere; the caller treats it as any other authenticated user.
    """
    from src.db import SYSTEM_ADMIN_GROUP
    from src.repositories.user_group_members import UserGroupMembersRepository
    from src.repositories.users import UserRepository

    repo = UserRepository(conn)
    user = repo.get_by_email(SCHEDULER_USER_EMAIL)
    if not user:
        user_id = str(uuid.uuid4())
        repo.create(
            id=user_id,
            email=SCHEDULER_USER_EMAIL,
            name=SCHEDULER_USER_NAME,
            role="admin",
            # No password_hash: this user authenticates via the shared
            # secret only, never via /auth/login. Keeps the bootstrap
            # check ("any user has a password?") accurate.
            password_hash=None,
        )
        user = repo.get_by_email(SCHEDULER_USER_EMAIL)
        logger.info("Seeded scheduler service user: %s", SCHEDULER_USER_EMAIL)

    admin_group = conn.execute(
        "SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
    ).fetchone()
    if admin_group:
        UserGroupMembersRepository(conn).add_member(
            user_id=user["id"],
            group_id=admin_group[0],
            source="system_seed",
            added_by="app.auth.scheduler_token:ensure_scheduler_user",
        )
    return user


def get_scheduler_user(conn: duckdb.DuckDBPyConnection) -> Optional[dict]:
    """Look up the scheduler user, seeding it on demand if absent.

    Returns None only when seeding fails, typically a malformed schema or
    an out-of-band DB error. The caller (``get_current_user``) maps None
    to a normal 401 so the failure is observable but does not crash.
    """
    from src.repositories.users import UserRepository

    user = UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL)
    if user:
        return user
    try:
        return ensure_scheduler_user(conn)
    except Exception as e:  # noqa: BLE001
        logger.error("Failed to provision scheduler user on demand: %s", e)
        return None