fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127)

* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret

Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:

1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
   from the scheduler container, racing the app's long-lived system.duckdb
   handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
   "Could not set lock on file ... PID 0".

2. The data-refresh + new marketplaces jobs both 401-ed on the API because
   SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
   The scheduler had no credential to authenticate with.

Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
  through the app process so it inherits the existing DB connection.
- Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and
  the scheduler is reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
  startup script generates a 256-bit secret on first boot, persists it
  across reboots, and writes it to /opt/agnes/.env. Both containers source
  the same .env. The app validates incoming Bearer tokens against the env
  var (constant-time, length-floored) and resolves matches to a synthetic
  scheduler@system.local user that's a member of the Admin system group.
  Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
  has a valid actor; lazy seed in get_scheduler_user covers token rotation
  before the next app restart.

Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.

Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.

* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127

Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.

* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127

Two findings from Devin re-review on commit 5fbad15:

1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
   event loop. sync_marketplaces() does blocking I/O (subprocess git
   clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
   and would freeze every concurrent request for the duration of a bulk
   sync. Switched to plain def so FastAPI auto-routes to the thread pool.

2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
   Bulk marketplace sync iterates the registry under a single lock with
   up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
   scheduler then sees a timeout, doesn't update last_run, and re-fires
   on the next 30s tick, queueing redundant work. Per-job timeout
   override added to the JOBS tuple; marketplaces gets 900s (15 min),
   data-refresh keeps 120s, health-check 30s.

* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127

require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.

Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.

Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392
also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR
and I agree; tracking separately.

* release(0.17.0): cut + clean up CHANGELOG duplicates

Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).

Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).

Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0] — the canonical short summaries (with GitHub-release
links) already exist below 0.16.0, the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
This commit is contained in:
ZdenekSrotyr 2026-04-29 11:44:00 +02:00 committed by GitHub
parent 7012966482
commit 995e4cd366
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
10 changed files with 469 additions and 109 deletions

View file

@ -10,8 +10,17 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased] ## [Unreleased]
## [0.17.0] — 2026-04-29
### Added
- **Shared-secret auth path for the in-cluster scheduler service** (`SCHEDULER_API_TOKEN`). Both the `app` and `scheduler` containers source the same `/opt/agnes/.env` via Docker Compose `env_file:`, so a 256-bit secret generated once at VM provisioning serves both sides symmetrically. The app validates incoming `Authorization: Bearer <secret>` against the env var (constant-time compare; minimum length 32 chars; rejected when env is empty) and resolves matches to a synthetic `scheduler@system.local` user that is a member of the `Admin` system group — every existing RBAC gate (`require_admin`, `require_resource_access`) works unchanged. Audit-log entries from the scheduler are attributed to this user. Rotation: edit `.env`, `docker compose restart app scheduler`. See `app/auth/scheduler_token.py` for the threat model.
- **`POST /api/marketplaces/sync-all`** — admin-only endpoint that runs `src.marketplace.sync_marketplaces()` inside the app process. Wired up so the scheduler container can drive the nightly refresh over HTTP without opening `system.duckdb` directly.
### Fixed ### Fixed
- **Scheduler `marketplaces` job 500-ed every cron tick with `IO Error: Could not set lock on file system.duckdb` after v0.12.1.** The previous implementation called `src.marketplace.sync_marketplaces()` in-process from the scheduler container, but DuckDB permits only one writer per file across processes — the scheduler raced the app's long-lived handle. Switched the job to `POST /api/marketplaces/sync-all`, making the app the sole writer; the scheduler is now a pure cron clock.
- **Scheduler `data-refresh` job 401-ed every 15 minutes** with `Missing or invalid Authorization header` because `SCHEDULER_API_TOKEN` was never propagated by `infra/modules/customer-instance/startup-script.sh.tpl`. The startup script now generates a 64-hex-char secret on first boot via `openssl rand -hex 32`, persists it across reboots by reading back from an existing `.env` (rotation requires explicit operator action — both containers must restart together), and writes it into `/opt/agnes/.env` alongside the other secrets. `app/main.py` seeds the matching synthetic user at startup so the very first cron tick has a valid actor to attribute audit-log entries to. Existing VMs need a one-time `sudo /opt/agnes/agnes-rotate-scheduler-token.sh` (or simply re-run the startup script via `terraform apply -replace='module.agnes.google_compute_instance.vm["<vm-name>"]'`); see migration note in this changelog or rerun the startup script manually.
- **Non-root container couldn't write to host-bind-mounted `/data` after the v0.12.1 USER-agnes flip.** `infra/modules/customer-instance/startup-script.sh.tpl` now `chown -R 999:999 /data` after creating the persistent-disk subdirs (`state`, `analytics`, `extracts`). Without this, a freshly-attached PD is root-owned by default and `USER agnes` (uid 999) cannot open `/data/state/system.duckdb` for write — every authed request 500s with `IOException: Cannot open file ... Permission denied` while `/api/health` (which doesn't open the system DB) keeps returning 200, masking the failure from health-only monitoring. Regression first observed on `agnes-development` on 2026-04-29 after the auto-upgrade picked up `:stable` from the 0.12.1 release. **Existing VMs with PD-backed `/data` need a one-time host-side `sudo chown -R 999:999 /var/lib/docker/volumes/agnes_data/_data && sudo docker restart agnes-app-1 agnes-scheduler-1` to recover** — Terraform `metadata_startup_script` only runs on boot, so an apply alone does not retro-fix running VMs. - **Non-root container couldn't write to host-bind-mounted `/data` after the v0.12.1 USER-agnes flip.** `infra/modules/customer-instance/startup-script.sh.tpl` now `chown -R 999:999 /data` after creating the persistent-disk subdirs (`state`, `analytics`, `extracts`). Without this, a freshly-attached PD is root-owned by default and `USER agnes` (uid 999) cannot open `/data/state/system.duckdb` for write — every authed request 500s with `IOException: Cannot open file ... Permission denied` while `/api/health` (which doesn't open the system DB) keeps returning 200, masking the failure from health-only monitoring. Regression first observed on `agnes-development` on 2026-04-29 after the auto-upgrade picked up `:stable` from the 0.12.1 release. **Existing VMs with PD-backed `/data` need a one-time host-side `sudo chown -R 999:999 /var/lib/docker/volumes/agnes_data/_data && sudo docker restart agnes-app-1 agnes-scheduler-1` to recover** — Terraform `metadata_startup_script` only runs on boot, so an apply alone does not retro-fix running VMs.
- `Dockerfile` pins the `agnes` user to `uid:gid 999:999` explicitly (`useradd --uid 999`). Previously the uid was whatever Debian's `useradd --system` assigned next — happened to be 999 today, but a future base-image change picking 998 or 1000 would silently desync from the startup-script's `chown 999:999`, reintroducing the same incident. Pinning makes the contract grep-able from both sides. - `Dockerfile` pins the `agnes` user to `uid:gid 999:999` explicitly (`useradd --uid 999`). Previously the uid was whatever Debian's `useradd --system` assigned next — happened to be 999 today, but a future base-image change picking 998 or 1000 would silently desync from the startup-script's `chown 999:999`, reintroducing the same incident. Pinning makes the contract grep-able from both sides.
- `scripts/smoke-test.sh` no longer silently SKIPs every authed check when `bootstrap` returns 403 (users exist) and `SMOKE_TOKEN` is not set — it now FAILs loudly. Also adds an unauthenticated DB-touching probe (`POST /auth/email/request`) before bootstrap, since `/api/health` deliberately doesn't open `system.duckdb` (kept cheap for LB probes) and so cannot detect filesystem/permission issues. The new probe catches the foundryai-development class of regression even on instances where bootstrap is closed. - `scripts/smoke-test.sh` no longer silently SKIPs every authed check when `bootstrap` returns 403 (users exist) and `SMOKE_TOKEN` is not set — it now FAILs loudly. Also adds an unauthenticated DB-touching probe (`POST /auth/email/request`) before bootstrap, since `/api/health` deliberately doesn't open `system.duckdb` (kept cheap for LB probes) and so cannot detect filesystem/permission issues. The new probe catches the foundryai-development class of regression even on instances where bootstrap is closed.
@ -23,56 +32,6 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
- `release.yml` adds an `e2e-bind-mount` job that boots the freshly built image against a host-bind-mounted `/data` directory (instead of the named volume the existing `smoke-test` job uses). Docker initializes a fresh named volume by copying from the image's `/data` — which the Dockerfile chowns to `agnes:agnes` before flipping USER — so the named-volume path always works. The bind-mount path mirrors what GCE VMs run via `docker-compose.host-mount.yml`, and includes a negative assertion (write must fail on root-owned `/data` before the operator chown) plus a positive assertion (smoke passes after the chown). Locks in the contract that broke `agnes-development`: removing `chown 999:999` from `startup-script.sh.tpl` or changing the Dockerfile uid pin breaks CI. - `release.yml` adds an `e2e-bind-mount` job that boots the freshly built image against a host-bind-mounted `/data` directory (instead of the named volume the existing `smoke-test` job uses). Docker initializes a fresh named volume by copying from the image's `/data` — which the Dockerfile chowns to `agnes:agnes` before flipping USER — so the named-volume path always works. The bind-mount path mirrors what GCE VMs run via `docker-compose.host-mount.yml`, and includes a negative assertion (write must fail on root-owned `/data` before the operator chown) plus a positive assertion (smoke passes after the chown). Locks in the contract that broke `agnes-development`: removing `chown 999:999` from `startup-script.sh.tpl` or changing the Dockerfile uid pin breaks CI.
## [0.15.0] — 2026-04-29
### Added
- **Corporate-memory v1 + v1.5 — confidence, contradictions, audience distribution, and rule sync.** Issue #72.
- Schema v15: `knowledge_items` gains context-engineering columns (`confidence`, `domain`, `entities`, `source_type`, `source_ref`, `valid_from`, `valid_until`, `supersedes`, `sensitivity`, `is_personal`); new `knowledge_contradictions` table for surfacing conflicting facts and `session_extraction_state` for the verification detector's idempotent resume.
- Schema v16: `verification_evidence` table — one row per analyst confirmation, indexed on `item_id`, drives the `confidence` calculation in `services/corporate_memory/confidence.py` (linear-decay with floor + additional-verifier boost; configurable via `instance.yaml`).
- **Server**: `GET /api/memory/bundle` returns mandatory + ranked-approved items within a token budget (default 6000 ≈ 24KB) — drives `da sync`'s rule write. `GET /api/memory/stats` now uses SQL aggregation (no full-list materialization, doesn't block the event loop). `POST /api/memory/admin/mandate` and `POST /api/memory/admin/batch` accept an `audience` field; the audience is matched against the caller's `user_group_members` JOIN `user_groups` on read so a user in group `finance` sees `audience='group:finance'` items, and admins see all. Verification-detector endpoint extracts knowledge candidates from session JSONL files and merges evidence into the existing item when a fact is re-asserted.
- **CLI**: `da sync` step 7 (`_fetch_and_write_rules`) calls `/api/memory/bundle`, writes `mandatory` items to `.claude/rules/km_<id>.md` (one file per item) and concatenates `approved` into `.claude/rules/km_approved.md`. Stale `km_*.md` files from a previous run are pruned. Best-effort: any HTTP/JSON failure is logged and sync continues.
- **Auto-tagging**: `services/corporate_memory/tagger.py` runs an LLM extraction pass on knowledge create/update when `ai:` is configured in `instance.yaml`. Wrapped in `asyncio.to_thread` so it doesn't block the event loop. Best-effort: missing config or LLM error → item created with no auto-tags.
- **Per-item privacy**: `is_personal` items are visible only to the contributor and platform admins (members of the `Admin` system group). The `_can_view_item` helper takes a pre-computed `is_priv` flag so list endpoints don't re-query `user_group_members` per item.
- **Audit log**: every admin action (mandate / approve / reject / revoke / mandate-batch / contradiction resolve) writes a row tagged `corporate_memory.<action>` with the affected item ids and reason field.
- `/me/debug` — self-only auth diagnostic page. Shows the logged-in user their own decoded JWT claims (no raw token), group memberships with sources and bound `external_id` when present, resource grants effective via those memberships, and a "Refetch from Google (dry-run)" button that issues a fresh `fetch_user_groups` call and reports the diff against the cached `user_group_members` snapshot without writing anything. Gated by `AGNES_DEBUG_AUTH=true` env var (default off → route returns 404 and the navbar item is not rendered). Intended for dev / staging VMs; do not enable on customer-facing instances. Issue #116.
### Changed
- **`POST /api/memory/{id}/vote` accepts `vote=0`** to retract a previous vote (toggle un-vote from the UI). Pre-fix the API rejected vote=0 with 400 and the UI's toggle logic silently no-op'd on un-toggle.
- **`/admin/corporate-memory` and `/corporate-memory/admin` are gated by `require_admin`** (admin-group membership) instead of the v9-era `require_role(Role.KM_ADMIN)`. The km_admin role was collapsed into admin in main's RBAC v13; module authors needing finer-grained corporate-memory curation should use a `resource_grants` row of type `corporate_memory_admin`.
### Internal
- `_effective_groups` in `app/api/memory.py` queries `user_group_members JOIN user_groups` instead of reading the `users.groups` JSON column (dropped in v13). The audience-distribution tests in `tests/test_memory_api.py` use a `_add_user_to_group` helper that inserts into `user_group_members` + `user_groups` directly.
- `jsonschema` added to dev dependencies for the corporate-memory schema-validation fixtures (`tests/test_corporate_memory_v1.py::TestSchemaValidation`). Production code does not import it.
## [0.14.0] — 2026-04-29
### Added
- **v2 fetch primitives — discovery + scoped fetch** for analytical workflows. Replaces the BigQuery wrap-view pattern (which caused "Response too large" on multi-hundred-million-row source views) with a Claude-session-driven toolkit. Server exposes `GET /api/v2/catalog` (RBAC-filtered table list with flavor + fetch-via hints), `GET /api/v2/schema/{table_id}` (column metadata + BQ flavor hints), `GET /api/v2/sample/{table_id}` (N sample rows), `POST /api/v2/scan` (validator + RBAC + quota + max_result_bytes guard, Arrow IPC stream), `POST /api/v2/scan/estimate` (BigQuery dryRun, no execution). CLI gains `da catalog`, `da schema`, `da describe`, `da fetch <table> --select … --where … --limit N [--estimate] [--as <name>]`, `da snapshot list/refresh/drop/prune`, `da disk-info`. WHERE-clause validator at `app/api/where_validator.py` is sqlglot-backed with structural rejects + function allow-list + column-existence enforcement. Process-local quota tracker (concurrent + daily bytes per user). New `cli/skills/agnes-data-querying.md` standalone skill + CLAUDE.md addendum tells Claude to discover → estimate → fetch a filtered subset locally → analyze. 11 new `instance.yaml` knobs (`api.scan_max_concurrent_per_user`, `api.scan_daily_bytes_per_user`, `api.scan_max_result_bytes`, `api.where_clause_max_length`, `api.catalog_cache_ttl_seconds`, `api.schema_cache_ttl_seconds`, `api.sample_cache_ttl_seconds`, `data_source.bigquery.billing_project`, `data_source.bigquery.legacy_wrap_views`, `snapshots.dir`, `snapshots.cache_size_limit_gb`). Issue #101.
- `data_source.bigquery.billing_project` config knob — explicit billing project for BQ scan + estimate. Defaults to `data_source.bigquery.project`. Matters for cross-project read patterns where the VM service account has `bigquery.data.*` on the data project but lacks `serviceusage.services.use` there; setting this to a project where the SA holds `serviceusage.services.use` fixes the dry-run 403.
### Changed
- **BREAKING**: `connectors/bigquery/extractor.py` no longer creates a wrap view (`SELECT * FROM bigquery_query(...)`) for VIEW / MATERIALIZED_VIEW entities by default. Operators relying on the old behavior must set `data_source.bigquery.legacy_wrap_views: true` in `instance.yaml` for one release cycle. BASE TABLE entities are unchanged. The new `da fetch` workflow replaces wrap views for analytical queries — see CLAUDE.md "Querying Agnes data — agent rails".
- `RegisterTableRequest.primary_key` and `UpdateTableRequest.primary_key` accept `Optional[List[str]]` for composite keys (session-grain MSA tables key on `(session_id, event_date)`, browse rows on more); a bare string remains accepted for backward compat via a `field_validator(mode="before")` that wraps it in a one-element list. Old CLI scripts posting `"primary_key": "session_id"` continue to work.
- `GET /api/v2/catalog` now caches the underlying `repo.list_all()` rows globally with the documented `api.catalog_cache_ttl_seconds` default (300s) and runs RBAC fresh per request. The previous per-user cache served stale RBAC-filtered results for up to TTL after a permission flip — `v2_schema.py` and `v2_sample.py` already had this pattern; `v2_catalog.py` now matches.
## [0.13.0] — 2026-04-29
### Added
- **Windows/PowerShell wrapper for local dev.** New `scripts/run-local-dev.ps1` mirrors `scripts/run-local-dev.sh` for operators on Windows where GNU Make / bash aren't available — same compose stack (`docker-compose.yml` + `docker-compose.dev.yml` + `docker-compose.local-dev.yml`), same `LOCAL_DEV_GROUPS` default seeding, same `up` / `down` / `logs` actions. Run `.\scripts\run-local-dev.ps1` for the fast path (reuses existing image) or `.\scripts\run-local-dev.ps1 -Build` to force `--build` after `pyproject.toml` / `Dockerfile` changes. Verified on Docker Desktop for Windows. See `docs/local-development.md`.
- **Admin server configuration editor** at `/admin/server-config` — admins can now view and edit `instance.yaml` from the web UI without SSHing into the host. Two new endpoints (`GET /api/admin/server-config` returns the current config with secret-looking values masked; `POST /api/admin/server-config` deep-merges a section patch into `DATA_DIR/state/instance.yaml`). The page lists the editable sections (`instance`, `data_source`, `email`, `telegram`, `jira`, `theme`, `server`, `auth`) and renders a per-field form. Saves touching `auth.*` or `server.*` ("danger zone" — can lock operators out) require an explicit confirmation step. Every save writes an `instance_config.update` row to `audit_log` with a per-field diff (secret values masked as `***`, field paths preserved so a rotation is recorded as `email.smtp_password: *** → ***`). Issue #91.
### Fixed
- `app/instance_config.py:load_instance_config` now deep-merges the static `CONFIG_DIR/instance.yaml` with the writable overlay at `DATA_DIR/state/instance.yaml` instead of returning the overlay verbatim when present. Pre-fix, the first save through the new server-config editor (which writes only the section the operator actually touched) caused every consumer of static-only sections (corporate memory, dataset list, OpenMetadata client) to fall through to empty defaults until the overlay was deleted. Issue #91.
- `POST /api/admin/configure` now uses the same narrow-overlay write strategy as the new server-config editor: it reads the overlay verbatim (no static fallback), patches only `instance` / `auth` / `data_source`, and writes atomically via tmp + `os.replace`. Pre-fix it seeded `existing` from the env-resolved merged config when no overlay file was present and dumped the whole thing back, persisting cleartext `${ENV_VAR}` values (e.g. `smtp_password`) into the writable overlay even though the wizard never touched those sections. Issue #91.
- `POST /api/admin/server-config` now strips redaction sentinels (`***` / `<empty>`) out of every secret-keyed leaf in the incoming patch before the deep-merge. The companion GET endpoint masks secret-keyed children inside nested objects (e.g. `data_source.keboola.token_env`), and the form renders those nested objects as JSON textareas — without the scrub, a no-op save would round-trip the masked JSON back and overwrite the real overlay value (`token_env: "KEBOOLA_STORAGE_TOKEN"` → `"***"`), silently breaking the next sync. Defense-in-depth on both sides: the client form scrubs before posting, and the server scrubs before merge so an API caller (CLI / script) can't corrupt secrets either. Issue #91.
## [0.16.0] — 2026-04-29 ## [0.16.0] — 2026-04-29
Minor release. Comprehensive deploy safety audit — CI/CD pipeline hardening, 50+ new tests covering previously untested failure modes, DB schema health check, config versioning, and BigQuery ATTACH error resilience. Built on top of v0.15.0 / `2e1dfb7`. Minor release. Comprehensive deploy safety audit — CI/CD pipeline hardening, 50+ new tests covering previously untested failure modes, DB schema health check, config versioning, and BigQuery ATTACH error resilience. Built on top of v0.15.0 / `2e1dfb7`.

View file

@ -257,7 +257,7 @@ Admin-managed git repos cloned nightly to `${DATA_DIR}/marketplaces/<slug>/`
so FastAPI can read their contents from disk. so FastAPI can read their contents from disk.
- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`. - Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
- Scheduler calls `src.marketplace.sync_marketplaces()` in-process at `daily 03:00` UTC — no HTTP round-trip to the main app. - Scheduler calls `POST /api/marketplaces/sync-all` (admin-only, authed via `SCHEDULER_API_TOKEN`) at `daily 03:00` UTC. Routing through HTTP keeps the app the sole writer to `system.duckdb` — the previous in-process call from the scheduler container raced the app's long-lived DB handle and 500-ed on `Could not set lock on file`.
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`. - Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret. - PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
- Registry lives in DuckDB table `marketplace_registry` (schema v9). - Registry lives in DuckDB table `marketplace_registry` (schema v9).

View file

@ -22,6 +22,7 @@ from src.marketplace import (
MarketplaceNotFound, MarketplaceNotFound,
delete_marketplace_dir, delete_marketplace_dir,
is_valid_slug, is_valid_slug,
sync_marketplaces,
sync_one, sync_one,
) )
from src.repositories.audit import AuditRepository from src.repositories.audit import AuditRepository
@ -409,3 +410,46 @@ async def trigger_sync(
{"commit": result["commit"], "action": result["action"]}, {"commit": result["commit"], "action": result["action"]},
) )
return result return result
@router.post("/sync-all")
def trigger_sync_all(
user: dict = Depends(require_admin),
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
"""Sync every registered marketplace.
Wired up so the scheduler service can drive the nightly refresh over
HTTP. The previous implementation called ``src.marketplace.sync_marketplaces``
in-process from the scheduler container, which conflicted with the app's
long-lived ``system.duckdb`` handle (DuckDB allows only one writer per
file across processes). Routing through the app inherits the existing
connection without contention.
Declared ``def`` (not ``async def``) so FastAPI runs it in a thread
pool :func:`sync_marketplaces` does blocking I/O (subprocess git
clones with ``GIT_TIMEOUT_SEC=300`` per repo, DuckDB writes, a
process-wide threading.Lock) and would freeze the event loop for the
duration of a bulk sync if it ran on the asyncio thread. Health
checks, login redirects, and every other concurrent request keep
serving while the bulk sync churns through the registry.
One audit row per call summarises the outcome per-marketplace details
live in ``marketplace_registry`` and the per-call result payload below.
"""
result = sync_marketplaces()
# _audit appends "marketplace:" to the target id when writing the
# resource column. "_all" produces "marketplace:_all" — a stable,
# greppable sentinel for bulk-sync rows; the real per-marketplace
# commit/error breakdown is in the params payload.
_audit(
conn,
user["id"],
"marketplace.sync_all",
"_all",
{
"synced": [r.get("id") for r in result.get("synced", [])],
"errors": [{"id": e.get("id"), "error": e.get("error")} for e in result.get("errors", [])],
},
)
return result

View file

@ -179,6 +179,21 @@ async def get_current_user(
detail="Missing or invalid Authorization header", detail="Missing or invalid Authorization header",
) )
# Shared-secret path for the in-cluster scheduler. Checked before
# pat_resolver because the scheduler token is not a JWT — feeding it to
# verify_token() would log a spurious decode warning every cron tick.
# See app/auth/scheduler_token.py for the threat model.
from app.auth.scheduler_token import get_scheduler_user, is_scheduler_token
if is_scheduler_token(token):
scheduler_user = get_scheduler_user(conn)
if scheduler_user:
_attach_admin_flag(scheduler_user, conn)
return scheduler_user
raise HTTPException(
status_code=status.HTTP_401_UNAUTHORIZED,
detail="Scheduler user not provisioned",
)
from app.auth.pat_resolver import resolve_token_to_user from app.auth.pat_resolver import resolve_token_to_user
user, reason = resolve_token_to_user(conn, token, request) user, reason = resolve_token_to_user(conn, token, request)
if user: if user:
@ -226,8 +241,21 @@ async def get_optional_user(
async def require_session_token(request: Request, user: dict = Depends(get_current_user)) -> dict: async def require_session_token(request: Request, user: dict = Depends(get_current_user)) -> dict:
"""Like get_current_user but rejects PAT — for endpoints that must not """Like get_current_user but rejects every non-interactive token kind —
be callable via a long-lived CI token (e.g. creating new tokens, changing password).""" for endpoints that must not be callable via a long-lived service or CI
credential (e.g. creating new tokens, changing password).
Two non-interactive paths exist today:
1. **PAT** JWT with ``typ="pat"``. Detected by decoding the JWT and
inspecting the claim.
2. **Scheduler shared secret** opaque string equal to
``SCHEDULER_API_TOKEN``. Not a JWT, so ``verify_token`` returns None
and the PAT-claim check would silently pass meaning a caller
holding the scheduler secret could mint persistent PATs through
``POST /auth/tokens`` that survive a secret rotation. Explicit
check here closes that bypass.
"""
auth = request.headers.get("authorization", "") auth = request.headers.get("authorization", "")
token = None token = None
if auth.startswith("Bearer "): if auth.startswith("Bearer "):
@ -235,6 +263,12 @@ async def require_session_token(request: Request, user: dict = Depends(get_curre
if not token and request: if not token and request:
token = request.cookies.get("access_token") token = request.cookies.get("access_token")
if token: if token:
from app.auth.scheduler_token import is_scheduler_token
if is_scheduler_token(token):
raise HTTPException(
status_code=status.HTTP_403_FORBIDDEN,
detail="This endpoint requires an interactive session, not a service token",
)
from app.auth.jwt import verify_token from app.auth.jwt import verify_token
payload = verify_token(token) or {} payload = verify_token(token) or {}
if payload.get("typ") == "pat": if payload.get("typ") == "pat":

136
app/auth/scheduler_token.py Normal file
View file

@ -0,0 +1,136 @@
"""Shared-secret auth path for the in-cluster scheduler service.
The scheduler container ships every cron tick to the FastAPI app over HTTP
(see ``services.scheduler.__main__``). It needs a long-lived credential to
authenticate itself, but minting a real PAT for it requires a logged-in
session chicken-and-egg at first boot.
The pragmatic solution: both the ``app`` and ``scheduler`` containers source
the same ``.env`` (via Docker Compose ``env_file: .env``). The
``infra/modules/customer-instance/startup-script.sh.tpl`` generates a random
``SCHEDULER_API_TOKEN`` once at VM provisioning and writes it there. When a
caller presents that exact secret as ``Authorization: Bearer <secret>``, the
app loads (or seeds on demand) a synthetic ``scheduler@system.local`` user
that is a member of the ``Admin`` system group so existing RBAC paths
continue to work without special-casing.
Constraints on the secret (enforced here, not parsed):
- Empty / unset this auth path is **disabled**. Production deploys should
set it; dev / LOCAL_DEV_MODE typically doesn't, since the scheduler
rides the dev-bypass instead.
- Length < 32 treated as misconfiguration and disabled. Prevents an
operator typo that sets ``SCHEDULER_API_TOKEN=todo`` from accidentally
granting admin to a 4-character bearer.
- Comparison uses :func:`hmac.compare_digest` constant-time so a remote
caller cannot mount a length-discrimination timing attack.
Audit: every action by this user is attributed to ``scheduler@system.local``,
visible in ``audit_log`` as a normal admin actor. Rotating the secret is
``edit .env docker compose restart app scheduler``; no DB write needed.
"""
from __future__ import annotations
import hmac
import logging
import os
import uuid
from typing import Optional
import duckdb
logger = logging.getLogger(__name__)
# Identity of the synthetic user that backs the shared-secret auth path.
# Kept stable so audit-log entries from the scheduler are easy to filter.
SCHEDULER_USER_EMAIL = "scheduler@system.local"
SCHEDULER_USER_NAME = "Scheduler"
# Floor on the secret length. 32 bytes ≈ 256 bits of entropy if generated
# from /dev/urandom; well above the brute-force frontier and well above any
# typo a human is plausibly going to make.
SCHEDULER_TOKEN_MIN_LENGTH = 32
def get_scheduler_secret() -> str:
"""Return the configured shared secret, stripped. Empty when disabled."""
return os.environ.get("SCHEDULER_API_TOKEN", "").strip()
def is_scheduler_token(token: str) -> bool:
"""True iff ``token`` exactly matches the configured shared secret.
Returns False when the env var is empty or shorter than the minimum
length (auth path disabled). Uses constant-time comparison.
"""
if not token:
return False
secret = get_scheduler_secret()
if not secret or len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
return False
return hmac.compare_digest(token, secret)
def ensure_scheduler_user(conn: duckdb.DuckDBPyConnection) -> dict:
"""Idempotently provision the scheduler user + Admin group membership.
Called both from the app's startup hook (so the user exists from the
very first boot) and lazily from :func:`get_scheduler_user` so a token
presented before the next restart of the app still resolves.
Returns the user dict in the same shape ``UserRepository.get_by_email``
yields elsewhere the caller treats it as any other authenticated user.
"""
from src.db import SYSTEM_ADMIN_GROUP
from src.repositories.user_group_members import UserGroupMembersRepository
from src.repositories.users import UserRepository
repo = UserRepository(conn)
user = repo.get_by_email(SCHEDULER_USER_EMAIL)
if not user:
user_id = str(uuid.uuid4())
repo.create(
id=user_id,
email=SCHEDULER_USER_EMAIL,
name=SCHEDULER_USER_NAME,
role="admin",
# No password_hash — this user authenticates via the shared
# secret only, never via /auth/login. Keeps the bootstrap
# check ("any user has a password?") accurate.
password_hash=None,
)
user = repo.get_by_email(SCHEDULER_USER_EMAIL)
logger.info("Seeded scheduler service user: %s", SCHEDULER_USER_EMAIL)
admin_group = conn.execute(
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
).fetchone()
if admin_group:
UserGroupMembersRepository(conn).add_member(
user_id=user["id"],
group_id=admin_group[0],
source="system_seed",
added_by="app.auth.scheduler_token:ensure_scheduler_user",
)
return user
def get_scheduler_user(conn: duckdb.DuckDBPyConnection) -> Optional[dict]:
"""Look up the scheduler user, seeding it on demand if absent.
Returns None only when seeding fails typically a malformed schema or
an out-of-band DB error. The caller (``get_current_user``) maps None
to a normal 401 so the failure is observable but does not crash.
"""
from src.repositories.users import UserRepository
user = UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL)
if user:
return user
try:
return ensure_scheduler_user(conn)
except Exception as e: # noqa: BLE001
logger.error("Failed to provision scheduler user on demand: %s", e)
return None

View file

@ -292,6 +292,35 @@ def create_app() -> FastAPI:
except Exception as e: except Exception as e:
logger.warning(f"Could not seed admin: {e}") logger.warning(f"Could not seed admin: {e}")
# Seed the synthetic scheduler user when SCHEDULER_API_TOKEN is configured,
# so the very first cron tick after a fresh deploy already has a valid
# actor to attribute audit-log entries to. The lazy seed in
# `app.auth.scheduler_token.get_scheduler_user` covers the case where the
# secret is rotated mid-life, but doing it here keeps startup observable.
from app.auth.scheduler_token import get_scheduler_secret
if get_scheduler_secret():
try:
from app.auth.scheduler_token import (
SCHEDULER_TOKEN_MIN_LENGTH,
ensure_scheduler_user,
)
from src.db import get_system_db
secret = get_scheduler_secret()
if len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
logger.warning(
"SCHEDULER_API_TOKEN is set but only %d chars — auth path"
" disabled (minimum %d). Generate a longer secret in .env.",
len(secret), SCHEDULER_TOKEN_MIN_LENGTH,
)
else:
conn = get_system_db()
try:
ensure_scheduler_user(conn)
finally:
conn.close()
except Exception as e:
logger.warning(f"Could not seed scheduler user: {e}")
# C8: Warn when no user has a password_hash — bootstrap endpoint is open. # C8: Warn when no user has a password_hash — bootstrap endpoint is open.
# This is intentional UX (operator can claim seed admin), but the open # This is intentional UX (operator can claim seed admin), but the open
# window should be visible in startup logs so it's not forgotten. # window should be visible in startup logs so it's not forgotten.

View file

@ -75,6 +75,25 @@ if [ "$DATA_SOURCE" = "keboola" ]; then
fi fi
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret) JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)
# SCHEDULER_API_TOKEN — shared secret between the app and scheduler containers.
# Both source the same /opt/agnes/.env via Docker Compose env_file:, so the
# scheduler's outbound bearer token always matches the app's expected value.
# See app/auth/scheduler_token.py for the auth path it unlocks.
#
# Preserve across reboots: the token is plumbed into a long-lived synthetic
# user, and rotating it forces a restart of both containers. Read back from
# an existing .env when present; mint fresh only on the first boot.
SCHEDULER_API_TOKEN=""
if [ -f "$APP_DIR/.env" ]; then
SCHEDULER_API_TOKEN=$(grep -E '^SCHEDULER_API_TOKEN=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
fi
if [ -z "$SCHEDULER_API_TOKEN" ]; then
# 64 hex chars = 256 bits of /dev/urandom entropy. Floor enforced in
# app/auth/scheduler_token.SCHEDULER_TOKEN_MIN_LENGTH is 32; 64 leaves
# headroom for a future tightening without re-provisioning every VM.
SCHEDULER_API_TOKEN=$(openssl rand -hex 32)
fi
# Optional Google OAuth credentials. If the operator has created # Optional Google OAuth credentials. If the operator has created
# google-oauth-client-{id,secret} secrets in the project's Secret Manager # google-oauth-client-{id,secret} secrets in the project's Secret Manager
# AND wired them via runtime_secrets in the calling Terraform, the VM SA can # AND wired them via runtime_secrets in the calling Terraform, the VM SA can
@ -118,6 +137,7 @@ KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
LOG_LEVEL=info LOG_LEVEL=info
DOMAIN=$DOMAIN DOMAIN=$DOMAIN
AGNES_TAG=$IMAGE_TAG AGNES_TAG=$IMAGE_TAG

View file

@ -1,6 +1,6 @@
[project] [project]
name = "agnes-the-ai-analyst" name = "agnes-the-ai-analyst"
version = "0.15.0" version = "0.17.0"
description = "Agnes — AI Data Analyst platform for AI analytical systems" description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14" requires-python = ">=3.11,<3.14"
license = "MIT" license = "MIT"

View file

@ -1,12 +1,20 @@
"""Scheduler service — replaces systemd timers. """Scheduler service — replaces systemd timers.
Lightweight sidecar that fires scheduled jobs. Two job kinds: Lightweight sidecar that fires scheduled jobs over HTTP against the main
- "http": POST/GET an endpoint on the main app (e.g. data-refresh). app. Authenticates with ``SCHEDULER_API_TOKEN`` (shared-secret synthetic
- "fn": call a Python function in-process (e.g. marketplaces sync). admin see ``app.auth.scheduler_token``); falls back to no-auth in
LOCAL_DEV_MODE.
Schedules are strings parsed by src.scheduler.is_table_due accepts Schedules are strings parsed by ``src.scheduler.is_table_due`` accepts
"every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00". "every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".
Why every job is HTTP and nothing runs in-process: the scheduler container
shares ``/data/state/system.duckdb`` with the app container, but DuckDB
permits only one writer per file across processes. An in-process call
from the scheduler raced the app's long-lived handle and 500-ed on
``Could not set lock on file``. Going through HTTP makes the app the sole
writer; the scheduler is reduced to a pure cron clock.
Usage: python -m services.scheduler Usage: python -m services.scheduler
""" """
@ -36,21 +44,17 @@ _token_warning_emitted = False
def _get_auth_token() -> str: def _get_auth_token() -> str:
"""Return the bearer token for API calls. """Return the bearer token for API calls.
Production: ``SCHEDULER_API_TOKEN`` env var carries a long-lived PAT Production: ``SCHEDULER_API_TOKEN`` is a shared secret generated by the
minted via ``/tokens`` for a service-account user with the roles the Terraform startup script and written to ``/opt/agnes/.env``. Both the
jobs need (typically ``core.admin`` for sync triggers). Set it. ``app`` and ``scheduler`` containers source the same .env via Docker
Compose ``env_file:``, so the secret is symmetric. The app validates
incoming Bearer tokens against this env var (constant-time compare in
``app.auth.scheduler_token``) and resolves matches to a synthetic
``scheduler@system.local`` user that is a member of the Admin group.
Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
string and calls the API without an ``Authorization`` header the string and calls the API without an ``Authorization`` header the
API's dev-bypass auto-authenticates the request as the dev user. API's dev-bypass auto-authenticates the request as the dev user.
The previous implementation tried to auto-fetch a token by POSTing to
``/auth/token`` with just the seed admin's email. That endpoint
requires email + password (or rejects external-auth accounts that
have no local password), so the call always 401-ed and the scheduler
log was noisy with one access-log line per cron tick. Removed in
favor of explicit configuration: either set the PAT or rely on
LOCAL_DEV_MODE.
""" """
global _token_warning_emitted global _token_warning_emitted
if SCHEDULER_API_TOKEN: if SCHEDULER_API_TOKEN:
@ -65,24 +69,23 @@ def _get_auth_token() -> str:
return "" return ""
def _marketplaces_job(): # Schedule definitions: (name, schedule_string, endpoint, method, timeout_sec).
"""Entry point for the nightly marketplaces sync. # All jobs are HTTP — see the module docstring for why nothing runs
# in-process anymore. ``daily 03:00`` for marketplaces matches the cadence
Imported lazily so the scheduler container still starts even if the # the previous in-process job used; the endpoint is admin-only and
module has an import-time issue in development a failure here only # idempotent (it iterates the registry and per-marketplace errors do not
kills one job, not the whole loop. # abort the run).
""" #
from src.marketplace import sync_marketplaces # timeout_sec: per-job override for the httpx call. Marketplaces gets a
return sync_marketplaces() # generous 15 min because the app handler iterates every registered
# marketplace under a single lock with up to 300s of git timeout per
# entry — at 120s (the default that data-refresh uses) a real-world
# Schedule definitions: (name, schedule_string, kind, target) # registry of more than 2-3 slow repos times out the scheduler call,
# kind = "http" -> target = (endpoint, method) # which then re-fires on the next 30s tick and queues a redundant sync.
# kind = "fn" -> target = callable_returning_any
JOBS = [ JOBS = [
("data-refresh", "every 15m", "http", ("/api/sync/trigger", "POST")), ("data-refresh", "every 15m", "/api/sync/trigger", "POST", 120),
("health-check", "every 5m", "http", ("/api/health", "GET")), ("health-check", "every 5m", "/api/health", "GET", 30),
("marketplaces", "daily 03:00", "fn", _marketplaces_job), ("marketplaces", "daily 03:00", "/api/marketplaces/sync-all", "POST", 900),
] ]
_running = True _running = True
@ -94,7 +97,7 @@ def _signal_handler(sig, frame):
_running = False _running = False
def _call_api(endpoint: str, method: str = "POST") -> bool: def _call_api(endpoint: str, method: str, timeout_sec: int) -> bool:
"""Call the main app API. Returns True on success.""" """Call the main app API. Returns True on success."""
url = f"{API_URL}{endpoint}" url = f"{API_URL}{endpoint}"
headers = {} headers = {}
@ -103,9 +106,9 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
headers["Authorization"] = f"Bearer {token}" headers["Authorization"] = f"Bearer {token}"
try: try:
if method == "POST": if method == "POST":
resp = httpx.post(url, headers=headers, timeout=120) resp = httpx.post(url, headers=headers, timeout=timeout_sec)
else: else:
resp = httpx.get(url, headers=headers, timeout=30) resp = httpx.get(url, headers=headers, timeout=timeout_sec)
if resp.status_code < 400: if resp.status_code < 400:
logger.info(f"Job {endpoint}: {resp.status_code}") logger.info(f"Job {endpoint}: {resp.status_code}")
return True return True
@ -117,17 +120,6 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
return False return False
def _call_fn(label: str, fn) -> bool:
"""Run an in-process callable. Returns True on success."""
try:
result = fn()
logger.info("Job %s OK: %s", label, result)
return True
except Exception as e:
logger.error("Job %s failed: %s", label, e)
return False
def run(): def run():
signal.signal(signal.SIGTERM, _signal_handler) signal.signal(signal.SIGTERM, _signal_handler)
signal.signal(signal.SIGINT, _signal_handler) signal.signal(signal.SIGINT, _signal_handler)
@ -140,18 +132,11 @@ def run():
while _running: while _running:
now_iso = datetime.now(timezone.utc).isoformat() now_iso = datetime.now(timezone.utc).isoformat()
for name, schedule, kind, target in JOBS: for name, schedule, endpoint, method, timeout_sec in JOBS:
if not is_table_due(schedule, last_run[name]): if not is_table_due(schedule, last_run[name]):
continue continue
logger.info("Running job: %s (%s)", name, schedule) logger.info("Running job: %s (%s)", name, schedule)
if kind == "http": ok = _call_api(endpoint, method, timeout_sec)
endpoint, method = target
ok = _call_api(endpoint, method)
elif kind == "fn":
ok = _call_fn(name, target)
else:
logger.error("Unknown job kind %r for %s", kind, name)
ok = False
if ok: if ok:
last_run[name] = now_iso last_run[name] = now_iso
# 30s tick is plenty: interval jobs have minute-level resolution, # 30s tick is plenty: interval jobs have minute-level resolution,

View file

@ -0,0 +1,153 @@
"""Tests for the SCHEDULER_API_TOKEN shared-secret auth path."""
import tempfile
import pytest
@pytest.fixture
def fresh_db(monkeypatch):
"""Isolated DuckDB + JWT secret per test, mirroring tests/test_pat.py."""
with tempfile.TemporaryDirectory() as tmp:
monkeypatch.setenv("DATA_DIR", tmp)
monkeypatch.setenv("TESTING", "1")
monkeypatch.setenv("JWT_SECRET_KEY", "test-jwt-secret-key-minimum-32-chars!!")
# Clean slate — clear any inherited token from the host shell.
monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
# Force pristine state — earlier tests in the same session may have
# opened the singleton; drop it so the new DATA_DIR takes effect.
from src.db import close_system_db
close_system_db()
yield tmp
close_system_db()
def test_is_scheduler_token_disabled_when_env_unset(fresh_db, monkeypatch):
"""Empty SCHEDULER_API_TOKEN must disable the auth path entirely.
A bug here would let any caller authenticate with empty Bearer "" the
constant-time compare would also be empty granting admin to anyone.
"""
from app.auth.scheduler_token import is_scheduler_token
monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
assert is_scheduler_token("") is False
assert is_scheduler_token("anything") is False
def test_is_scheduler_token_disabled_when_env_too_short(fresh_db, monkeypatch):
"""Operator typo (SCHEDULER_API_TOKEN=todo) must NOT grant admin.
The minimum length floor exists specifically to prevent a 4-char bearer
from accidentally matching a 4-char misconfigured secret.
"""
from app.auth.scheduler_token import is_scheduler_token
monkeypatch.setenv("SCHEDULER_API_TOKEN", "too-short")
assert is_scheduler_token("too-short") is False
def test_is_scheduler_token_matches_only_exact_value(fresh_db, monkeypatch):
from app.auth.scheduler_token import is_scheduler_token
secret = "x" * 64 # > min length
monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)
assert is_scheduler_token(secret) is True
assert is_scheduler_token(secret + "trailing") is False
assert is_scheduler_token(secret[:-1]) is False
assert is_scheduler_token("y" * 64) is False
def test_ensure_scheduler_user_seeds_user_and_admin_membership(fresh_db, monkeypatch):
"""First call seeds; second call is a no-op idempotent re-add."""
from app.auth.scheduler_token import (
SCHEDULER_USER_EMAIL,
ensure_scheduler_user,
)
from src.db import SYSTEM_ADMIN_GROUP, get_system_db
conn = get_system_db()
try:
user1 = ensure_scheduler_user(conn)
assert user1["email"] == SCHEDULER_USER_EMAIL
# Admin group membership exists.
admin_group = conn.execute(
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
).fetchone()
assert admin_group is not None
membership = conn.execute(
"SELECT 1 FROM user_group_members WHERE user_id = ? AND group_id = ?",
[user1["id"], admin_group[0]],
).fetchone()
assert membership is not None
# Second call — same id, no duplicate membership row.
user2 = ensure_scheduler_user(conn)
assert user2["id"] == user1["id"]
rows = conn.execute(
"SELECT COUNT(*) FROM user_group_members WHERE user_id = ? AND group_id = ?",
[user1["id"], admin_group[0]],
).fetchone()
assert rows[0] == 1
finally:
conn.close()
def test_get_scheduler_user_lazy_seeds_when_absent(fresh_db, monkeypatch):
"""First lookup with no prior seed should provision on demand.
The startup hook in app.main also seeds eagerly, but the scheduler may
present the token before main.py has finished its lifespan setup on a
cold boot get_scheduler_user must close that gap.
"""
from app.auth.scheduler_token import (
SCHEDULER_USER_EMAIL,
get_scheduler_user,
)
from src.db import get_system_db
from src.repositories.users import UserRepository
conn = get_system_db()
try:
# Confirm user does not exist before the call.
assert UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL) is None
user = get_scheduler_user(conn)
assert user is not None
assert user["email"] == SCHEDULER_USER_EMAIL
finally:
conn.close()
def test_require_session_token_rejects_scheduler_secret(fresh_db, monkeypatch):
"""The shared scheduler secret must NOT pass `require_session_token`.
/auth/tokens (PAT minting) is gated by `require_session_token`, which
historically rejected only PATs (JWTs with typ=pat). The scheduler
secret is opaque so verify_token() returns None and the PAT-claim
check would silently pass letting a compromised secret forge
persistent PATs that survive a rotation. Regression guard for the
Devin review on PR #127.
"""
import asyncio
from unittest.mock import MagicMock
from fastapi import HTTPException
from app.auth.dependencies import require_session_token
secret = "x" * 64
monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)
request = MagicMock()
request.headers = {"authorization": f"Bearer {secret}"}
request.cookies = {}
user = {"id": "scheduler-id", "email": "scheduler@system.local"}
try:
asyncio.run(require_session_token(request=request, user=user))
except HTTPException as exc:
assert exc.status_code == 403
# Detail should signal "interactive only", flavor doesn't matter.
assert "interactive" in exc.detail.lower()
else:
raise AssertionError("require_session_token must reject scheduler secret")