fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127)
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret

  Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:

  1. The marketplaces job called src.marketplace.sync_marketplaces()
     in-process from the scheduler container, racing the app's long-lived
     system.duckdb handle. DuckDB rejects cross-process writers, so every
     cron tick 500-ed on "Could not set lock on file ... PID 0".
  2. The data-refresh and new marketplaces jobs both 401-ed on the API
     because SCHEDULER_API_TOKEN was never propagated by the Terraform
     startup script. The scheduler had no credential to authenticate with.

  Fix:

  - New POST /api/marketplaces/sync-all (admin-only) drives the nightly
    refresh through the app process so it inherits the existing DB
    connection.
  - Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP
    and the scheduler is reduced to a cron clock.
  - New app/auth/scheduler_token.py adds a shared-secret auth path. The
    startup script generates a 256-bit secret on first boot, persists it
    across reboots, and writes it to /opt/agnes/.env. Both containers
    source the same .env. The app validates incoming Bearer tokens against
    the env var (constant-time, length-floored) and resolves matches to a
    synthetic scheduler@system.local user that's a member of the Admin
    system group. Audit-log entries from the scheduler are attributed to
    this user.
  - app/main.py seeds the synthetic user at startup so the first cron tick
    has a valid actor; lazy seed in get_scheduler_user covers token
    rotation before the next app restart.

  Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
  secret rejection, exact-match comparison, idempotent user seeding, and
  lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
  remain green.

  Existing VMs with .env from before this change need a one-time
  re-provisioning (re-run startup-script or rotate via openssl rand);
  documented in CHANGELOG.
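
The "generate once, persist across reboots" behaviour described for the startup script can be sketched in Python. This is an illustrative stand-in for the shell script, not its actual contents: `load_or_create_secret` and the simple `KEY=value` parsing are assumptions, and `secrets.token_hex(32)` plays the role of `openssl rand -hex 32`.

```python
import secrets
import tempfile
from pathlib import Path

KEY = "SCHEDULER_API_TOKEN"


def load_or_create_secret(env_path: Path) -> str:
    """Return the secret stored in env_path, generating it on first call."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    for line in lines:
        name, sep, value = line.partition("=")
        if sep and name.strip() == KEY and value.strip():
            return value.strip()  # already provisioned: reuse it unchanged
    # First boot (or an empty placeholder): 32 random bytes -> 64 hex chars.
    secret = secrets.token_hex(32)
    # Keep every other variable, drop any empty KEY= line, append the new one.
    kept = [l for l in lines if l.partition("=")[0].strip() != KEY]
    kept.append(f"{KEY}={secret}")
    env_path.write_text("\n".join(kept) + "\n")
    return secret


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_file = Path(tmp) / ".env"
        first = load_or_create_secret(env_file)
        second = load_or_create_secret(env_file)
        print(len(first), first == second)  # 64 True
```

Reusing the stored value is what makes rotation an explicit operator action: nothing changes unless someone edits or removes the line.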
* fix(audit): use '_all' sentinel for bulk marketplace sync (Devin review #127)

  Avoids the literal string 'marketplace:None' in the audit_log resource
  column when the bulk sync endpoint writes its summary row.

* fix(scheduler): unblock event loop + per-job timeouts (Devin review #127)

  Two findings from Devin re-review on commit 5fbad15:

  1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
     event loop. sync_marketplaces() does blocking I/O (subprocess git
     clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
     and would freeze every concurrent request for the duration of a bulk
     sync. Switched to plain def so FastAPI auto-routes to the thread pool.
  2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
     Bulk marketplace sync iterates the registry under a single lock with
     up to 300s per repo, which easily exceeds 120s on 2-3 slow repos. The
     scheduler then sees a timeout, doesn't update last_run, and re-fires
     on the next 30s tick, queueing redundant work. Per-job timeout
     override added to the JOBS tuple; marketplaces gets 900s (15 min),
     data-refresh keeps 120s, health-check 30s.

* fix(auth): require_session_token rejects scheduler shared secret (Devin review #127)

  require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
  rejected JWTs with typ=pat, but the scheduler shared secret is an opaque
  string, so verify_token() returns None, payload becomes {}, and the
  PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
  could mint persistent PATs that survive a secret rotation. Added explicit
  is_scheduler_token() check before the PAT-claim check; new regression
  test in tests/test_auth_scheduler_token.py.

  Devin's other note (pre-existing async def trigger_sync at
  marketplaces.py:392 also calls blocking sync_one): Devin flagged it as
  out-of-scope for this PR and I agree; tracking separately.
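
A per-job timeout override of the kind described can be sketched as follows. The `Job` tuple, `effective_timeout`, and the `data-refresh` / `health-check` paths are hypothetical; only the `sync-all` path and the three timeout values (900s, 120s, 30s) come from the commit message.

```python
from typing import NamedTuple, Optional

DEFAULT_TIMEOUT_SEC = 120.0  # the previous one-size-fits-all value


class Job(NamedTuple):
    name: str
    path: str                            # endpoint the scheduler POSTs to
    timeout_sec: Optional[float] = None  # None -> fall back to the default


JOBS = (
    Job("data-refresh", "/api/sync/trigger"),                    # hypothetical path
    Job("marketplaces", "/api/marketplaces/sync-all", 900.0),    # up to 300s per repo
    Job("health-check", "/api/health", 30.0),                    # assumed path
)


def effective_timeout(job: Job) -> float:
    """Per-job override if present, otherwise the shared default."""
    return DEFAULT_TIMEOUT_SEC if job.timeout_sec is None else job.timeout_sec


if __name__ == "__main__":
    for job in JOBS:
        print(f"{job.name}: {effective_timeout(job):.0f}s")
```

Keeping the override next to the job definition means a slow job cannot silently inherit a timeout that makes the scheduler re-fire it.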
* release(0.17.0): cut + clean up CHANGELOG duplicates

  Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint plus
  the deploy-shape fixes that landed since the last release tag).

  Bumps pyproject from 0.15.0, which also corrects the missed bump from
  PR #120 (v0.16.0 was tagged on GitHub and shipped as :stable, but
  pyproject stayed at 0.15.0, so /api/version, /cli/latest, and
  `da --version` had been under-reporting the running release).

  Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
  above [0.16.0]: the canonical short summaries (with GitHub-release links)
  already exist below 0.16.0; the long forms were leftover state from
  before those versions were cut and have been silently shadowed ever since.
parent 7012966482
commit 995e4cd366
10 changed files with 469 additions and 109 deletions
59
CHANGELOG.md
|
|
@@ -10,8 +10,17 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
## [Unreleased]
|
||||
|
||||
## [0.17.0] — 2026-04-29
|
||||
|
||||
### Added
|
||||
|
||||
- **Shared-secret auth path for the in-cluster scheduler service** (`SCHEDULER_API_TOKEN`). Both the `app` and `scheduler` containers source the same `/opt/agnes/.env` via Docker Compose `env_file:`, so a 256-bit secret generated once at VM provisioning serves both sides symmetrically. The app validates incoming `Authorization: Bearer <secret>` against the env var (constant-time compare; minimum length 32 chars; rejected when env is empty) and resolves matches to a synthetic `scheduler@system.local` user that is a member of the `Admin` system group — every existing RBAC gate (`require_admin`, `require_resource_access`) works unchanged. Audit-log entries from the scheduler are attributed to this user. Rotation: edit `.env`, `docker compose restart app scheduler`. See `app/auth/scheduler_token.py` for the threat model.
|
||||
- **`POST /api/marketplaces/sync-all`** — admin-only endpoint that runs `src.marketplace.sync_marketplaces()` inside the app process. Wired up so the scheduler container can drive the nightly refresh over HTTP without opening `system.duckdb` directly.
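
From the caller's side, the request the two entries above describe could look like this sketch. It uses only the standard library; `build_sync_all_request` and the `http://app:8000` base URL are assumptions, and the scheduler's real client may differ.

```python
import urllib.request


def build_sync_all_request(base_url: str, secret: str) -> urllib.request.Request:
    """Build (but do not send) the bulk-sync POST with the shared secret."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/marketplaces/sync-all",
        data=b"",  # empty body
        method="POST",
        headers={"Authorization": f"Bearer {secret}"},
    )


if __name__ == "__main__":
    req = build_sync_all_request("http://app:8000", "x" * 64)
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req, timeout=900)
```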
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Scheduler `marketplaces` job 500-ed every cron tick with `IO Error: Could not set lock on file system.duckdb` after v0.12.1.** The previous implementation called `src.marketplace.sync_marketplaces()` in-process from the scheduler container, but DuckDB permits only one writer per file across processes — the scheduler raced the app's long-lived handle. Switched the job to `POST /api/marketplaces/sync-all`, making the app the sole writer; the scheduler is now a pure cron clock.
|
||||
- **Scheduler `data-refresh` job 401-ed every 15 minutes** with `Missing or invalid Authorization header` because `SCHEDULER_API_TOKEN` was never propagated by `infra/modules/customer-instance/startup-script.sh.tpl`. The startup script now generates a 64-hex-char secret on first boot via `openssl rand -hex 32`, reuses the value from an existing `.env` on later boots so it persists across reboots (rotation requires explicit operator action, and both containers must restart together), and writes it into `/opt/agnes/.env` alongside the other secrets. `app/main.py` seeds the matching synthetic user at startup so the very first cron tick has a valid actor to attribute audit-log entries to. Existing VMs need a one-time `sudo /opt/agnes/agnes-rotate-scheduler-token.sh`, or a re-run of the startup script, for example via `terraform apply -replace='module.agnes.google_compute_instance.vm["<vm-name>"]'`.
|
||||
- **Non-root container couldn't write to host-bind-mounted `/data` after the v0.12.1 USER-agnes flip.** `infra/modules/customer-instance/startup-script.sh.tpl` now `chown -R 999:999 /data` after creating the persistent-disk subdirs (`state`, `analytics`, `extracts`). Without this, a freshly-attached PD is root-owned by default and `USER agnes` (uid 999) cannot open `/data/state/system.duckdb` for write — every authed request 500s with `IOException: Cannot open file ... Permission denied` while `/api/health` (which doesn't open the system DB) keeps returning 200, masking the failure from health-only monitoring. Regression first observed on `agnes-development` on 2026-04-29 after the auto-upgrade picked up `:stable` from the 0.12.1 release. **Existing VMs with PD-backed `/data` need a one-time host-side `sudo chown -R 999:999 /var/lib/docker/volumes/agnes_data/_data && sudo docker restart agnes-app-1 agnes-scheduler-1` to recover** — Terraform `metadata_startup_script` only runs on boot, so an apply alone does not retro-fix running VMs.
|
||||
- `Dockerfile` pins the `agnes` user to `uid:gid 999:999` explicitly (`useradd --uid 999`). Previously the uid was whatever Debian's `useradd --system` assigned next — happened to be 999 today, but a future base-image change picking 998 or 1000 would silently desync from the startup-script's `chown 999:999`, reintroducing the same incident. Pinning makes the contract grep-able from both sides.
|
||||
- `scripts/smoke-test.sh` no longer silently SKIPs every authed check when `bootstrap` returns 403 (users exist) and `SMOKE_TOKEN` is not set — it now FAILs loudly. Also adds an unauthenticated DB-touching probe (`POST /auth/email/request`) before bootstrap, since `/api/health` deliberately doesn't open `system.duckdb` (kept cheap for LB probes) and so cannot detect filesystem/permission issues. The new probe catches the foundryai-development class of regression even on instances where bootstrap is closed.
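
The uid pin and `chown 999:999` contract in the entries above amounts to a plain POSIX permission check. The sketch below approximates it; `can_write` is a hypothetical helper that ignores root, ACLs, and supplementary groups, so it illustrates the failure mode rather than replacing a real write probe.

```python
import os
import stat
import tempfile

AGNES_UID = AGNES_GID = 999  # uid:gid pinned in the Dockerfile per the changelog


def can_write(path: str, uid: int, gid: int) -> bool:
    """Approximate POSIX write check for uid/gid on path (owner, group, other)."""
    st = os.stat(path)
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IWUSR)
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IWGRP)
    return bool(st.st_mode & stat.S_IWOTH)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # A directory owned by someone else with mode 0700 is not writable
        # for uid 999, which is the situation on a root-owned fresh disk.
        print(can_write(tmp, os.getuid(), os.getgid()))
        print(can_write(tmp, AGNES_UID, AGNES_GID))
```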
|
||||
|
|
@@ -23,56 +32,6 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
|||
|
||||
- `release.yml` adds an `e2e-bind-mount` job that boots the freshly built image against a host-bind-mounted `/data` directory (instead of the named volume the existing `smoke-test` job uses). Docker initializes a fresh named volume by copying from the image's `/data` — which the Dockerfile chowns to `agnes:agnes` before flipping USER — so the named-volume path always works. The bind-mount path mirrors what GCE VMs run via `docker-compose.host-mount.yml`, and includes a negative assertion (write must fail on root-owned `/data` before the operator chown) plus a positive assertion (smoke passes after the chown). Locks in the contract that broke `agnes-development`: removing `chown 999:999` from `startup-script.sh.tpl` or changing the Dockerfile uid pin breaks CI.
|
||||
|
||||
## [0.15.0] — 2026-04-29
|
||||
|
||||
### Added
|
||||
|
||||
- **Corporate-memory v1 + v1.5 — confidence, contradictions, audience distribution, and rule sync.** Issue #72.
|
||||
- Schema v15: `knowledge_items` gains context-engineering columns (`confidence`, `domain`, `entities`, `source_type`, `source_ref`, `valid_from`, `valid_until`, `supersedes`, `sensitivity`, `is_personal`); new `knowledge_contradictions` table for surfacing conflicting facts and `session_extraction_state` for the verification detector's idempotent resume.
|
||||
- Schema v16: `verification_evidence` table — one row per analyst confirmation, indexed on `item_id`, drives the `confidence` calculation in `services/corporate_memory/confidence.py` (linear-decay with floor + additional-verifier boost; configurable via `instance.yaml`).
|
||||
- **Server**: `GET /api/memory/bundle` returns mandatory + ranked-approved items within a token budget (default 6000 ≈ 24KB) — drives `da sync`'s rule write. `GET /api/memory/stats` now uses SQL aggregation (no full-list materialization, doesn't block the event loop). `POST /api/memory/admin/mandate` and `POST /api/memory/admin/batch` accept an `audience` field; the audience is matched against the caller's `user_group_members` JOIN `user_groups` on read so a user in group `finance` sees `audience='group:finance'` items, and admins see all. Verification-detector endpoint extracts knowledge candidates from session JSONL files and merges evidence into the existing item when a fact is re-asserted.
|
||||
- **CLI**: `da sync` step 7 (`_fetch_and_write_rules`) calls `/api/memory/bundle`, writes `mandatory` items to `.claude/rules/km_<id>.md` (one file per item) and concatenates `approved` into `.claude/rules/km_approved.md`. Stale `km_*.md` files from a previous run are pruned. Best-effort: any HTTP/JSON failure is logged and sync continues.
|
||||
- **Auto-tagging**: `services/corporate_memory/tagger.py` runs an LLM extraction pass on knowledge create/update when `ai:` is configured in `instance.yaml`. Wrapped in `asyncio.to_thread` so it doesn't block the event loop. Best-effort: missing config or LLM error → item created with no auto-tags.
|
||||
- **Per-item privacy**: `is_personal` items are visible only to the contributor and platform admins (members of the `Admin` system group). The `_can_view_item` helper takes a pre-computed `is_priv` flag so list endpoints don't re-query `user_group_members` per item.
|
||||
- **Audit log**: every admin action (mandate / approve / reject / revoke / mandate-batch / contradiction resolve) writes a row tagged `corporate_memory.<action>` with the affected item ids and reason field.
|
||||
- `/me/debug` — self-only auth diagnostic page. Shows the logged-in user their own decoded JWT claims (no raw token), group memberships with sources and bound `external_id` when present, resource grants effective via those memberships, and a "Refetch from Google (dry-run)" button that issues a fresh `fetch_user_groups` call and reports the diff against the cached `user_group_members` snapshot without writing anything. Gated by `AGNES_DEBUG_AUTH=true` env var (default off → route returns 404 and the navbar item is not rendered). Intended for dev / staging VMs; do not enable on customer-facing instances. Issue #116.
|
||||
|
||||
### Changed
|
||||
|
||||
- **`POST /api/memory/{id}/vote` accepts `vote=0`** to retract a previous vote (toggle un-vote from the UI). Pre-fix the API rejected vote=0 with 400 and the UI's toggle logic silently no-op'd on un-toggle.
|
||||
- **`/admin/corporate-memory` and `/corporate-memory/admin` are gated by `require_admin`** (admin-group membership) instead of the v9-era `require_role(Role.KM_ADMIN)`. The km_admin role was collapsed into admin in main's RBAC v13; module authors needing finer-grained corporate-memory curation should use a `resource_grants` row of type `corporate_memory_admin`.
|
||||
|
||||
### Internal
|
||||
|
||||
- `_effective_groups` in `app/api/memory.py` queries `user_group_members JOIN user_groups` instead of reading the `users.groups` JSON column (dropped in v13). The audience-distribution tests in `tests/test_memory_api.py` use a `_add_user_to_group` helper that inserts into `user_group_members` + `user_groups` directly.
|
||||
- `jsonschema` added to dev dependencies for the corporate-memory schema-validation fixtures (`tests/test_corporate_memory_v1.py::TestSchemaValidation`). Production code does not import it.
|
||||
|
||||
## [0.14.0] — 2026-04-29
|
||||
|
||||
### Added
|
||||
|
||||
- **v2 fetch primitives — discovery + scoped fetch** for analytical workflows. Replaces the BigQuery wrap-view pattern (which caused "Response too large" on multi-hundred-million-row source views) with a Claude-session-driven toolkit. Server exposes `GET /api/v2/catalog` (RBAC-filtered table list with flavor + fetch-via hints), `GET /api/v2/schema/{table_id}` (column metadata + BQ flavor hints), `GET /api/v2/sample/{table_id}` (N sample rows), `POST /api/v2/scan` (validator + RBAC + quota + max_result_bytes guard, Arrow IPC stream), `POST /api/v2/scan/estimate` (BigQuery dryRun, no execution). CLI gains `da catalog`, `da schema`, `da describe`, `da fetch <table> --select … --where … --limit N [--estimate] [--as <name>]`, `da snapshot list/refresh/drop/prune`, `da disk-info`. WHERE-clause validator at `app/api/where_validator.py` is sqlglot-backed with structural rejects + function allow-list + column-existence enforcement. Process-local quota tracker (concurrent + daily bytes per user). New `cli/skills/agnes-data-querying.md` standalone skill + CLAUDE.md addendum tells Claude to discover → estimate → fetch a filtered subset locally → analyze. 11 new `instance.yaml` knobs (`api.scan_max_concurrent_per_user`, `api.scan_daily_bytes_per_user`, `api.scan_max_result_bytes`, `api.where_clause_max_length`, `api.catalog_cache_ttl_seconds`, `api.schema_cache_ttl_seconds`, `api.sample_cache_ttl_seconds`, `data_source.bigquery.billing_project`, `data_source.bigquery.legacy_wrap_views`, `snapshots.dir`, `snapshots.cache_size_limit_gb`). Issue #101.
|
||||
- `data_source.bigquery.billing_project` config knob — explicit billing project for BQ scan + estimate. Defaults to `data_source.bigquery.project`. Matters for cross-project read patterns where the VM service account has `bigquery.data.*` on the data project but lacks `serviceusage.services.use` there; setting this to a project where the SA holds `serviceusage.services.use` fixes the dry-run 403.
|
||||
|
||||
### Changed
|
||||
|
||||
- **BREAKING**: `connectors/bigquery/extractor.py` no longer creates a wrap view (`SELECT * FROM bigquery_query(...)`) for VIEW / MATERIALIZED_VIEW entities by default. Operators relying on the old behavior must set `data_source.bigquery.legacy_wrap_views: true` in `instance.yaml` for one release cycle. BASE TABLE entities are unchanged. The new `da fetch` workflow replaces wrap views for analytical queries — see CLAUDE.md "Querying Agnes data — agent rails".
|
||||
- `RegisterTableRequest.primary_key` and `UpdateTableRequest.primary_key` accept `Optional[List[str]]` for composite keys (session-grain MSA tables key on `(session_id, event_date)`, browse rows on more); a bare string remains accepted for backward compat via a `field_validator(mode="before")` that wraps it in a one-element list. Old CLI scripts posting `"primary_key": "session_id"` continue to work.
|
||||
- `GET /api/v2/catalog` now caches the underlying `repo.list_all()` rows globally with the documented `api.catalog_cache_ttl_seconds` default (300s) and runs RBAC fresh per request. The previous per-user cache served stale RBAC-filtered results for up to TTL after a permission flip — `v2_schema.py` and `v2_sample.py` already had this pattern; `v2_catalog.py` now matches.
|
||||
|
||||
## [0.13.0] — 2026-04-29
|
||||
|
||||
### Added
|
||||
|
||||
- **Windows/PowerShell wrapper for local dev.** New `scripts/run-local-dev.ps1` mirrors `scripts/run-local-dev.sh` for operators on Windows where GNU Make / bash aren't available — same compose stack (`docker-compose.yml` + `docker-compose.dev.yml` + `docker-compose.local-dev.yml`), same `LOCAL_DEV_GROUPS` default seeding, same `up` / `down` / `logs` actions. Run `.\scripts\run-local-dev.ps1` for the fast path (reuses existing image) or `.\scripts\run-local-dev.ps1 -Build` to force `--build` after `pyproject.toml` / `Dockerfile` changes. Verified on Docker Desktop for Windows. See `docs/local-development.md`.
|
||||
- **Admin server configuration editor** at `/admin/server-config` — admins can now view and edit `instance.yaml` from the web UI without SSHing into the host. Two new endpoints (`GET /api/admin/server-config` returns the current config with secret-looking values masked; `POST /api/admin/server-config` deep-merges a section patch into `DATA_DIR/state/instance.yaml`). The page lists the editable sections (`instance`, `data_source`, `email`, `telegram`, `jira`, `theme`, `server`, `auth`) and renders a per-field form. Saves touching `auth.*` or `server.*` ("danger zone" — can lock operators out) require an explicit confirmation step. Every save writes an `instance_config.update` row to `audit_log` with a per-field diff (secret values masked as `***`, field paths preserved so a rotation is recorded as `email.smtp_password: *** → ***`). Issue #91.
|
||||
|
||||
### Fixed
|
||||
|
||||
- `app/instance_config.py:load_instance_config` now deep-merges the static `CONFIG_DIR/instance.yaml` with the writable overlay at `DATA_DIR/state/instance.yaml` instead of returning the overlay verbatim when present. Pre-fix, the first save through the new server-config editor (which writes only the section the operator actually touched) caused every consumer of static-only sections (corporate memory, dataset list, OpenMetadata client) to fall through to empty defaults until the overlay was deleted. Issue #91.
|
||||
- `POST /api/admin/configure` now uses the same narrow-overlay write strategy as the new server-config editor: it reads the overlay verbatim (no static fallback), patches only `instance` / `auth` / `data_source`, and writes atomically via tmp + `os.replace`. Pre-fix it seeded `existing` from the env-resolved merged config when no overlay file was present and dumped the whole thing back, persisting cleartext `${ENV_VAR}` values (e.g. `smtp_password`) into the writable overlay even though the wizard never touched those sections. Issue #91.
|
||||
- `POST /api/admin/server-config` now strips redaction sentinels (`***` / `<empty>`) out of every secret-keyed leaf in the incoming patch before the deep-merge. The companion GET endpoint masks secret-keyed children inside nested objects (e.g. `data_source.keboola.token_env`), and the form renders those nested objects as JSON textareas — without the scrub, a no-op save would round-trip the masked JSON back and overwrite the real overlay value (`token_env: "KEBOOLA_STORAGE_TOKEN"` → `"***"`), silently breaking the next sync. Defense-in-depth on both sides: the client form scrubs before posting, and the server scrubs before merge so an API caller (CLI / script) can't corrupt secrets either. Issue #91.
|
||||
|
||||
## [0.16.0] — 2026-04-29
|
||||
|
||||
Minor release. Comprehensive deploy safety audit — CI/CD pipeline hardening, 50+ new tests covering previously untested failure modes, DB schema health check, config versioning, and BigQuery ATTACH error resilience. Built on top of v0.15.0 / `2e1dfb7`.
|
||||
|
|
|
|||
|
|
@@ -257,7 +257,7 @@ Admin-managed git repos cloned nightly to `${DATA_DIR}/marketplaces/<slug>/`
|
|||
so FastAPI can read their contents from disk.
|
||||
|
||||
- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
|
||||
-- Scheduler calls `src.marketplace.sync_marketplaces()` in-process at `daily 03:00` UTC, with no HTTP round-trip to the main app.
+- Scheduler calls `POST /api/marketplaces/sync-all` (admin-only, authed via `SCHEDULER_API_TOKEN`) at `daily 03:00` UTC. Routing through HTTP keeps the app the sole writer to `system.duckdb`; the previous in-process call from the scheduler container raced the app's long-lived DB handle and 500-ed on `Could not set lock on file`.
|
||||
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
|
||||
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
|
||||
- Registry lives in DuckDB table `marketplace_registry` (schema v9).
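
The `token_env` indirection above (the database stores only the variable name, the secret lives in the environment) can be sketched like this. Both helpers are hypothetical, and the slug normalisation (non-alphanumerics to underscores, upper-cased) is an assumption about how `<SLUG>` is derived.

```python
import os
import re
from typing import Mapping, Optional


def token_env_name(slug: str) -> str:
    """Build AGNES_MARKETPLACE_<SLUG>_TOKEN from a marketplace slug."""
    return "AGNES_MARKETPLACE_" + re.sub(r"[^A-Za-z0-9]", "_", slug).upper() + "_TOKEN"


def resolve_token(token_env: Optional[str],
                  environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Look the secret up at use time; the stored row never contains it."""
    if not token_env:
        return None  # public repo: no token configured
    return environ.get(token_env) or None


if __name__ == "__main__":
    name = token_env_name("team-skills")
    fake_env = {name: "example-token"}  # stand-in for the process environment
    print(name, resolve_token(name, fake_env) is not None)
```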
|
||||
|
|
|
|||
|
|
@@ -22,6 +22,7 @@ from src.marketplace import (
|
|||
     MarketplaceNotFound,
     delete_marketplace_dir,
     is_valid_slug,
+    sync_marketplaces,
     sync_one,
 )
 from src.repositories.audit import AuditRepository
|
||||
|
|
@@ -409,3 +410,46 @@ async def trigger_sync(
|
|||
{"commit": result["commit"], "action": result["action"]},
|
||||
)
|
||||
return result
|
||||
|
||||
|
||||
+@router.post("/sync-all")
+def trigger_sync_all(
+    user: dict = Depends(require_admin),
+    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
+):
+    """Sync every registered marketplace.
+
+    Wired up so the scheduler service can drive the nightly refresh over
+    HTTP. The previous implementation called ``src.marketplace.sync_marketplaces``
+    in-process from the scheduler container, which conflicted with the app's
+    long-lived ``system.duckdb`` handle (DuckDB allows only one writer per
+    file across processes). Routing through the app inherits the existing
+    connection without contention.
+
+    Declared ``def`` (not ``async def``) so FastAPI runs it in a thread
+    pool: :func:`sync_marketplaces` does blocking I/O (subprocess git
+    clones with ``GIT_TIMEOUT_SEC=300`` per repo, DuckDB writes, a
+    process-wide threading.Lock) and would freeze the event loop for the
+    duration of a bulk sync if it ran on the asyncio thread. Health
+    checks, login redirects, and every other concurrent request keep
+    serving while the bulk sync churns through the registry.
+
+    One audit row per call summarises the outcome; per-marketplace details
+    live in ``marketplace_registry`` and the per-call result payload below.
+    """
+    result = sync_marketplaces()
+    # _audit prepends "marketplace:" to the target id when writing the
+    # resource column. "_all" produces "marketplace:_all", a stable,
+    # greppable sentinel for bulk-sync rows; the real per-marketplace
+    # commit/error breakdown is in the params payload.
+    _audit(
+        conn,
+        user["id"],
+        "marketplace.sync_all",
+        "_all",
+        {
+            "synced": [r.get("id") for r in result.get("synced", [])],
+            "errors": [{"id": e.get("id"), "error": e.get("error")} for e in result.get("errors", [])],
+        },
+    )
+    return result
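
The event-loop reasoning in the docstring above can be reproduced without FastAPI. In this sketch `asyncio.to_thread` stands in for the thread pool FastAPI uses for plain `def` endpoints, `blocking_sync` is a stand-in for `sync_marketplaces()`, and the heartbeat plays the role of other concurrent requests; all names are illustrative.

```python
import asyncio
import time


def blocking_sync(seconds: float) -> str:
    time.sleep(seconds)  # stand-in for git clones and DuckDB writes
    return "done"


async def heartbeat(stop: asyncio.Event) -> int:
    """Count how often the event loop gets a chance to run us."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run(offload: bool) -> int:
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    await asyncio.sleep(0)  # let the heartbeat take its first tick
    if offload:
        await asyncio.to_thread(blocking_sync, 0.3)  # loop stays free
    else:
        blocking_sync(0.3)  # runs on the loop thread: nothing else progresses
    stop.set()
    return await beat


if __name__ == "__main__":
    print("ticks while blocking the loop:", asyncio.run(run(offload=False)))
    print("ticks with thread offload:   ", asyncio.run(run(offload=True)))
```

The first run reports almost no heartbeat ticks because the loop is stalled for the whole call; the second keeps ticking throughout.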
|
||||
|
|
|
|||
|
|
@@ -179,6 +179,21 @@ async def get_current_user(
|
|||
detail="Missing or invalid Authorization header",
|
||||
)
|
||||
|
||||
+    # Shared-secret path for the in-cluster scheduler. Checked before
+    # pat_resolver because the scheduler token is not a JWT; feeding it to
+    # verify_token() would log a spurious decode warning every cron tick.
+    # See app/auth/scheduler_token.py for the threat model.
+    from app.auth.scheduler_token import get_scheduler_user, is_scheduler_token
+    if is_scheduler_token(token):
+        scheduler_user = get_scheduler_user(conn)
+        if scheduler_user:
+            _attach_admin_flag(scheduler_user, conn)
+            return scheduler_user
+        raise HTTPException(
+            status_code=status.HTTP_401_UNAUTHORIZED,
+            detail="Scheduler user not provisioned",
+        )
+
     from app.auth.pat_resolver import resolve_token_to_user
     user, reason = resolve_token_to_user(conn, token, request)
     if user:
|
||||
|
|
@@ -226,8 +241,21 @@ async def get_optional_user(
|
|||
|
||||
|
||||
 async def require_session_token(request: Request, user: dict = Depends(get_current_user)) -> dict:
-    """Like get_current_user but rejects PAT, for endpoints that must not
-    be callable via a long-lived CI token (e.g. creating new tokens, changing password)."""
+    """Like get_current_user but rejects every non-interactive token kind,
+    for endpoints that must not be callable via a long-lived service or CI
+    credential (e.g. creating new tokens, changing password).
+
+    Two non-interactive paths exist today:
+
+    1. **PAT**: JWT with ``typ="pat"``. Detected by decoding the JWT and
+       inspecting the claim.
+    2. **Scheduler shared secret**: opaque string equal to
+       ``SCHEDULER_API_TOKEN``. Not a JWT, so ``verify_token`` returns None
+       and the PAT-claim check would silently pass, meaning a caller
+       holding the scheduler secret could mint persistent PATs through
+       ``POST /auth/tokens`` that survive a secret rotation. Explicit
+       check here closes that bypass.
+    """
     auth = request.headers.get("authorization", "")
     token = None
     if auth.startswith("Bearer "):
|
||||
|
|
@@ -235,6 +263,12 @@ async def require_session_token(request: Request, user: dict = Depends(get_curre
|
|||
     if not token and request:
         token = request.cookies.get("access_token")
     if token:
+        from app.auth.scheduler_token import is_scheduler_token
+        if is_scheduler_token(token):
+            raise HTTPException(
+                status_code=status.HTTP_403_FORBIDDEN,
+                detail="This endpoint requires an interactive session, not a service token",
+            )
         from app.auth.jwt import verify_token
         payload = verify_token(token) or {}
         if payload.get("typ") == "pat":
|
||||
|
|
|
|||
136
app/auth/scheduler_token.py
Normal file
|
|
@@ -0,0 +1,136 @@
|
|||
"""Shared-secret auth path for the in-cluster scheduler service.
|
||||
|
||||
The scheduler container ships every cron tick to the FastAPI app over HTTP
|
||||
(see ``services.scheduler.__main__``). It needs a long-lived credential to
|
||||
authenticate itself, but minting a real PAT for it requires a logged-in
|
||||
session — chicken-and-egg at first boot.
|
||||
|
||||
The pragmatic solution: both the ``app`` and ``scheduler`` containers source
|
||||
the same ``.env`` (via Docker Compose ``env_file: .env``). The
|
||||
``infra/modules/customer-instance/startup-script.sh.tpl`` generates a random
|
||||
``SCHEDULER_API_TOKEN`` once at VM provisioning and writes it there. When a
|
||||
caller presents that exact secret as ``Authorization: Bearer <secret>``, the
|
||||
app loads (or seeds on demand) a synthetic ``scheduler@system.local`` user
|
||||
that is a member of the ``Admin`` system group — so existing RBAC paths
|
||||
continue to work without special-casing.
|
||||
|
||||
Constraints on the secret (enforced here, not parsed):
|
||||
|
||||
- Empty / unset → this auth path is **disabled**. Production deploys should
|
||||
set it; dev / LOCAL_DEV_MODE typically doesn't, since the scheduler
|
||||
rides the dev-bypass instead.
|
||||
- Length < 32 → treated as misconfiguration and disabled. Prevents an
|
||||
operator typo that sets ``SCHEDULER_API_TOKEN=todo`` from accidentally
|
||||
granting admin to a 4-character bearer.
|
||||
- Comparison uses :func:`hmac.compare_digest`, which is constant-time in
  the secret's content, so a remote caller cannot recover the secret
  byte-by-byte from response timing (its length may still be observable).
|
||||
|
||||
Audit: every action by this user is attributed to ``scheduler@system.local``,
|
||||
visible in ``audit_log`` as a normal admin actor. Rotating the secret is
|
||||
``edit .env → docker compose restart app scheduler``; no DB write needed.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hmac
|
||||
import logging
|
||||
import os
|
||||
import uuid
|
||||
from typing import Optional
|
||||
|
||||
import duckdb
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Identity of the synthetic user that backs the shared-secret auth path.
|
||||
# Kept stable so audit-log entries from the scheduler are easy to filter.
|
||||
SCHEDULER_USER_EMAIL = "scheduler@system.local"
|
||||
SCHEDULER_USER_NAME = "Scheduler"
|
||||
|
||||
# Floor on the secret length, in characters. The startup script emits 64 hex
# characters (32 random bytes, 256 bits of entropy); even a 32-character hex
# secret carries 128 bits, well above the brute-force frontier and well above
# any typo a human is plausibly going to make.
|
||||
SCHEDULER_TOKEN_MIN_LENGTH = 32
|
||||
|
||||
|
||||
def get_scheduler_secret() -> str:
    """Return the configured shared secret, stripped. Empty when disabled."""
    return os.environ.get("SCHEDULER_API_TOKEN", "").strip()


def is_scheduler_token(token: str) -> bool:
    """True iff ``token`` exactly matches the configured shared secret.

    Returns False when the env var is empty or shorter than the minimum
    length (auth path disabled). Uses constant-time comparison.
    """
    if not token:
        return False
    secret = get_scheduler_secret()
    if not secret or len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
        return False
    # Compare as bytes: compare_digest raises TypeError on non-ASCII str
    # input, which a malformed bearer token could otherwise trigger.
    return hmac.compare_digest(token.encode("utf-8"), secret.encode("utf-8"))
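
The cases the docstring describes can be exercised with a standalone variant that takes the secret as an argument instead of reading the environment. `matches_shared_secret` is a hypothetical helper, not part of the module; it encodes both sides to bytes before comparing, since `hmac.compare_digest` only accepts `str` arguments when they are ASCII.

```python
import hmac

MIN_LENGTH = 32  # mirrors SCHEDULER_TOKEN_MIN_LENGTH


def matches_shared_secret(token: str, secret: str, min_length: int = MIN_LENGTH) -> bool:
    """Length-floored, constant-time equality check against a shared secret."""
    secret = secret.strip()
    if not token or len(secret) < min_length:
        return False  # empty or too-short secret disables the auth path
    return hmac.compare_digest(token.encode("utf-8"), secret.encode("utf-8"))


if __name__ == "__main__":
    good = "a" * 64
    print(matches_shared_secret(good, good))              # True
    print(matches_shared_secret(good, ""))                # False: path disabled
    print(matches_shared_secret("todo", "todo"))          # False: below the floor
    print(matches_shared_secret(good[:-1] + "b", good))   # False: mismatch
```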
|
||||
|
||||
|
||||
def ensure_scheduler_user(conn: duckdb.DuckDBPyConnection) -> dict:
    """Idempotently provision the scheduler user + Admin group membership.

    Called both from the app's startup hook (so the user exists from the
    very first boot) and lazily from :func:`get_scheduler_user` so a token
    presented before the next restart of the app still resolves.

    Returns the user dict in the same shape ``UserRepository.get_by_email``
    yields elsewhere; the caller treats it as any other authenticated user.
    """
    from src.db import SYSTEM_ADMIN_GROUP
    from src.repositories.user_group_members import UserGroupMembersRepository
    from src.repositories.users import UserRepository

    repo = UserRepository(conn)
    user = repo.get_by_email(SCHEDULER_USER_EMAIL)
    if not user:
        user_id = str(uuid.uuid4())
        repo.create(
            id=user_id,
            email=SCHEDULER_USER_EMAIL,
            name=SCHEDULER_USER_NAME,
            role="admin",
            # No password_hash: this user authenticates via the shared
            # secret only, never via /auth/login. Keeps the bootstrap
            # check ("any user has a password?") accurate.
            password_hash=None,
        )
        user = repo.get_by_email(SCHEDULER_USER_EMAIL)
        logger.info("Seeded scheduler service user: %s", SCHEDULER_USER_EMAIL)

    admin_group = conn.execute(
        "SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
    ).fetchone()
    if admin_group:
        UserGroupMembersRepository(conn).add_member(
            user_id=user["id"],
            group_id=admin_group[0],
            source="system_seed",
            added_by="app.auth.scheduler_token:ensure_scheduler_user",
        )

    return user
|
||||
|
||||
|
||||
def get_scheduler_user(conn: duckdb.DuckDBPyConnection) -> Optional[dict]:
|
||||
"""Look up the scheduler user, seeding it on demand if absent.
|
||||
|
||||
Returns None only when seeding fails — typically a malformed schema or
|
||||
an out-of-band DB error. The caller (``get_current_user``) maps None
|
||||
to a normal 401 so the failure is observable but does not crash.
|
||||
"""
|
||||
from src.repositories.users import UserRepository
|
||||
|
||||
user = UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL)
|
||||
if user:
|
||||
return user
|
||||
try:
|
||||
return ensure_scheduler_user(conn)
|
||||
except Exception as e: # noqa: BLE001
|
||||
logger.error("Failed to provision scheduler user on demand: %s", e)
|
||||
return None
|
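The length-floored, constant-time check in `is_scheduler_token` can be tried in isolation. The following is a minimal sketch, not the module's code: `token_matches` and `MIN_LENGTH` are hypothetical names, and the secret is passed in as an argument instead of being read from the environment. It also encodes both sides to bytes, on the assumption that a bearer token may contain non-ASCII characters, which `hmac.compare_digest` rejects with a `TypeError` when given `str` arguments.

```python
import hmac

MIN_LENGTH = 32  # hypothetical stand-in for SCHEDULER_TOKEN_MIN_LENGTH


def token_matches(token: str, secret: str, min_length: int = MIN_LENGTH) -> bool:
    """Stand-alone variant of the shared-secret check (illustrative only)."""
    secret = secret.strip()
    # An empty token, or an empty or too-short secret, disables the auth
    # path before any comparison takes place.
    if not token or len(secret) < min_length:
        return False
    # Compare as bytes: compare_digest accepts str only when it is pure ASCII.
    return hmac.compare_digest(token.encode("utf-8"), secret.encode("utf-8"))
```

With a 4-character secret such as `todo`, the function returns False even when the presented token is identical, which is the misconfiguration the floor is meant to catch.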

app/main.py

@@ -292,6 +292,35 @@ def create_app() -> FastAPI:
     except Exception as e:
         logger.warning(f"Could not seed admin: {e}")

+    # Seed the synthetic scheduler user when SCHEDULER_API_TOKEN is configured,
+    # so the very first cron tick after a fresh deploy already has a valid
+    # actor to attribute audit-log entries to. The lazy seed in
+    # `app.auth.scheduler_token.get_scheduler_user` covers the case where the
+    # secret is rotated mid-life, but doing it here keeps startup observable.
+    from app.auth.scheduler_token import get_scheduler_secret
+    if get_scheduler_secret():
+        try:
+            from app.auth.scheduler_token import (
+                SCHEDULER_TOKEN_MIN_LENGTH,
+                ensure_scheduler_user,
+            )
+            from src.db import get_system_db
+            secret = get_scheduler_secret()
+            if len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
+                logger.warning(
+                    "SCHEDULER_API_TOKEN is set but only %d chars — auth path"
+                    " disabled (minimum %d). Generate a longer secret in .env.",
+                    len(secret), SCHEDULER_TOKEN_MIN_LENGTH,
+                )
+            else:
+                conn = get_system_db()
+                try:
+                    ensure_scheduler_user(conn)
+                finally:
+                    conn.close()
+        except Exception as e:
+            logger.warning(f"Could not seed scheduler user: {e}")
+
     # C8: Warn when no user has a password_hash — bootstrap endpoint is open.
     # This is intentional UX (operator can claim seed admin), but the open
     # window should be visible in startup logs so it's not forgotten.

@@ -75,6 +75,25 @@ if [ "$DATA_SOURCE" = "keboola" ]; then
 fi
 JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)

+# SCHEDULER_API_TOKEN — shared secret between the app and scheduler containers.
+# Both source the same /opt/agnes/.env via Docker Compose env_file:, so the
+# scheduler's outbound bearer token always matches the app's expected value.
+# See app/auth/scheduler_token.py for the auth path it unlocks.
+#
+# Preserve across reboots: the token is plumbed into a long-lived synthetic
+# user, and rotating it forces a restart of both containers. Read back from
+# an existing .env when present; mint fresh only on the first boot.
+SCHEDULER_API_TOKEN=""
+if [ -f "$APP_DIR/.env" ]; then
+  SCHEDULER_API_TOKEN=$(grep -E '^SCHEDULER_API_TOKEN=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
+fi
+if [ -z "$SCHEDULER_API_TOKEN" ]; then
+  # 64 hex chars = 256 bits of /dev/urandom entropy. Floor enforced in
+  # app/auth/scheduler_token.SCHEDULER_TOKEN_MIN_LENGTH is 32; 64 leaves
+  # headroom for a future tightening without re-provisioning every VM.
+  SCHEDULER_API_TOKEN=$(openssl rand -hex 32)
+fi
+
 # Optional Google OAuth credentials. If the operator has created
 # google-oauth-client-{id,secret} secrets in the project's Secret Manager
 # AND wired them via runtime_secrets in the calling Terraform, the VM SA can

@@ -118,6 +137,7 @@ KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
 KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
 SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
 SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
+SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
 LOG_LEVEL=info
 DOMAIN=$DOMAIN
 AGNES_TAG=$IMAGE_TAG
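The read-or-mint step in the startup script above (reuse the token from an existing `.env`, generate one only when none is found) can be sketched in Python for local experimentation. This is an illustration under assumptions, not part of the commit: `read_or_mint_token` is a hypothetical helper, it uses `secrets.token_hex` in place of `openssl rand -hex 32`, and, like the shell fragment, it does not write the file itself.

```python
import secrets
import tempfile
from pathlib import Path


def read_or_mint_token(env_path: Path, key: str = "SCHEDULER_API_TOKEN") -> str:
    """Return the token stored in env_path, or a fresh 64-hex-char one."""
    if env_path.is_file():
        for line in env_path.read_text().splitlines():
            if line.startswith(f"{key}="):
                # Mirror `cut -d= -f2- | tr -d '"'`: keep everything after
                # the first '=' and drop double quotes.
                value = line.split("=", 1)[1].replace('"', "")
                if value:
                    return value
                break  # only the first match counts, like `head -1`
    return secrets.token_hex(32)  # 32 random bytes rendered as 64 hex chars


with tempfile.TemporaryDirectory() as tmp:
    env = Path(tmp) / ".env"
    first = read_or_mint_token(env)  # no file yet, so a token is minted
    env.write_text(f'SCHEDULER_API_TOKEN="{first}"\n')
    second = read_or_mint_token(env)  # file present, so the token is reused
```

Because the second call returns the stored value, a reboot that re-runs the script leaves the secret unchanged and both containers keep agreeing on it.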

@@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.15.0"
+version = "0.17.0"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"

@@ -1,12 +1,20 @@
 """Scheduler service — replaces systemd timers.

-Lightweight sidecar that fires scheduled jobs. Two job kinds:
-- "http": POST/GET an endpoint on the main app (e.g. data-refresh).
-- "fn": call a Python function in-process (e.g. marketplaces sync).
+Lightweight sidecar that fires scheduled jobs over HTTP against the main
+app. Authenticates with ``SCHEDULER_API_TOKEN`` (shared-secret synthetic
+admin — see ``app.auth.scheduler_token``); falls back to no-auth in
+LOCAL_DEV_MODE.

-Schedules are strings parsed by src.scheduler.is_table_due — accepts
+Schedules are strings parsed by ``src.scheduler.is_table_due`` — accepts
 "every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".

+Why every job is HTTP and nothing runs in-process: the scheduler container
+shares ``/data/state/system.duckdb`` with the app container, but DuckDB
+permits only one writer per file across processes. An in-process call
+from the scheduler raced the app's long-lived handle and 500-ed on
+``Could not set lock on file``. Going through HTTP makes the app the sole
+writer; the scheduler is reduced to a pure cron clock.
+
 Usage: python -m services.scheduler
 """


@@ -36,21 +44,17 @@ _token_warning_emitted = False
 def _get_auth_token() -> str:
     """Return the bearer token for API calls.

-    Production: ``SCHEDULER_API_TOKEN`` env var carries a long-lived PAT
-    minted via ``/tokens`` for a service-account user with the roles the
-    jobs need (typically ``core.admin`` for sync triggers). Set it.
+    Production: ``SCHEDULER_API_TOKEN`` is a shared secret generated by the
+    Terraform startup script and written to ``/opt/agnes/.env``. Both the
+    ``app`` and ``scheduler`` containers source the same .env via Docker
+    Compose ``env_file:``, so the secret is symmetric. The app validates
+    incoming Bearer tokens against this env var (constant-time compare in
+    ``app.auth.scheduler_token``) and resolves matches to a synthetic
+    ``scheduler@system.local`` user that is a member of the Admin group.

     Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
     string and calls the API without an ``Authorization`` header — the
     API's dev-bypass auto-authenticates the request as the dev user.
-
-    The previous implementation tried to auto-fetch a token by POSTing to
-    ``/auth/token`` with just the seed admin's email. That endpoint
-    requires email + password (or rejects external-auth accounts that
-    have no local password), so the call always 401-ed and the scheduler
-    log was noisy with one access-log line per cron tick. Removed in
-    favor of explicit configuration: either set the PAT or rely on
-    LOCAL_DEV_MODE.
     """
     global _token_warning_emitted
     if SCHEDULER_API_TOKEN:

@@ -65,24 +69,23 @@ def _get_auth_token() -> str:
     return ""


-def _marketplaces_job():
-    """Entry point for the nightly marketplaces sync.
-
-    Imported lazily so the scheduler container still starts even if the
-    module has an import-time issue in development — a failure here only
-    kills one job, not the whole loop.
-    """
-    from src.marketplace import sync_marketplaces
-    return sync_marketplaces()
-
-
-# Schedule definitions: (name, schedule_string, kind, target)
-# kind = "http" -> target = (endpoint, method)
-# kind = "fn" -> target = callable_returning_any
+# Schedule definitions: (name, schedule_string, endpoint, method, timeout_sec).
+# All jobs are HTTP — see the module docstring for why nothing runs
+# in-process anymore. ``daily 03:00`` for marketplaces matches the cadence
+# the previous in-process job used; the endpoint is admin-only and
+# idempotent (it iterates the registry and per-marketplace errors do not
+# abort the run).
+#
+# timeout_sec: per-job override for the httpx call. Marketplaces gets a
+# generous 15 min because the app handler iterates every registered
+# marketplace under a single lock with up to 300s of git timeout per
+# entry — at 120s (the default that data-refresh uses) a real-world
+# registry of more than 2-3 slow repos times out the scheduler call,
+# which then re-fires on the next 30s tick and queues a redundant sync.
 JOBS = [
-    ("data-refresh", "every 15m", "http", ("/api/sync/trigger", "POST")),
-    ("health-check", "every 5m", "http", ("/api/health", "GET")),
-    ("marketplaces", "daily 03:00", "fn", _marketplaces_job),
+    ("data-refresh", "every 15m", "/api/sync/trigger", "POST", 120),
+    ("health-check", "every 5m", "/api/health", "GET", 30),
+    ("marketplaces", "daily 03:00", "/api/marketplaces/sync-all", "POST", 900),
 ]

 _running = True

@@ -94,7 +97,7 @@ def _signal_handler(sig, frame):
     _running = False


-def _call_api(endpoint: str, method: str = "POST") -> bool:
+def _call_api(endpoint: str, method: str, timeout_sec: int) -> bool:
     """Call the main app API. Returns True on success."""
     url = f"{API_URL}{endpoint}"
     headers = {}

@@ -103,9 +106,9 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
         headers["Authorization"] = f"Bearer {token}"
     try:
         if method == "POST":
-            resp = httpx.post(url, headers=headers, timeout=120)
+            resp = httpx.post(url, headers=headers, timeout=timeout_sec)
         else:
-            resp = httpx.get(url, headers=headers, timeout=30)
+            resp = httpx.get(url, headers=headers, timeout=timeout_sec)
         if resp.status_code < 400:
             logger.info(f"Job {endpoint}: {resp.status_code}")
             return True

@@ -117,17 +120,6 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
         return False


-def _call_fn(label: str, fn) -> bool:
-    """Run an in-process callable. Returns True on success."""
-    try:
-        result = fn()
-        logger.info("Job %s OK: %s", label, result)
-        return True
-    except Exception as e:
-        logger.error("Job %s failed: %s", label, e)
-        return False
-
-
 def run():
     signal.signal(signal.SIGTERM, _signal_handler)
     signal.signal(signal.SIGINT, _signal_handler)

@@ -140,18 +132,11 @@ def run():

     while _running:
         now_iso = datetime.now(timezone.utc).isoformat()
-        for name, schedule, kind, target in JOBS:
+        for name, schedule, endpoint, method, timeout_sec in JOBS:
             if not is_table_due(schedule, last_run[name]):
                 continue
             logger.info("Running job: %s (%s)", name, schedule)
-            if kind == "http":
-                endpoint, method = target
-                ok = _call_api(endpoint, method)
-            elif kind == "fn":
-                ok = _call_fn(name, target)
-            else:
-                logger.error("Unknown job kind %r for %s", kind, name)
-                ok = False
+            ok = _call_api(endpoint, method, timeout_sec)
             if ok:
                 last_run[name] = now_iso
         # 30s tick is plenty: interval jobs have minute-level resolution,
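
The loop above advances `last_run` only when the HTTP call succeeds, which is why a call that times out fires again on the next tick. A minimal sketch of one tick makes that visible without a network. The names `Job`, `run_tick`, `is_due` and `call` are hypothetical: the last two are injected stand-ins for `is_table_due` and `_call_api`, whose real implementations are not reproduced here.

```python
from typing import Callable, NamedTuple, Optional


class Job(NamedTuple):
    name: str
    schedule: str
    endpoint: str
    method: str
    timeout_sec: int


def run_tick(
    jobs: list[Job],
    last_run: dict[str, Optional[str]],
    now_iso: str,
    is_due: Callable[[str, Optional[str]], bool],
    call: Callable[[str, str, int], bool],
) -> list[str]:
    """Fire every due job once; return the names that were attempted."""
    attempted = []
    for job in jobs:
        if not is_due(job.schedule, last_run.get(job.name)):
            continue
        attempted.append(job.name)
        # Only a successful call advances last_run, so a failed or
        # timed-out job is still due on the next tick.
        if call(job.endpoint, job.method, job.timeout_sec):
            last_run[job.name] = now_iso
    return attempted
```

Passing a `call` stub that returns False for one endpoint shows that job being attempted again on the following tick, which is the redundant-sync behaviour the 900-second marketplaces timeout is meant to avoid.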

tests/test_auth_scheduler_token.py

@@ -0,0 +1,153 @@
"""Tests for the SCHEDULER_API_TOKEN shared-secret auth path."""

import tempfile

import pytest


@pytest.fixture
def fresh_db(monkeypatch):
    """Isolated DuckDB + JWT secret per test, mirroring tests/test_pat.py."""
    with tempfile.TemporaryDirectory() as tmp:
        monkeypatch.setenv("DATA_DIR", tmp)
        monkeypatch.setenv("TESTING", "1")
        monkeypatch.setenv("JWT_SECRET_KEY", "test-jwt-secret-key-minimum-32-chars!!")
        # Clean slate — clear any inherited token from the host shell.
        monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
        # Force pristine state — earlier tests in the same session may have
        # opened the singleton; drop it so the new DATA_DIR takes effect.
        from src.db import close_system_db
        close_system_db()
        yield tmp
        close_system_db()


def test_is_scheduler_token_disabled_when_env_unset(fresh_db, monkeypatch):
    """Empty SCHEDULER_API_TOKEN must disable the auth path entirely.

    A bug here would let any caller authenticate with empty Bearer "" — the
    constant-time compare would also be empty — granting admin to anyone.
    """
    from app.auth.scheduler_token import is_scheduler_token

    monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
    assert is_scheduler_token("") is False
    assert is_scheduler_token("anything") is False


def test_is_scheduler_token_disabled_when_env_too_short(fresh_db, monkeypatch):
    """Operator typo (SCHEDULER_API_TOKEN=todo) must NOT grant admin.

    The minimum length floor exists specifically to prevent a 4-char bearer
    from accidentally matching a 4-char misconfigured secret.
    """
    from app.auth.scheduler_token import is_scheduler_token

    monkeypatch.setenv("SCHEDULER_API_TOKEN", "too-short")
    assert is_scheduler_token("too-short") is False


def test_is_scheduler_token_matches_only_exact_value(fresh_db, monkeypatch):
    from app.auth.scheduler_token import is_scheduler_token

    secret = "x" * 64  # > min length
    monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)
    assert is_scheduler_token(secret) is True
    assert is_scheduler_token(secret + "trailing") is False
    assert is_scheduler_token(secret[:-1]) is False
    assert is_scheduler_token("y" * 64) is False


def test_ensure_scheduler_user_seeds_user_and_admin_membership(fresh_db, monkeypatch):
    """First call seeds; second call is a no-op idempotent re-add."""
    from app.auth.scheduler_token import (
        SCHEDULER_USER_EMAIL,
        ensure_scheduler_user,
    )
    from src.db import SYSTEM_ADMIN_GROUP, get_system_db

    conn = get_system_db()
    try:
        user1 = ensure_scheduler_user(conn)
        assert user1["email"] == SCHEDULER_USER_EMAIL
        # Admin group membership exists.
        admin_group = conn.execute(
            "SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
        ).fetchone()
        assert admin_group is not None
        membership = conn.execute(
            "SELECT 1 FROM user_group_members WHERE user_id = ? AND group_id = ?",
            [user1["id"], admin_group[0]],
        ).fetchone()
        assert membership is not None

        # Second call — same id, no duplicate membership row.
        user2 = ensure_scheduler_user(conn)
        assert user2["id"] == user1["id"]
        rows = conn.execute(
            "SELECT COUNT(*) FROM user_group_members WHERE user_id = ? AND group_id = ?",
            [user1["id"], admin_group[0]],
        ).fetchone()
        assert rows[0] == 1
    finally:
        conn.close()


def test_get_scheduler_user_lazy_seeds_when_absent(fresh_db, monkeypatch):
    """First lookup with no prior seed should provision on demand.

    The startup hook in app.main also seeds eagerly, but the scheduler may
    present the token before main.py has finished its lifespan setup on a
    cold boot — get_scheduler_user must close that gap.
    """
    from app.auth.scheduler_token import (
        SCHEDULER_USER_EMAIL,
        get_scheduler_user,
    )
    from src.db import get_system_db
    from src.repositories.users import UserRepository

    conn = get_system_db()
    try:
        # Confirm user does not exist before the call.
        assert UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL) is None
        user = get_scheduler_user(conn)
        assert user is not None
        assert user["email"] == SCHEDULER_USER_EMAIL
    finally:
        conn.close()


def test_require_session_token_rejects_scheduler_secret(fresh_db, monkeypatch):
    """The shared scheduler secret must NOT pass `require_session_token`.

    /auth/tokens (PAT minting) is gated by `require_session_token`, which
    historically rejected only PATs (JWTs with typ=pat). The scheduler
    secret is opaque so verify_token() returns None and the PAT-claim
    check would silently pass — letting a compromised secret forge
    persistent PATs that survive a rotation. Regression guard for the
    Devin review on PR #127.
    """
    import asyncio
    from unittest.mock import MagicMock

    from fastapi import HTTPException

    from app.auth.dependencies import require_session_token

    secret = "x" * 64
    monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)

    request = MagicMock()
    request.headers = {"authorization": f"Bearer {secret}"}
    request.cookies = {}

    user = {"id": "scheduler-id", "email": "scheduler@system.local"}
    try:
        asyncio.run(require_session_token(request=request, user=user))
    except HTTPException as exc:
        assert exc.status_code == 403
        # Detail should signal "interactive only", flavor doesn't matter.
        assert "interactive" in exc.detail.lower()
    else:
        raise AssertionError("require_session_token must reject scheduler secret")