fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127)
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip: 1. The marketplaces job called src.marketplace.sync_marketplaces() in-process from the scheduler container, racing the app's long-lived system.duckdb handle. DuckDB rejects cross-process writers — every cron tick 500-ed on "Could not set lock on file ... PID 0". 2. The data-refresh + new marketplaces jobs both 401-ed on the API because SCHEDULER_API_TOKEN was never propagated by the Terraform startup script. The scheduler had no credential to authenticate with. Fix: - New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh through the app process so it inherits the existing DB connection. - Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and the scheduler is reduced to a cron clock. - New app/auth/scheduler_token.py adds a shared-secret auth path. The startup script generates a 256-bit secret on first boot, persists it across reboots, and writes it to /opt/agnes/.env. Both containers source the same .env. The app validates incoming Bearer tokens against the env var (constant-time, length-floored) and resolves matches to a synthetic scheduler@system.local user that's a member of the Admin system group. Audit-log entries from the scheduler are attributed to this user. - app/main.py seeds the synthetic user at startup so the first cron tick has a valid actor; lazy seed in get_scheduler_user covers token rotation before the next app restart. Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short secret rejection, exact-match comparison, idempotent user seeding, and lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests remain green. Existing VMs with .env from before this change need a one-time re-provisioning (re-run startup-script or rotate via openssl rand); documented in CHANGELOG. * fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127 Avoids the literal string 'marketplace:None' in the audit_log resource column when the bulk sync endpoint writes its summary row. * fix(scheduler): unblock event loop + per-job timeouts — Devin review #127 Two findings from Devin re-review on commit 5fbad15: 1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio event loop. sync_marketplaces() does blocking I/O (subprocess git clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes) and would freeze every concurrent request for the duration of a bulk sync. Switched to plain def so FastAPI auto-routes to the thread pool. 2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST. Bulk marketplace sync iterates the registry under a single lock with up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The scheduler then sees a timeout, doesn't update last_run, and re-fires on the next 30s tick, queueing redundant work. Per-job timeout override added to the JOBS tuple; marketplaces gets 900s (15 min), data-refresh keeps 120s, health-check 30s. * fix(auth): require_session_token rejects scheduler shared secret — Devin review #127 require_session_token gates /auth/tokens (PAT minting). Pre-fix it only rejected JWTs with typ=pat — but the scheduler shared secret is an opaque string, so verify_token() returns None, payload becomes {}, and the PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN could mint persistent PATs that survive a secret rotation. Added explicit is_scheduler_token() check before the PAT-claim check; new regression test in tests/test_auth_scheduler_token.py. Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392 also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR and I agree; tracking separately. * release(0.17.0): cut + clean up CHANGELOG duplicates Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint plus the deploy-shape fixes that landed since the last release tag). Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120 (v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had been under-reporting the running release). Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0 above [0.16.0] — the canonical short summaries (with GitHub-release links) already exist below 0.16.0, the long forms were leftover state from before those versions were cut and have been silently shadowed ever since.
This commit is contained in:
parent
7012966482
commit
995e4cd366
10 changed files with 469 additions and 109 deletions
59
CHANGELOG.md
59
CHANGELOG.md
|
|
@ -10,8 +10,17 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
||||||
|
|
||||||
## [Unreleased]
|
## [Unreleased]
|
||||||
|
|
||||||
|
## [0.17.0] — 2026-04-29
|
||||||
|
|
||||||
|
### Added
|
||||||
|
|
||||||
|
- **Shared-secret auth path for the in-cluster scheduler service** (`SCHEDULER_API_TOKEN`). Both the `app` and `scheduler` containers source the same `/opt/agnes/.env` via Docker Compose `env_file:`, so a 256-bit secret generated once at VM provisioning serves both sides symmetrically. The app validates incoming `Authorization: Bearer <secret>` against the env var (constant-time compare; minimum length 32 chars; rejected when env is empty) and resolves matches to a synthetic `scheduler@system.local` user that is a member of the `Admin` system group — every existing RBAC gate (`require_admin`, `require_resource_access`) works unchanged. Audit-log entries from the scheduler are attributed to this user. Rotation: edit `.env`, `docker compose restart app scheduler`. See `app/auth/scheduler_token.py` for the threat model.
|
||||||
|
- **`POST /api/marketplaces/sync-all`** — admin-only endpoint that runs `src.marketplace.sync_marketplaces()` inside the app process. Wired up so the scheduler container can drive the nightly refresh over HTTP without opening `system.duckdb` directly.
|
||||||
|
|
||||||
### Fixed
|
### Fixed
|
||||||
|
|
||||||
|
- **Scheduler `marketplaces` job 500-ed every cron tick with `IO Error: Could not set lock on file system.duckdb` after v0.12.1.** The previous implementation called `src.marketplace.sync_marketplaces()` in-process from the scheduler container, but DuckDB permits only one writer per file across processes — the scheduler raced the app's long-lived handle. Switched the job to `POST /api/marketplaces/sync-all`, making the app the sole writer; the scheduler is now a pure cron clock.
|
||||||
|
- **Scheduler `data-refresh` job 401-ed every 15 minutes** with `Missing or invalid Authorization header` because `SCHEDULER_API_TOKEN` was never propagated by `infra/modules/customer-instance/startup-script.sh.tpl`. The startup script now generates a 64-hex-char secret on first boot via `openssl rand -hex 32`, persists it across reboots by reading back from an existing `.env` (rotation requires explicit operator action — both containers must restart together), and writes it into `/opt/agnes/.env` alongside the other secrets. `app/main.py` seeds the matching synthetic user at startup so the very first cron tick has a valid actor to attribute audit-log entries to. Existing VMs need a one-time `sudo /opt/agnes/agnes-rotate-scheduler-token.sh` (or simply re-run the startup script via `terraform apply -replace='module.agnes.google_compute_instance.vm["<vm-name>"]'`); see migration note in this changelog or rerun the startup script manually.
|
||||||
- **Non-root container couldn't write to host-bind-mounted `/data` after the v0.12.1 USER-agnes flip.** `infra/modules/customer-instance/startup-script.sh.tpl` now `chown -R 999:999 /data` after creating the persistent-disk subdirs (`state`, `analytics`, `extracts`). Without this, a freshly-attached PD is root-owned by default and `USER agnes` (uid 999) cannot open `/data/state/system.duckdb` for write — every authed request 500s with `IOException: Cannot open file ... Permission denied` while `/api/health` (which doesn't open the system DB) keeps returning 200, masking the failure from health-only monitoring. Regression first observed on `agnes-development` on 2026-04-29 after the auto-upgrade picked up `:stable` from the 0.12.1 release. **Existing VMs with PD-backed `/data` need a one-time host-side `sudo chown -R 999:999 /var/lib/docker/volumes/agnes_data/_data && sudo docker restart agnes-app-1 agnes-scheduler-1` to recover** — Terraform `metadata_startup_script` only runs on boot, so an apply alone does not retro-fix running VMs.
|
- **Non-root container couldn't write to host-bind-mounted `/data` after the v0.12.1 USER-agnes flip.** `infra/modules/customer-instance/startup-script.sh.tpl` now `chown -R 999:999 /data` after creating the persistent-disk subdirs (`state`, `analytics`, `extracts`). Without this, a freshly-attached PD is root-owned by default and `USER agnes` (uid 999) cannot open `/data/state/system.duckdb` for write — every authed request 500s with `IOException: Cannot open file ... Permission denied` while `/api/health` (which doesn't open the system DB) keeps returning 200, masking the failure from health-only monitoring. Regression first observed on `agnes-development` on 2026-04-29 after the auto-upgrade picked up `:stable` from the 0.12.1 release. **Existing VMs with PD-backed `/data` need a one-time host-side `sudo chown -R 999:999 /var/lib/docker/volumes/agnes_data/_data && sudo docker restart agnes-app-1 agnes-scheduler-1` to recover** — Terraform `metadata_startup_script` only runs on boot, so an apply alone does not retro-fix running VMs.
|
||||||
- `Dockerfile` pins the `agnes` user to `uid:gid 999:999` explicitly (`useradd --uid 999`). Previously the uid was whatever Debian's `useradd --system` assigned next — happened to be 999 today, but a future base-image change picking 998 or 1000 would silently desync from the startup-script's `chown 999:999`, reintroducing the same incident. Pinning makes the contract grep-able from both sides.
|
- `Dockerfile` pins the `agnes` user to `uid:gid 999:999` explicitly (`useradd --uid 999`). Previously the uid was whatever Debian's `useradd --system` assigned next — happened to be 999 today, but a future base-image change picking 998 or 1000 would silently desync from the startup-script's `chown 999:999`, reintroducing the same incident. Pinning makes the contract grep-able from both sides.
|
||||||
- `scripts/smoke-test.sh` no longer silently SKIPs every authed check when `bootstrap` returns 403 (users exist) and `SMOKE_TOKEN` is not set — it now FAILs loudly. Also adds an unauthenticated DB-touching probe (`POST /auth/email/request`) before bootstrap, since `/api/health` deliberately doesn't open `system.duckdb` (kept cheap for LB probes) and so cannot detect filesystem/permission issues. The new probe catches the foundryai-development class of regression even on instances where bootstrap is closed.
|
- `scripts/smoke-test.sh` no longer silently SKIPs every authed check when `bootstrap` returns 403 (users exist) and `SMOKE_TOKEN` is not set — it now FAILs loudly. Also adds an unauthenticated DB-touching probe (`POST /auth/email/request`) before bootstrap, since `/api/health` deliberately doesn't open `system.duckdb` (kept cheap for LB probes) and so cannot detect filesystem/permission issues. The new probe catches the foundryai-development class of regression even on instances where bootstrap is closed.
|
||||||
|
|
@ -23,56 +32,6 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
||||||
|
|
||||||
- `release.yml` adds an `e2e-bind-mount` job that boots the freshly built image against a host-bind-mounted `/data` directory (instead of the named volume the existing `smoke-test` job uses). Docker initializes a fresh named volume by copying from the image's `/data` — which the Dockerfile chowns to `agnes:agnes` before flipping USER — so the named-volume path always works. The bind-mount path mirrors what GCE VMs run via `docker-compose.host-mount.yml`, and includes a negative assertion (write must fail on root-owned `/data` before the operator chown) plus a positive assertion (smoke passes after the chown). Locks in the contract that broke `agnes-development`: removing `chown 999:999` from `startup-script.sh.tpl` or changing the Dockerfile uid pin breaks CI.
|
- `release.yml` adds an `e2e-bind-mount` job that boots the freshly built image against a host-bind-mounted `/data` directory (instead of the named volume the existing `smoke-test` job uses). Docker initializes a fresh named volume by copying from the image's `/data` — which the Dockerfile chowns to `agnes:agnes` before flipping USER — so the named-volume path always works. The bind-mount path mirrors what GCE VMs run via `docker-compose.host-mount.yml`, and includes a negative assertion (write must fail on root-owned `/data` before the operator chown) plus a positive assertion (smoke passes after the chown). Locks in the contract that broke `agnes-development`: removing `chown 999:999` from `startup-script.sh.tpl` or changing the Dockerfile uid pin breaks CI.
|
||||||
|
|
||||||
## [0.15.0] — 2026-04-29
|
|
||||||
|
|
||||||
### Added
|
|
||||||
|
|
||||||
- **Corporate-memory v1 + v1.5 — confidence, contradictions, audience distribution, and rule sync.** Issue #72.
|
|
||||||
- Schema v15: `knowledge_items` gains context-engineering columns (`confidence`, `domain`, `entities`, `source_type`, `source_ref`, `valid_from`, `valid_until`, `supersedes`, `sensitivity`, `is_personal`); new `knowledge_contradictions` table for surfacing conflicting facts and `session_extraction_state` for the verification detector's idempotent resume.
|
|
||||||
- Schema v16: `verification_evidence` table — one row per analyst confirmation, indexed on `item_id`, drives the `confidence` calculation in `services/corporate_memory/confidence.py` (linear-decay with floor + additional-verifier boost; configurable via `instance.yaml`).
|
|
||||||
- **Server**: `GET /api/memory/bundle` returns mandatory + ranked-approved items within a token budget (default 6000 ≈ 24KB) — drives `da sync`'s rule write. `GET /api/memory/stats` now uses SQL aggregation (no full-list materialization, doesn't block the event loop). `POST /api/memory/admin/mandate` and `POST /api/memory/admin/batch` accept an `audience` field; the audience is matched against the caller's `user_group_members` JOIN `user_groups` on read so a user in group `finance` sees `audience='group:finance'` items, and admins see all. Verification-detector endpoint extracts knowledge candidates from session JSONL files and merges evidence into the existing item when a fact is re-asserted.
|
|
||||||
- **CLI**: `da sync` step 7 (`_fetch_and_write_rules`) calls `/api/memory/bundle`, writes `mandatory` items to `.claude/rules/km_<id>.md` (one file per item) and concatenates `approved` into `.claude/rules/km_approved.md`. Stale `km_*.md` files from a previous run are pruned. Best-effort: any HTTP/JSON failure is logged and sync continues.
|
|
||||||
- **Auto-tagging**: `services/corporate_memory/tagger.py` runs an LLM extraction pass on knowledge create/update when `ai:` is configured in `instance.yaml`. Wrapped in `asyncio.to_thread` so it doesn't block the event loop. Best-effort: missing config or LLM error → item created with no auto-tags.
|
|
||||||
- **Per-item privacy**: `is_personal` items are visible only to the contributor and platform admins (members of the `Admin` system group). The `_can_view_item` helper takes a pre-computed `is_priv` flag so list endpoints don't re-query `user_group_members` per item.
|
|
||||||
- **Audit log**: every admin action (mandate / approve / reject / revoke / mandate-batch / contradiction resolve) writes a row tagged `corporate_memory.<action>` with the affected item ids and reason field.
|
|
||||||
- `/me/debug` — self-only auth diagnostic page. Shows the logged-in user their own decoded JWT claims (no raw token), group memberships with sources and bound `external_id` when present, resource grants effective via those memberships, and a "Refetch from Google (dry-run)" button that issues a fresh `fetch_user_groups` call and reports the diff against the cached `user_group_members` snapshot without writing anything. Gated by `AGNES_DEBUG_AUTH=true` env var (default off → route returns 404 and the navbar item is not rendered). Intended for dev / staging VMs; do not enable on customer-facing instances. Issue #116.
|
|
||||||
|
|
||||||
### Changed
|
|
||||||
|
|
||||||
- **`POST /api/memory/{id}/vote` accepts `vote=0`** to retract a previous vote (toggle un-vote from the UI). Pre-fix the API rejected vote=0 with 400 and the UI's toggle logic silently no-op'd on un-toggle.
|
|
||||||
- **`/admin/corporate-memory` and `/corporate-memory/admin` are gated by `require_admin`** (admin-group membership) instead of the v9-era `require_role(Role.KM_ADMIN)`. The km_admin role was collapsed into admin in main's RBAC v13; module authors needing finer-grained corporate-memory curation should use a `resource_grants` row of type `corporate_memory_admin`.
|
|
||||||
|
|
||||||
### Internal
|
|
||||||
|
|
||||||
- `_effective_groups` in `app/api/memory.py` queries `user_group_members JOIN user_groups` instead of reading the `users.groups` JSON column (dropped in v13). The audience-distribution tests in `tests/test_memory_api.py` use a `_add_user_to_group` helper that inserts into `user_group_members` + `user_groups` directly.
|
|
||||||
- `jsonschema` added to dev dependencies for the corporate-memory schema-validation fixtures (`tests/test_corporate_memory_v1.py::TestSchemaValidation`). Production code does not import it.
|
|
||||||
|
|
||||||
## [0.14.0] — 2026-04-29
|
|
||||||
|
|
||||||
### Added
|
|
||||||
|
|
||||||
- **v2 fetch primitives — discovery + scoped fetch** for analytical workflows. Replaces the BigQuery wrap-view pattern (which caused "Response too large" on multi-hundred-million-row source views) with a Claude-session-driven toolkit. Server exposes `GET /api/v2/catalog` (RBAC-filtered table list with flavor + fetch-via hints), `GET /api/v2/schema/{table_id}` (column metadata + BQ flavor hints), `GET /api/v2/sample/{table_id}` (N sample rows), `POST /api/v2/scan` (validator + RBAC + quota + max_result_bytes guard, Arrow IPC stream), `POST /api/v2/scan/estimate` (BigQuery dryRun, no execution). CLI gains `da catalog`, `da schema`, `da describe`, `da fetch <table> --select … --where … --limit N [--estimate] [--as <name>]`, `da snapshot list/refresh/drop/prune`, `da disk-info`. WHERE-clause validator at `app/api/where_validator.py` is sqlglot-backed with structural rejects + function allow-list + column-existence enforcement. Process-local quota tracker (concurrent + daily bytes per user). New `cli/skills/agnes-data-querying.md` standalone skill + CLAUDE.md addendum tells Claude to discover → estimate → fetch a filtered subset locally → analyze. 11 new `instance.yaml` knobs (`api.scan_max_concurrent_per_user`, `api.scan_daily_bytes_per_user`, `api.scan_max_result_bytes`, `api.where_clause_max_length`, `api.catalog_cache_ttl_seconds`, `api.schema_cache_ttl_seconds`, `api.sample_cache_ttl_seconds`, `data_source.bigquery.billing_project`, `data_source.bigquery.legacy_wrap_views`, `snapshots.dir`, `snapshots.cache_size_limit_gb`). Issue #101.
|
|
||||||
- `data_source.bigquery.billing_project` config knob — explicit billing project for BQ scan + estimate. Defaults to `data_source.bigquery.project`. Matters for cross-project read patterns where the VM service account has `bigquery.data.*` on the data project but lacks `serviceusage.services.use` there; setting this to a project where the SA holds `serviceusage.services.use` fixes the dry-run 403.
|
|
||||||
|
|
||||||
### Changed
|
|
||||||
|
|
||||||
- **BREAKING**: `connectors/bigquery/extractor.py` no longer creates a wrap view (`SELECT * FROM bigquery_query(...)`) for VIEW / MATERIALIZED_VIEW entities by default. Operators relying on the old behavior must set `data_source.bigquery.legacy_wrap_views: true` in `instance.yaml` for one release cycle. BASE TABLE entities are unchanged. The new `da fetch` workflow replaces wrap views for analytical queries — see CLAUDE.md "Querying Agnes data — agent rails".
|
|
||||||
- `RegisterTableRequest.primary_key` and `UpdateTableRequest.primary_key` accept `Optional[List[str]]` for composite keys (session-grain MSA tables key on `(session_id, event_date)`, browse rows on more); a bare string remains accepted for backward compat via a `field_validator(mode="before")` that wraps it in a one-element list. Old CLI scripts posting `"primary_key": "session_id"` continue to work.
|
|
||||||
- `GET /api/v2/catalog` now caches the underlying `repo.list_all()` rows globally with the documented `api.catalog_cache_ttl_seconds` default (300s) and runs RBAC fresh per request. The previous per-user cache served stale RBAC-filtered results for up to TTL after a permission flip — `v2_schema.py` and `v2_sample.py` already had this pattern; `v2_catalog.py` now matches.
|
|
||||||
|
|
||||||
## [0.13.0] — 2026-04-29
|
|
||||||
|
|
||||||
### Added
|
|
||||||
|
|
||||||
- **Windows/PowerShell wrapper for local dev.** New `scripts/run-local-dev.ps1` mirrors `scripts/run-local-dev.sh` for operators on Windows where GNU Make / bash aren't available — same compose stack (`docker-compose.yml` + `docker-compose.dev.yml` + `docker-compose.local-dev.yml`), same `LOCAL_DEV_GROUPS` default seeding, same `up` / `down` / `logs` actions. Run `.\scripts\run-local-dev.ps1` for the fast path (reuses existing image) or `.\scripts\run-local-dev.ps1 -Build` to force `--build` after `pyproject.toml` / `Dockerfile` changes. Verified on Docker Desktop for Windows. See `docs/local-development.md`.
|
|
||||||
- **Admin server configuration editor** at `/admin/server-config` — admins can now view and edit `instance.yaml` from the web UI without SSHing into the host. Two new endpoints (`GET /api/admin/server-config` returns the current config with secret-looking values masked; `POST /api/admin/server-config` deep-merges a section patch into `DATA_DIR/state/instance.yaml`). The page lists the editable sections (`instance`, `data_source`, `email`, `telegram`, `jira`, `theme`, `server`, `auth`) and renders a per-field form. Saves touching `auth.*` or `server.*` ("danger zone" — can lock operators out) require an explicit confirmation step. Every save writes an `instance_config.update` row to `audit_log` with a per-field diff (secret values masked as `***`, field paths preserved so a rotation is recorded as `email.smtp_password: *** → ***`). Issue #91.
|
|
||||||
|
|
||||||
### Fixed
|
|
||||||
|
|
||||||
- `app/instance_config.py:load_instance_config` now deep-merges the static `CONFIG_DIR/instance.yaml` with the writable overlay at `DATA_DIR/state/instance.yaml` instead of returning the overlay verbatim when present. Pre-fix, the first save through the new server-config editor (which writes only the section the operator actually touched) caused every consumer of static-only sections (corporate memory, dataset list, OpenMetadata client) to fall through to empty defaults until the overlay was deleted. Issue #91.
|
|
||||||
- `POST /api/admin/configure` now uses the same narrow-overlay write strategy as the new server-config editor: it reads the overlay verbatim (no static fallback), patches only `instance` / `auth` / `data_source`, and writes atomically via tmp + `os.replace`. Pre-fix it seeded `existing` from the env-resolved merged config when no overlay file was present and dumped the whole thing back, persisting cleartext `${ENV_VAR}` values (e.g. `smtp_password`) into the writable overlay even though the wizard never touched those sections. Issue #91.
|
|
||||||
- `POST /api/admin/server-config` now strips redaction sentinels (`***` / `<empty>`) out of every secret-keyed leaf in the incoming patch before the deep-merge. The companion GET endpoint masks secret-keyed children inside nested objects (e.g. `data_source.keboola.token_env`), and the form renders those nested objects as JSON textareas — without the scrub, a no-op save would round-trip the masked JSON back and overwrite the real overlay value (`token_env: "KEBOOLA_STORAGE_TOKEN"` → `"***"`), silently breaking the next sync. Defense-in-depth on both sides: the client form scrubs before posting, and the server scrubs before merge so an API caller (CLI / script) can't corrupt secrets either. Issue #91.
|
|
||||||
|
|
||||||
## [0.16.0] — 2026-04-29
|
## [0.16.0] — 2026-04-29
|
||||||
|
|
||||||
Minor release. Comprehensive deploy safety audit — CI/CD pipeline hardening, 50+ new tests covering previously untested failure modes, DB schema health check, config versioning, and BigQuery ATTACH error resilience. Built on top of v0.15.0 / `2e1dfb7`.
|
Minor release. Comprehensive deploy safety audit — CI/CD pipeline hardening, 50+ new tests covering previously untested failure modes, DB schema health check, config versioning, and BigQuery ATTACH error resilience. Built on top of v0.15.0 / `2e1dfb7`.
|
||||||
|
|
|
||||||
|
|
@ -257,7 +257,7 @@ Admin-managed git repos cloned nightly to `${DATA_DIR}/marketplaces/<slug>/`
|
||||||
so FastAPI can read their contents from disk.
|
so FastAPI can read their contents from disk.
|
||||||
|
|
||||||
- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
|
- Register via `/admin/marketplaces` (admin UI) or `POST /api/marketplaces`.
|
||||||
- Scheduler calls `src.marketplace.sync_marketplaces()` in-process at `daily 03:00` UTC — no HTTP round-trip to the main app.
|
- Scheduler calls `POST /api/marketplaces/sync-all` (admin-only, authed via `SCHEDULER_API_TOKEN`) at `daily 03:00` UTC. Routing through HTTP keeps the app the sole writer to `system.duckdb` — the previous in-process call from the scheduler container raced the app's long-lived DB handle and 500-ed on `Could not set lock on file`.
|
||||||
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
|
- Manual re-sync from the UI ("Sync now") hits `POST /api/marketplaces/{id}/sync`.
|
||||||
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
|
- PATs for private repos persist to `${DATA_DIR}/state/.env_overlay` (chmod 600) as `AGNES_MARKETPLACE_<SLUG>_TOKEN`. DuckDB stores only the env-var name (`token_env`), never the secret.
|
||||||
- Registry lives in DuckDB table `marketplace_registry` (schema v9).
|
- Registry lives in DuckDB table `marketplace_registry` (schema v9).
|
||||||
|
|
|
||||||
|
|
@ -22,6 +22,7 @@ from src.marketplace import (
|
||||||
MarketplaceNotFound,
|
MarketplaceNotFound,
|
||||||
delete_marketplace_dir,
|
delete_marketplace_dir,
|
||||||
is_valid_slug,
|
is_valid_slug,
|
||||||
|
sync_marketplaces,
|
||||||
sync_one,
|
sync_one,
|
||||||
)
|
)
|
||||||
from src.repositories.audit import AuditRepository
|
from src.repositories.audit import AuditRepository
|
||||||
|
|
@ -409,3 +410,46 @@ async def trigger_sync(
|
||||||
{"commit": result["commit"], "action": result["action"]},
|
{"commit": result["commit"], "action": result["action"]},
|
||||||
)
|
)
|
||||||
return result
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@router.post("/sync-all")
|
||||||
|
def trigger_sync_all(
|
||||||
|
user: dict = Depends(require_admin),
|
||||||
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
||||||
|
):
|
||||||
|
"""Sync every registered marketplace.
|
||||||
|
|
||||||
|
Wired up so the scheduler service can drive the nightly refresh over
|
||||||
|
HTTP. The previous implementation called ``src.marketplace.sync_marketplaces``
|
||||||
|
in-process from the scheduler container, which conflicted with the app's
|
||||||
|
long-lived ``system.duckdb`` handle (DuckDB allows only one writer per
|
||||||
|
file across processes). Routing through the app inherits the existing
|
||||||
|
connection without contention.
|
||||||
|
|
||||||
|
Declared ``def`` (not ``async def``) so FastAPI runs it in a thread
|
||||||
|
pool — :func:`sync_marketplaces` does blocking I/O (subprocess git
|
||||||
|
clones with ``GIT_TIMEOUT_SEC=300`` per repo, DuckDB writes, a
|
||||||
|
process-wide threading.Lock) and would freeze the event loop for the
|
||||||
|
duration of a bulk sync if it ran on the asyncio thread. Health
|
||||||
|
checks, login redirects, and every other concurrent request keep
|
||||||
|
serving while the bulk sync churns through the registry.
|
||||||
|
|
||||||
|
One audit row per call summarises the outcome — per-marketplace details
|
||||||
|
live in ``marketplace_registry`` and the per-call result payload below.
|
||||||
|
"""
|
||||||
|
result = sync_marketplaces()
|
||||||
|
# _audit appends "marketplace:" to the target id when writing the
|
||||||
|
# resource column. "_all" produces "marketplace:_all" — a stable,
|
||||||
|
# greppable sentinel for bulk-sync rows; the real per-marketplace
|
||||||
|
# commit/error breakdown is in the params payload.
|
||||||
|
_audit(
|
||||||
|
conn,
|
||||||
|
user["id"],
|
||||||
|
"marketplace.sync_all",
|
||||||
|
"_all",
|
||||||
|
{
|
||||||
|
"synced": [r.get("id") for r in result.get("synced", [])],
|
||||||
|
"errors": [{"id": e.get("id"), "error": e.get("error")} for e in result.get("errors", [])],
|
||||||
|
},
|
||||||
|
)
|
||||||
|
return result
|
||||||
|
|
|
||||||
|
|
@ -179,6 +179,21 @@ async def get_current_user(
|
||||||
detail="Missing or invalid Authorization header",
|
detail="Missing or invalid Authorization header",
|
||||||
)
|
)
|
||||||
|
|
||||||
|
# Shared-secret path for the in-cluster scheduler. Checked before
|
||||||
|
# pat_resolver because the scheduler token is not a JWT — feeding it to
|
||||||
|
# verify_token() would log a spurious decode warning every cron tick.
|
||||||
|
# See app/auth/scheduler_token.py for the threat model.
|
||||||
|
from app.auth.scheduler_token import get_scheduler_user, is_scheduler_token
|
||||||
|
if is_scheduler_token(token):
|
||||||
|
scheduler_user = get_scheduler_user(conn)
|
||||||
|
if scheduler_user:
|
||||||
|
_attach_admin_flag(scheduler_user, conn)
|
||||||
|
return scheduler_user
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=status.HTTP_401_UNAUTHORIZED,
|
||||||
|
detail="Scheduler user not provisioned",
|
||||||
|
)
|
||||||
|
|
||||||
from app.auth.pat_resolver import resolve_token_to_user
|
from app.auth.pat_resolver import resolve_token_to_user
|
||||||
user, reason = resolve_token_to_user(conn, token, request)
|
user, reason = resolve_token_to_user(conn, token, request)
|
||||||
if user:
|
if user:
|
||||||
|
|
@ -226,8 +241,21 @@ async def get_optional_user(
|
||||||
|
|
||||||
|
|
||||||
async def require_session_token(request: Request, user: dict = Depends(get_current_user)) -> dict:
|
async def require_session_token(request: Request, user: dict = Depends(get_current_user)) -> dict:
|
||||||
"""Like get_current_user but rejects PAT — for endpoints that must not
|
"""Like get_current_user but rejects every non-interactive token kind —
|
||||||
be callable via a long-lived CI token (e.g. creating new tokens, changing password)."""
|
for endpoints that must not be callable via a long-lived service or CI
|
||||||
|
credential (e.g. creating new tokens, changing password).
|
||||||
|
|
||||||
|
Two non-interactive paths exist today:
|
||||||
|
|
||||||
|
1. **PAT** — JWT with ``typ="pat"``. Detected by decoding the JWT and
|
||||||
|
inspecting the claim.
|
||||||
|
2. **Scheduler shared secret** — opaque string equal to
|
||||||
|
``SCHEDULER_API_TOKEN``. Not a JWT, so ``verify_token`` returns None
|
||||||
|
and the PAT-claim check would silently pass — meaning a caller
|
||||||
|
holding the scheduler secret could mint persistent PATs through
|
||||||
|
``POST /auth/tokens`` that survive a secret rotation. Explicit
|
||||||
|
check here closes that bypass.
|
||||||
|
"""
|
||||||
auth = request.headers.get("authorization", "")
|
auth = request.headers.get("authorization", "")
|
||||||
token = None
|
token = None
|
||||||
if auth.startswith("Bearer "):
|
if auth.startswith("Bearer "):
|
||||||
|
|
@ -235,6 +263,12 @@ async def require_session_token(request: Request, user: dict = Depends(get_curre
|
||||||
if not token and request:
|
if not token and request:
|
||||||
token = request.cookies.get("access_token")
|
token = request.cookies.get("access_token")
|
||||||
if token:
|
if token:
|
||||||
|
from app.auth.scheduler_token import is_scheduler_token
|
||||||
|
if is_scheduler_token(token):
|
||||||
|
raise HTTPException(
|
||||||
|
status_code=status.HTTP_403_FORBIDDEN,
|
||||||
|
detail="This endpoint requires an interactive session, not a service token",
|
||||||
|
)
|
||||||
from app.auth.jwt import verify_token
|
from app.auth.jwt import verify_token
|
||||||
payload = verify_token(token) or {}
|
payload = verify_token(token) or {}
|
||||||
if payload.get("typ") == "pat":
|
if payload.get("typ") == "pat":
|
||||||
|
|
|
||||||
136
app/auth/scheduler_token.py
Normal file
136
app/auth/scheduler_token.py
Normal file
|
|
@ -0,0 +1,136 @@
|
||||||
|
"""Shared-secret auth path for the in-cluster scheduler service.
|
||||||
|
|
||||||
|
The scheduler container ships every cron tick to the FastAPI app over HTTP
|
||||||
|
(see ``services.scheduler.__main__``). It needs a long-lived credential to
|
||||||
|
authenticate itself, but minting a real PAT for it requires a logged-in
|
||||||
|
session — chicken-and-egg at first boot.
|
||||||
|
|
||||||
|
The pragmatic solution: both the ``app`` and ``scheduler`` containers source
|
||||||
|
the same ``.env`` (via Docker Compose ``env_file: .env``). The
|
||||||
|
``infra/modules/customer-instance/startup-script.sh.tpl`` generates a random
|
||||||
|
``SCHEDULER_API_TOKEN`` once at VM provisioning and writes it there. When a
|
||||||
|
caller presents that exact secret as ``Authorization: Bearer <secret>``, the
|
||||||
|
app loads (or seeds on demand) a synthetic ``scheduler@system.local`` user
|
||||||
|
that is a member of the ``Admin`` system group — so existing RBAC paths
|
||||||
|
continue to work without special-casing.
|
||||||
|
|
||||||
|
Constraints on the secret (enforced here, not parsed):
|
||||||
|
|
||||||
|
- Empty / unset → this auth path is **disabled**. Production deploys should
|
||||||
|
set it; dev / LOCAL_DEV_MODE typically doesn't, since the scheduler
|
||||||
|
rides the dev-bypass instead.
|
||||||
|
- Length < 32 → treated as misconfiguration and disabled. Prevents an
|
||||||
|
operator typo that sets ``SCHEDULER_API_TOKEN=todo`` from accidentally
|
||||||
|
granting admin to a 4-character bearer.
|
||||||
|
- Comparison uses :func:`hmac.compare_digest` — constant-time so a remote
|
||||||
|
caller cannot mount a length-discrimination timing attack.
|
||||||
|
|
||||||
|
Audit: every action by this user is attributed to ``scheduler@system.local``,
|
||||||
|
visible in ``audit_log`` as a normal admin actor. Rotating the secret is
|
||||||
|
``edit .env → docker compose restart app scheduler``; no DB write needed.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import hmac
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import uuid
|
||||||
|
from typing import Optional
|
||||||
|
|
||||||
|
import duckdb
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
# Identity of the synthetic user that backs the shared-secret auth path.
|
||||||
|
# Kept stable so audit-log entries from the scheduler are easy to filter.
|
||||||
|
SCHEDULER_USER_EMAIL = "scheduler@system.local"
|
||||||
|
SCHEDULER_USER_NAME = "Scheduler"
|
||||||
|
|
||||||
|
# Floor on the secret length. 32 bytes ≈ 256 bits of entropy if generated
|
||||||
|
# from /dev/urandom; well above the brute-force frontier and well above any
|
||||||
|
# typo a human is plausibly going to make.
|
||||||
|
SCHEDULER_TOKEN_MIN_LENGTH = 32
|
||||||
|
|
||||||
|
|
||||||
|
def get_scheduler_secret() -> str:
|
||||||
|
"""Return the configured shared secret, stripped. Empty when disabled."""
|
||||||
|
return os.environ.get("SCHEDULER_API_TOKEN", "").strip()
|
||||||
|
|
||||||
|
|
||||||
|
def is_scheduler_token(token: str) -> bool:
|
||||||
|
"""True iff ``token`` exactly matches the configured shared secret.
|
||||||
|
|
||||||
|
Returns False when the env var is empty or shorter than the minimum
|
||||||
|
length (auth path disabled). Uses constant-time comparison.
|
||||||
|
"""
|
||||||
|
if not token:
|
||||||
|
return False
|
||||||
|
secret = get_scheduler_secret()
|
||||||
|
if not secret or len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
|
||||||
|
return False
|
||||||
|
return hmac.compare_digest(token, secret)
|
||||||
|
|
||||||
|
|
||||||
|
def ensure_scheduler_user(conn: duckdb.DuckDBPyConnection) -> dict:
|
||||||
|
"""Idempotently provision the scheduler user + Admin group membership.
|
||||||
|
|
||||||
|
Called both from the app's startup hook (so the user exists from the
|
||||||
|
very first boot) and lazily from :func:`get_scheduler_user` so a token
|
||||||
|
presented before the next restart of the app still resolves.
|
||||||
|
|
||||||
|
Returns the user dict in the same shape ``UserRepository.get_by_email``
|
||||||
|
yields elsewhere — the caller treats it as any other authenticated user.
|
||||||
|
"""
|
||||||
|
from src.db import SYSTEM_ADMIN_GROUP
|
||||||
|
from src.repositories.user_group_members import UserGroupMembersRepository
|
||||||
|
from src.repositories.users import UserRepository
|
||||||
|
|
||||||
|
repo = UserRepository(conn)
|
||||||
|
user = repo.get_by_email(SCHEDULER_USER_EMAIL)
|
||||||
|
if not user:
|
||||||
|
user_id = str(uuid.uuid4())
|
||||||
|
repo.create(
|
||||||
|
id=user_id,
|
||||||
|
email=SCHEDULER_USER_EMAIL,
|
||||||
|
name=SCHEDULER_USER_NAME,
|
||||||
|
role="admin",
|
||||||
|
# No password_hash — this user authenticates via the shared
|
||||||
|
# secret only, never via /auth/login. Keeps the bootstrap
|
||||||
|
# check ("any user has a password?") accurate.
|
||||||
|
password_hash=None,
|
||||||
|
)
|
||||||
|
user = repo.get_by_email(SCHEDULER_USER_EMAIL)
|
||||||
|
logger.info("Seeded scheduler service user: %s", SCHEDULER_USER_EMAIL)
|
||||||
|
|
||||||
|
admin_group = conn.execute(
|
||||||
|
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
|
||||||
|
).fetchone()
|
||||||
|
if admin_group:
|
||||||
|
UserGroupMembersRepository(conn).add_member(
|
||||||
|
user_id=user["id"],
|
||||||
|
group_id=admin_group[0],
|
||||||
|
source="system_seed",
|
||||||
|
added_by="app.auth.scheduler_token:ensure_scheduler_user",
|
||||||
|
)
|
||||||
|
|
||||||
|
return user
|
||||||
|
|
||||||
|
|
||||||
|
def get_scheduler_user(conn: duckdb.DuckDBPyConnection) -> Optional[dict]:
|
||||||
|
"""Look up the scheduler user, seeding it on demand if absent.
|
||||||
|
|
||||||
|
Returns None only when seeding fails — typically a malformed schema or
|
||||||
|
an out-of-band DB error. The caller (``get_current_user``) maps None
|
||||||
|
to a normal 401 so the failure is observable but does not crash.
|
||||||
|
"""
|
||||||
|
from src.repositories.users import UserRepository
|
||||||
|
|
||||||
|
user = UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL)
|
||||||
|
if user:
|
||||||
|
return user
|
||||||
|
try:
|
||||||
|
return ensure_scheduler_user(conn)
|
||||||
|
except Exception as e: # noqa: BLE001
|
||||||
|
logger.error("Failed to provision scheduler user on demand: %s", e)
|
||||||
|
return None
|
||||||
29
app/main.py
29
app/main.py
|
|
@ -292,6 +292,35 @@ def create_app() -> FastAPI:
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Could not seed admin: {e}")
|
logger.warning(f"Could not seed admin: {e}")
|
||||||
|
|
||||||
|
# Seed the synthetic scheduler user when SCHEDULER_API_TOKEN is configured,
|
||||||
|
# so the very first cron tick after a fresh deploy already has a valid
|
||||||
|
# actor to attribute audit-log entries to. The lazy seed in
|
||||||
|
# `app.auth.scheduler_token.get_scheduler_user` covers the case where the
|
||||||
|
# secret is rotated mid-life, but doing it here keeps startup observable.
|
||||||
|
from app.auth.scheduler_token import get_scheduler_secret
|
||||||
|
if get_scheduler_secret():
|
||||||
|
try:
|
||||||
|
from app.auth.scheduler_token import (
|
||||||
|
SCHEDULER_TOKEN_MIN_LENGTH,
|
||||||
|
ensure_scheduler_user,
|
||||||
|
)
|
||||||
|
from src.db import get_system_db
|
||||||
|
secret = get_scheduler_secret()
|
||||||
|
if len(secret) < SCHEDULER_TOKEN_MIN_LENGTH:
|
||||||
|
logger.warning(
|
||||||
|
"SCHEDULER_API_TOKEN is set but only %d chars — auth path"
|
||||||
|
" disabled (minimum %d). Generate a longer secret in .env.",
|
||||||
|
len(secret), SCHEDULER_TOKEN_MIN_LENGTH,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
ensure_scheduler_user(conn)
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Could not seed scheduler user: {e}")
|
||||||
|
|
||||||
# C8: Warn when no user has a password_hash — bootstrap endpoint is open.
|
# C8: Warn when no user has a password_hash — bootstrap endpoint is open.
|
||||||
# This is intentional UX (operator can claim seed admin), but the open
|
# This is intentional UX (operator can claim seed admin), but the open
|
||||||
# window should be visible in startup logs so it's not forgotten.
|
# window should be visible in startup logs so it's not forgotten.
|
||||||
|
|
|
||||||
|
|
@ -75,6 +75,25 @@ if [ "$DATA_SOURCE" = "keboola" ]; then
|
||||||
fi
|
fi
|
||||||
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)
|
JWT_KEY=$(gcloud secrets versions access latest --secret=agnes-$${CUSTOMER_NAME}-jwt-secret)
|
||||||
|
|
||||||
|
# SCHEDULER_API_TOKEN — shared secret between the app and scheduler containers.
|
||||||
|
# Both source the same /opt/agnes/.env via Docker Compose env_file:, so the
|
||||||
|
# scheduler's outbound bearer token always matches the app's expected value.
|
||||||
|
# See app/auth/scheduler_token.py for the auth path it unlocks.
|
||||||
|
#
|
||||||
|
# Preserve across reboots: the token is plumbed into a long-lived synthetic
|
||||||
|
# user, and rotating it forces a restart of both containers. Read back from
|
||||||
|
# an existing .env when present; mint fresh only on the first boot.
|
||||||
|
SCHEDULER_API_TOKEN=""
|
||||||
|
if [ -f "$APP_DIR/.env" ]; then
|
||||||
|
SCHEDULER_API_TOKEN=$(grep -E '^SCHEDULER_API_TOKEN=' "$APP_DIR/.env" | head -1 | cut -d= -f2- | tr -d '"' || true)
|
||||||
|
fi
|
||||||
|
if [ -z "$SCHEDULER_API_TOKEN" ]; then
|
||||||
|
# 64 hex chars = 256 bits of /dev/urandom entropy. Floor enforced in
|
||||||
|
# app/auth/scheduler_token.SCHEDULER_TOKEN_MIN_LENGTH is 32; 64 leaves
|
||||||
|
# headroom for a future tightening without re-provisioning every VM.
|
||||||
|
SCHEDULER_API_TOKEN=$(openssl rand -hex 32)
|
||||||
|
fi
|
||||||
|
|
||||||
# Optional Google OAuth credentials. If the operator has created
|
# Optional Google OAuth credentials. If the operator has created
|
||||||
# google-oauth-client-{id,secret} secrets in the project's Secret Manager
|
# google-oauth-client-{id,secret} secrets in the project's Secret Manager
|
||||||
# AND wired them via runtime_secrets in the calling Terraform, the VM SA can
|
# AND wired them via runtime_secrets in the calling Terraform, the VM SA can
|
||||||
|
|
@ -118,6 +137,7 @@ KEBOOLA_STORAGE_TOKEN=$KEBOOLA_TOKEN
|
||||||
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
|
KEBOOLA_STACK_URL=$KEBOOLA_STACK_URL
|
||||||
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
|
SEED_ADMIN_EMAIL=$SEED_ADMIN_EMAIL
|
||||||
SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
|
SEED_ADMIN_PASSWORD=$SEED_ADMIN_PASSWORD
|
||||||
|
SCHEDULER_API_TOKEN=$SCHEDULER_API_TOKEN
|
||||||
LOG_LEVEL=info
|
LOG_LEVEL=info
|
||||||
DOMAIN=$DOMAIN
|
DOMAIN=$DOMAIN
|
||||||
AGNES_TAG=$IMAGE_TAG
|
AGNES_TAG=$IMAGE_TAG
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,6 @@
|
||||||
[project]
|
[project]
|
||||||
name = "agnes-the-ai-analyst"
|
name = "agnes-the-ai-analyst"
|
||||||
version = "0.15.0"
|
version = "0.17.0"
|
||||||
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
||||||
requires-python = ">=3.11,<3.14"
|
requires-python = ">=3.11,<3.14"
|
||||||
license = "MIT"
|
license = "MIT"
|
||||||
|
|
|
||||||
|
|
@ -1,12 +1,20 @@
|
||||||
"""Scheduler service — replaces systemd timers.
|
"""Scheduler service — replaces systemd timers.
|
||||||
|
|
||||||
Lightweight sidecar that fires scheduled jobs. Two job kinds:
|
Lightweight sidecar that fires scheduled jobs over HTTP against the main
|
||||||
- "http": POST/GET an endpoint on the main app (e.g. data-refresh).
|
app. Authenticates with ``SCHEDULER_API_TOKEN`` (shared-secret synthetic
|
||||||
- "fn": call a Python function in-process (e.g. marketplaces sync).
|
admin — see ``app.auth.scheduler_token``); falls back to no-auth in
|
||||||
|
LOCAL_DEV_MODE.
|
||||||
|
|
||||||
Schedules are strings parsed by src.scheduler.is_table_due — accepts
|
Schedules are strings parsed by ``src.scheduler.is_table_due`` — accepts
|
||||||
"every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".
|
"every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".
|
||||||
|
|
||||||
|
Why every job is HTTP and nothing runs in-process: the scheduler container
|
||||||
|
shares ``/data/state/system.duckdb`` with the app container, but DuckDB
|
||||||
|
permits only one writer per file across processes. An in-process call
|
||||||
|
from the scheduler raced the app's long-lived handle and 500-ed on
|
||||||
|
``Could not set lock on file``. Going through HTTP makes the app the sole
|
||||||
|
writer; the scheduler is reduced to a pure cron clock.
|
||||||
|
|
||||||
Usage: python -m services.scheduler
|
Usage: python -m services.scheduler
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
|
@ -36,21 +44,17 @@ _token_warning_emitted = False
|
||||||
def _get_auth_token() -> str:
|
def _get_auth_token() -> str:
|
||||||
"""Return the bearer token for API calls.
|
"""Return the bearer token for API calls.
|
||||||
|
|
||||||
Production: ``SCHEDULER_API_TOKEN`` env var carries a long-lived PAT
|
Production: ``SCHEDULER_API_TOKEN`` is a shared secret generated by the
|
||||||
minted via ``/tokens`` for a service-account user with the roles the
|
Terraform startup script and written to ``/opt/agnes/.env``. Both the
|
||||||
jobs need (typically ``core.admin`` for sync triggers). Set it.
|
``app`` and ``scheduler`` containers source the same .env via Docker
|
||||||
|
Compose ``env_file:``, so the secret is symmetric. The app validates
|
||||||
|
incoming Bearer tokens against this env var (constant-time compare in
|
||||||
|
``app.auth.scheduler_token``) and resolves matches to a synthetic
|
||||||
|
``scheduler@system.local`` user that is a member of the Admin group.
|
||||||
|
|
||||||
Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
|
Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
|
||||||
string and calls the API without an ``Authorization`` header — the
|
string and calls the API without an ``Authorization`` header — the
|
||||||
API's dev-bypass auto-authenticates the request as the dev user.
|
API's dev-bypass auto-authenticates the request as the dev user.
|
||||||
|
|
||||||
The previous implementation tried to auto-fetch a token by POSTing to
|
|
||||||
``/auth/token`` with just the seed admin's email. That endpoint
|
|
||||||
requires email + password (or rejects external-auth accounts that
|
|
||||||
have no local password), so the call always 401-ed and the scheduler
|
|
||||||
log was noisy with one access-log line per cron tick. Removed in
|
|
||||||
favor of explicit configuration: either set the PAT or rely on
|
|
||||||
LOCAL_DEV_MODE.
|
|
||||||
"""
|
"""
|
||||||
global _token_warning_emitted
|
global _token_warning_emitted
|
||||||
if SCHEDULER_API_TOKEN:
|
if SCHEDULER_API_TOKEN:
|
||||||
|
|
@ -65,24 +69,23 @@ def _get_auth_token() -> str:
|
||||||
return ""
|
return ""
|
||||||
|
|
||||||
|
|
||||||
def _marketplaces_job():
|
# Schedule definitions: (name, schedule_string, endpoint, method, timeout_sec).
|
||||||
"""Entry point for the nightly marketplaces sync.
|
# All jobs are HTTP — see the module docstring for why nothing runs
|
||||||
|
# in-process anymore. ``daily 03:00`` for marketplaces matches the cadence
|
||||||
Imported lazily so the scheduler container still starts even if the
|
# the previous in-process job used; the endpoint is admin-only and
|
||||||
module has an import-time issue in development — a failure here only
|
# idempotent (it iterates the registry and per-marketplace errors do not
|
||||||
kills one job, not the whole loop.
|
# abort the run).
|
||||||
"""
|
#
|
||||||
from src.marketplace import sync_marketplaces
|
# timeout_sec: per-job override for the httpx call. Marketplaces gets a
|
||||||
return sync_marketplaces()
|
# generous 15 min because the app handler iterates every registered
|
||||||
|
# marketplace under a single lock with up to 300s of git timeout per
|
||||||
|
# entry — at 120s (the default that data-refresh uses) a real-world
|
||||||
# Schedule definitions: (name, schedule_string, kind, target)
|
# registry of more than 2-3 slow repos times out the scheduler call,
|
||||||
# kind = "http" -> target = (endpoint, method)
|
# which then re-fires on the next 30s tick and queues a redundant sync.
|
||||||
# kind = "fn" -> target = callable_returning_any
|
|
||||||
JOBS = [
|
JOBS = [
|
||||||
("data-refresh", "every 15m", "http", ("/api/sync/trigger", "POST")),
|
("data-refresh", "every 15m", "/api/sync/trigger", "POST", 120),
|
||||||
("health-check", "every 5m", "http", ("/api/health", "GET")),
|
("health-check", "every 5m", "/api/health", "GET", 30),
|
||||||
("marketplaces", "daily 03:00", "fn", _marketplaces_job),
|
("marketplaces", "daily 03:00", "/api/marketplaces/sync-all", "POST", 900),
|
||||||
]
|
]
|
||||||
|
|
||||||
_running = True
|
_running = True
|
||||||
|
|
@ -94,7 +97,7 @@ def _signal_handler(sig, frame):
|
||||||
_running = False
|
_running = False
|
||||||
|
|
||||||
|
|
||||||
def _call_api(endpoint: str, method: str = "POST") -> bool:
|
def _call_api(endpoint: str, method: str, timeout_sec: int) -> bool:
|
||||||
"""Call the main app API. Returns True on success."""
|
"""Call the main app API. Returns True on success."""
|
||||||
url = f"{API_URL}{endpoint}"
|
url = f"{API_URL}{endpoint}"
|
||||||
headers = {}
|
headers = {}
|
||||||
|
|
@ -103,9 +106,9 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
|
||||||
headers["Authorization"] = f"Bearer {token}"
|
headers["Authorization"] = f"Bearer {token}"
|
||||||
try:
|
try:
|
||||||
if method == "POST":
|
if method == "POST":
|
||||||
resp = httpx.post(url, headers=headers, timeout=120)
|
resp = httpx.post(url, headers=headers, timeout=timeout_sec)
|
||||||
else:
|
else:
|
||||||
resp = httpx.get(url, headers=headers, timeout=30)
|
resp = httpx.get(url, headers=headers, timeout=timeout_sec)
|
||||||
if resp.status_code < 400:
|
if resp.status_code < 400:
|
||||||
logger.info(f"Job {endpoint}: {resp.status_code}")
|
logger.info(f"Job {endpoint}: {resp.status_code}")
|
||||||
return True
|
return True
|
||||||
|
|
@ -117,17 +120,6 @@ def _call_api(endpoint: str, method: str = "POST") -> bool:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
|
||||||
def _call_fn(label: str, fn) -> bool:
|
|
||||||
"""Run an in-process callable. Returns True on success."""
|
|
||||||
try:
|
|
||||||
result = fn()
|
|
||||||
logger.info("Job %s OK: %s", label, result)
|
|
||||||
return True
|
|
||||||
except Exception as e:
|
|
||||||
logger.error("Job %s failed: %s", label, e)
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
def run():
|
def run():
|
||||||
signal.signal(signal.SIGTERM, _signal_handler)
|
signal.signal(signal.SIGTERM, _signal_handler)
|
||||||
signal.signal(signal.SIGINT, _signal_handler)
|
signal.signal(signal.SIGINT, _signal_handler)
|
||||||
|
|
@ -140,18 +132,11 @@ def run():
|
||||||
|
|
||||||
while _running:
|
while _running:
|
||||||
now_iso = datetime.now(timezone.utc).isoformat()
|
now_iso = datetime.now(timezone.utc).isoformat()
|
||||||
for name, schedule, kind, target in JOBS:
|
for name, schedule, endpoint, method, timeout_sec in JOBS:
|
||||||
if not is_table_due(schedule, last_run[name]):
|
if not is_table_due(schedule, last_run[name]):
|
||||||
continue
|
continue
|
||||||
logger.info("Running job: %s (%s)", name, schedule)
|
logger.info("Running job: %s (%s)", name, schedule)
|
||||||
if kind == "http":
|
ok = _call_api(endpoint, method, timeout_sec)
|
||||||
endpoint, method = target
|
|
||||||
ok = _call_api(endpoint, method)
|
|
||||||
elif kind == "fn":
|
|
||||||
ok = _call_fn(name, target)
|
|
||||||
else:
|
|
||||||
logger.error("Unknown job kind %r for %s", kind, name)
|
|
||||||
ok = False
|
|
||||||
if ok:
|
if ok:
|
||||||
last_run[name] = now_iso
|
last_run[name] = now_iso
|
||||||
# 30s tick is plenty: interval jobs have minute-level resolution,
|
# 30s tick is plenty: interval jobs have minute-level resolution,
|
||||||
|
|
|
||||||
153
tests/test_auth_scheduler_token.py
Normal file
153
tests/test_auth_scheduler_token.py
Normal file
|
|
@ -0,0 +1,153 @@
|
||||||
|
"""Tests for the SCHEDULER_API_TOKEN shared-secret auth path."""
|
||||||
|
|
||||||
|
import tempfile
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def fresh_db(monkeypatch):
|
||||||
|
"""Isolated DuckDB + JWT secret per test, mirroring tests/test_pat.py."""
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
monkeypatch.setenv("DATA_DIR", tmp)
|
||||||
|
monkeypatch.setenv("TESTING", "1")
|
||||||
|
monkeypatch.setenv("JWT_SECRET_KEY", "test-jwt-secret-key-minimum-32-chars!!")
|
||||||
|
# Clean slate — clear any inherited token from the host shell.
|
||||||
|
monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
|
||||||
|
# Force pristine state — earlier tests in the same session may have
|
||||||
|
# opened the singleton; drop it so the new DATA_DIR takes effect.
|
||||||
|
from src.db import close_system_db
|
||||||
|
close_system_db()
|
||||||
|
yield tmp
|
||||||
|
close_system_db()
|
||||||
|
|
||||||
|
|
||||||
|
def test_is_scheduler_token_disabled_when_env_unset(fresh_db, monkeypatch):
|
||||||
|
"""Empty SCHEDULER_API_TOKEN must disable the auth path entirely.
|
||||||
|
|
||||||
|
A bug here would let any caller authenticate with empty Bearer "" — the
|
||||||
|
constant-time compare would also be empty — granting admin to anyone.
|
||||||
|
"""
|
||||||
|
from app.auth.scheduler_token import is_scheduler_token
|
||||||
|
|
||||||
|
monkeypatch.delenv("SCHEDULER_API_TOKEN", raising=False)
|
||||||
|
assert is_scheduler_token("") is False
|
||||||
|
assert is_scheduler_token("anything") is False
|
||||||
|
|
||||||
|
|
||||||
|
def test_is_scheduler_token_disabled_when_env_too_short(fresh_db, monkeypatch):
|
||||||
|
"""Operator typo (SCHEDULER_API_TOKEN=todo) must NOT grant admin.
|
||||||
|
|
||||||
|
The minimum length floor exists specifically to prevent a 4-char bearer
|
||||||
|
from accidentally matching a 4-char misconfigured secret.
|
||||||
|
"""
|
||||||
|
from app.auth.scheduler_token import is_scheduler_token
|
||||||
|
|
||||||
|
monkeypatch.setenv("SCHEDULER_API_TOKEN", "too-short")
|
||||||
|
assert is_scheduler_token("too-short") is False
|
||||||
|
|
||||||
|
|
||||||
|
def test_is_scheduler_token_matches_only_exact_value(fresh_db, monkeypatch):
|
||||||
|
from app.auth.scheduler_token import is_scheduler_token
|
||||||
|
|
||||||
|
secret = "x" * 64 # > min length
|
||||||
|
monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)
|
||||||
|
assert is_scheduler_token(secret) is True
|
||||||
|
assert is_scheduler_token(secret + "trailing") is False
|
||||||
|
assert is_scheduler_token(secret[:-1]) is False
|
||||||
|
assert is_scheduler_token("y" * 64) is False
|
||||||
|
|
||||||
|
|
||||||
|
def test_ensure_scheduler_user_seeds_user_and_admin_membership(fresh_db, monkeypatch):
|
||||||
|
"""First call seeds; second call is a no-op idempotent re-add."""
|
||||||
|
from app.auth.scheduler_token import (
|
||||||
|
SCHEDULER_USER_EMAIL,
|
||||||
|
ensure_scheduler_user,
|
||||||
|
)
|
||||||
|
from src.db import SYSTEM_ADMIN_GROUP, get_system_db
|
||||||
|
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
user1 = ensure_scheduler_user(conn)
|
||||||
|
assert user1["email"] == SCHEDULER_USER_EMAIL
|
||||||
|
# Admin group membership exists.
|
||||||
|
admin_group = conn.execute(
|
||||||
|
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP],
|
||||||
|
).fetchone()
|
||||||
|
assert admin_group is not None
|
||||||
|
membership = conn.execute(
|
||||||
|
"SELECT 1 FROM user_group_members WHERE user_id = ? AND group_id = ?",
|
||||||
|
[user1["id"], admin_group[0]],
|
||||||
|
).fetchone()
|
||||||
|
assert membership is not None
|
||||||
|
|
||||||
|
# Second call — same id, no duplicate membership row.
|
||||||
|
user2 = ensure_scheduler_user(conn)
|
||||||
|
assert user2["id"] == user1["id"]
|
||||||
|
rows = conn.execute(
|
||||||
|
"SELECT COUNT(*) FROM user_group_members WHERE user_id = ? AND group_id = ?",
|
||||||
|
[user1["id"], admin_group[0]],
|
||||||
|
).fetchone()
|
||||||
|
assert rows[0] == 1
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_scheduler_user_lazy_seeds_when_absent(fresh_db, monkeypatch):
|
||||||
|
"""First lookup with no prior seed should provision on demand.
|
||||||
|
|
||||||
|
The startup hook in app.main also seeds eagerly, but the scheduler may
|
||||||
|
present the token before main.py has finished its lifespan setup on a
|
||||||
|
cold boot — get_scheduler_user must close that gap.
|
||||||
|
"""
|
||||||
|
from app.auth.scheduler_token import (
|
||||||
|
SCHEDULER_USER_EMAIL,
|
||||||
|
get_scheduler_user,
|
||||||
|
)
|
||||||
|
from src.db import get_system_db
|
||||||
|
from src.repositories.users import UserRepository
|
||||||
|
|
||||||
|
conn = get_system_db()
|
||||||
|
try:
|
||||||
|
# Confirm user does not exist before the call.
|
||||||
|
assert UserRepository(conn).get_by_email(SCHEDULER_USER_EMAIL) is None
|
||||||
|
user = get_scheduler_user(conn)
|
||||||
|
assert user is not None
|
||||||
|
assert user["email"] == SCHEDULER_USER_EMAIL
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_require_session_token_rejects_scheduler_secret(fresh_db, monkeypatch):
|
||||||
|
"""The shared scheduler secret must NOT pass `require_session_token`.
|
||||||
|
|
||||||
|
/auth/tokens (PAT minting) is gated by `require_session_token`, which
|
||||||
|
historically rejected only PATs (JWTs with typ=pat). The scheduler
|
||||||
|
secret is opaque so verify_token() returns None and the PAT-claim
|
||||||
|
check would silently pass — letting a compromised secret forge
|
||||||
|
persistent PATs that survive a rotation. Regression guard for the
|
||||||
|
Devin review on PR #127.
|
||||||
|
"""
|
||||||
|
import asyncio
|
||||||
|
from unittest.mock import MagicMock
|
||||||
|
|
||||||
|
from fastapi import HTTPException
|
||||||
|
|
||||||
|
from app.auth.dependencies import require_session_token
|
||||||
|
|
||||||
|
secret = "x" * 64
|
||||||
|
monkeypatch.setenv("SCHEDULER_API_TOKEN", secret)
|
||||||
|
|
||||||
|
request = MagicMock()
|
||||||
|
request.headers = {"authorization": f"Bearer {secret}"}
|
||||||
|
request.cookies = {}
|
||||||
|
|
||||||
|
user = {"id": "scheduler-id", "email": "scheduler@system.local"}
|
||||||
|
try:
|
||||||
|
asyncio.run(require_session_token(request=request, user=user))
|
||||||
|
except HTTPException as exc:
|
||||||
|
assert exc.status_code == 403
|
||||||
|
# Detail should signal "interactive only", flavor doesn't matter.
|
||||||
|
assert "interactive" in exc.detail.lower()
|
||||||
|
else:
|
||||||
|
raise AssertionError("require_session_token must reject scheduler secret")
|
||||||
Loading…
Reference in a new issue