* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret
Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:
1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
from the scheduler container, racing the app's long-lived system.duckdb
handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
"Could not set lock on file ... PID 0".
2. The data-refresh + new marketplaces jobs both 401-ed on the API because
SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
The scheduler had no credential to authenticate with.
Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
through the app process so it inherits the existing DB connection.
- The scheduler switches the marketplaces job from an in-process function
call to an HTTP request; all jobs are now plain HTTP and the scheduler is
reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
startup script generates a 256-bit secret on first boot, persists it
across reboots, and writes it to /opt/agnes/.env. Both containers source
the same .env. The app validates incoming Bearer tokens against the env
var (constant-time, length-floored) and resolves matches to a synthetic
scheduler@system.local user that's a member of the Admin system group.
Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
has a valid actor; lazy seed in get_scheduler_user covers token rotation
before the next app restart.
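The shared-secret check described above can be sketched roughly as follows. This is a minimal illustration, not the contents of app/auth/scheduler_token.py: the 32-character floor, the function signature, and the injectable `env` mapping are assumptions made for the example.

```python
import hmac
import os
import secrets

MIN_SECRET_LEN = 32  # assumed floor; the commit does not state the real value


def is_scheduler_token(presented: str, env=os.environ) -> bool:
    """Return True only if `presented` matches SCHEDULER_API_TOKEN."""
    expected = env.get("SCHEDULER_API_TOKEN", "")
    # Length floor: an unset or too-short secret disables this auth path.
    if len(expected) < MIN_SECRET_LEN:
        return False
    # compare_digest does not short-circuit on the first differing byte.
    return hmac.compare_digest(presented.encode(), expected.encode())


# 32 random bytes = 256 bits, hex-encoded to 64 characters.
secret = secrets.token_hex(32)
env = {"SCHEDULER_API_TOKEN": secret}
```

With an empty or short secret the function returns False for every input, so a misconfigured VM fails closed instead of accepting an empty Bearer token.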
Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.
Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.
* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127
Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.
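A minimal sketch of the sentinel, assuming the resource column is built as `marketplace:<id>`; the helper name `audit_resource` is hypothetical and not taken from the codebase.

```python
from typing import Optional

ALL_MARKETPLACES = "_all"  # sentinel named in the commit message


def audit_resource(marketplace_id: Optional[str] = None) -> str:
    """Build the audit_log resource value; None means a bulk sync."""
    # Without this branch an f-string would render None as the text "None".
    target = ALL_MARKETPLACES if marketplace_id is None else marketplace_id
    return f"marketplace:{target}"
```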
* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127
Two findings from Devin re-review on commit 5fbad15:
1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
event loop. sync_marketplaces() does blocking I/O (subprocess git
clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
and would freeze every concurrent request for the duration of a bulk
sync. Switched to plain def so FastAPI auto-routes to the thread pool.
2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
Bulk marketplace sync iterates the registry under a single lock with
up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
scheduler then sees a timeout, doesn't update last_run, and re-fires
on the next 30s tick, queueing redundant work. Per-job timeout
override added to the JOBS tuple; marketplaces gets 900s (15 min),
data-refresh keeps 120s, health-check 30s.
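The per-job override could look something like the sketch below. The `Job` tuple shape and the data-refresh and health-check paths are assumptions for illustration; only /api/marketplaces/sync-all and the three timeout values come from this change.

```python
from typing import NamedTuple, Optional

DEFAULT_TIMEOUT_SEC = 120.0  # the previous fixed timeout


class Job(NamedTuple):
    name: str
    path: str
    timeout_sec: Optional[float] = None  # None falls back to the default


JOBS = (
    Job("data-refresh", "/api/sync/trigger"),                  # hypothetical path
    Job("marketplaces", "/api/marketplaces/sync-all", 900.0),  # 15 min
    Job("health-check", "/api/health", 30.0),                  # hypothetical path
)


def timeout_for(job: Job) -> float:
    """Resolve the timeout to pass to the HTTP client for this job."""
    return DEFAULT_TIMEOUT_SEC if job.timeout_sec is None else job.timeout_sec


timeouts = {job.name: timeout_for(job) for job in JOBS}
```

The resolved value would be passed as the `timeout` argument of the scheduler's httpx POST, so a long bulk sync is no longer reported as a failure and re-fired on the next tick.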
* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127
require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.
Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.
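The ordering of the two checks is the point of the fix, and can be sketched as below. This is an illustration only: the real dependency is a FastAPI dependency, whereas here `verify_token` and `is_scheduler_token` are injected as plain callables, and the `Forbidden` exception, the fake token table, and the `"session"` typ value are all hypothetical.

```python
from typing import Callable, Optional


class Forbidden(Exception):
    """Raised when a credential may not be used to mint PATs."""


def require_session_token(
    token: str,
    verify_token: Callable[[str], Optional[dict]],
    is_scheduler_token: Callable[[str], bool],
) -> dict:
    # New check: reject the opaque shared secret before any JWT parsing.
    if is_scheduler_token(token):
        raise Forbidden("scheduler shared secret cannot mint PATs")
    # verify_token() returns None for anything that is not a valid JWT.
    payload = verify_token(token) or {}
    # Existing check: a PAT must not be used to mint further PATs.
    if payload.get("typ") == "pat":
        raise Forbidden("PATs cannot mint PATs")
    return payload


# Stand-ins for the example only.
SECRET = "s" * 64
fake_jwts = {"session-jwt": {"typ": "session"}, "pat-jwt": {"typ": "pat"}}


def outcome(token: str) -> str:
    try:
        require_session_token(token, fake_jwts.get, lambda t: t == SECRET)
        return "allowed"
    except Forbidden:
        return "rejected"
```

Without the first check, `SECRET` would decode to an empty payload, skip the `typ` comparison, and be allowed through.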
Devin's other note, that the pre-existing async def trigger_sync at
marketplaces.py:392 also calls the blocking sync_one, was flagged by Devin
as out of scope for this PR. I agree and am tracking it separately.
* release(0.17.0): cut + clean up CHANGELOG duplicates
Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).
Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).
Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0]. The canonical short summaries (with GitHub-release
links) already exist below 0.16.0; the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
This squashes 13 commits from ma/staging plus a small docstring translation
into a single coherent unit. Three workstreams.
== RBAC v13 redesign ==
- Drops core.viewer/analyst/km_admin/admin hierarchy and the
internal_roles / group_mappings / user_role_grants / plugin_access tables.
- Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill
wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry.
- Two authorization primitives in app.auth.access:
    require_admin: Admin-group god-mode
    require_resource_access(rt, "{path}"): entity-scoped grants
Single DB lookup per request; no session cache; no implies BFS.
- /admin/access UI (single page) replaces /admin/role-mapping +
/admin/plugin-access. CLI `da admin group/grant *` replaces
`da admin role/mapping/grant-role/revoke-role/effective-roles`.
- ResourceType.TABLE is listing-only: admins can record table grants, but
runtime enforcement still flows through legacy dataset_permissions
(migration plan in docs/TODO-rbac-data-enforcement.md).
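As a rough mental model of the two primitives, the sketch below uses in-memory sets in place of the user_group_members and resource_grants tables. The function signatures, the `Grant` tuple layout, the `AccessDenied` exception, and the example group and resource names are assumptions; the real primitives live in app.auth.access and are FastAPI dependencies. Whether Admin membership also bypasses resource grants is not stated here, so the sketch does not model it.

```python
from typing import Set, Tuple

ADMIN_GROUP = "Admin"
Grant = Tuple[str, str, str]  # (group, resource_type, resource_id), assumed layout


class AccessDenied(Exception):
    pass


def require_admin(user_groups: Set[str]) -> None:
    """Pass only for members of the Admin system group."""
    if ADMIN_GROUP not in user_groups:
        raise AccessDenied("admin only")


def require_resource_access(
    user_groups: Set[str], grants: Set[Grant], resource_type: str, resource_id: str
) -> None:
    """Pass only if one of the user's groups holds a grant on this exact entity."""
    # Flat membership test: no role hierarchy and no implied grants to traverse.
    if not any((g, resource_type, resource_id) in grants for g in user_groups):
        raise AccessDenied(f"no grant on {resource_type}:{resource_id}")


def allowed(check, *args) -> bool:
    try:
        check(*args)
        return True
    except AccessDenied:
        return False


grants = {("analysts", "table", "sales.orders")}  # hypothetical grant
```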
== Claude Code marketplace ==
- Aggregated /marketplace.zip + /marketplace.git/* (PAT-gated,
RBAC-filtered, content-addressed cache via dulwich).
- Admin god-mode dropped on the marketplace surface — admins curate
their own view via grants like everyone else.
- Bare-repo cache materializes per RBAC-filtered ETag; stale entries
not pruned in this iteration (disclaimed in git_backend.py docstring).
== #81 / #83 / #44 security/ops hardening ==
- #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias).
- #81 Group B — Keboola extractor 3-state exit codes:
0 success / 1 total fail / 2 PARTIAL fail
The Sync API logs a PARTIAL FAILURE alert on exit 2. Operators whose
alerting is binary (success/failure) must teach it the new partial signal.
- #81 Group C — schema v10 view_ownership; rejects silent overwrite
of a prior connector's view name on collision.
- #81 Group D — extractor-side identifier validation.
- #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset
+ path-traversal fix.
- #44 — entire /api/scripts/* surface is admin-only (planted-script +
sandbox-bypass risk closed).
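For operators adapting their alerting to the three-state exit codes from #81 Group B, a minimal mapping might look like this. The enum and function names are hypothetical, and treating unknown codes as total failures is an assumption rather than documented behavior.

```python
import enum


class ExtractorExit(enum.IntEnum):
    SUCCESS = 0
    TOTAL_FAILURE = 1
    PARTIAL_FAILURE = 2


def classify(returncode: int) -> str:
    """Map an extractor exit code to an alert level."""
    if returncode == ExtractorExit.SUCCESS:
        return "ok"
    if returncode == ExtractorExit.PARTIAL_FAILURE:
        return "partial"
    # Exit 1 and any unrecognized code are treated as a total failure (assumption).
    return "failed"
```

A check that only tests for a non-zero exit would report code 2 as a full outage, which is why the partial state needs its own branch.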
== Web UI polish + deploy fix ==
- /admin/access: live grant-count badges (no stale snapshot revert),
shared-header CSS link added to /catalog and /admin/{tables,permissions},
per-resource-type colored stripes.
- docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't
silently shadow sub-mounts and write state to the wrong disk.
== OSS vendor-neutralization (waves 1+2) ==
- scripts/grpn/ → scripts/ops/. Customer-specific identifiers
(project IDs, internal hostnames, dev/prod VM IPs, brand names)
replaced with placeholders across code, docs, Terraform, Caddyfile,
OAuth probe, and planning docs. Downstream infra repos that copied
scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must
update the path.
== Translation ==
- src/repositories/user_groups.py::ensure_system docstring translated
from Czech to English for codebase consistency.
Co-authored-by: Mina Rustamyan <mina@keboola.com>