# Changelog

All notable changes to Agnes AI Data Analyst.

Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). Versions follow [Semantic Versioning](https://semver.org/spec/v2.0.0.html). While the project is pre-1.0, the public surface (CLI flags, REST endpoints, `instance.yaml` schema, `extract.duckdb` contract) may shift between minor versions; breaking changes are called out under **Changed** or **Removed** with the **BREAKING** marker.

CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every CI build; semver tags (`v0.X.Y`) are cut at release boundaries and reference the same commit as a `stable-*` tag from the same day.

---

## [Unreleased]

<!-- Add bullets here. Group: Added / Changed / Fixed / Removed / Internal.
     Mark breaking changes with **BREAKING** at the start of the bullet. -->

## [0.12.1] — 2026-04-28

Patch release. Hotfixes the pre-migration snapshot-integrity bug shipped in [v0.12.0](https://github.com/keboola/agnes-the-ai-analyst/releases/tag/v0.12.0) and bundles the security/ops hardening from issue groups #82 (auth hardening), #85 (API validation), #87 (deploy posture), plus #46 (SSRF) and #90 (memory stats blocking).

### Added

- Path-traversal validation on `/api/data/{table_id}/download` — `table_id` is
  now checked against `_SAFE_QUOTED_IDENTIFIER` regex (allows dots and hyphens
  for Keboola-style IDs like `in.c-crm.orders`) before any filesystem or DB
  operation; unsafe values return 404 (no info leakage). See issue #85/C2.
- SSRF protection on `POST /api/admin/configure` — `keboola_url` is validated
  against private/reserved networks (127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12,
  192.168.0.0/16, localhost, IPv6 loopback/link-local/unique-local). Uses
  DNS resolution + `ipaddress` module for robust IPv6 handling (catches
  abbreviated forms like `fe80::1`, `fc00::1`). See issue #46.
- Caddyfile security headers: `X-Frame-Options DENY`,
  `X-Content-Type-Options nosniff`,
  `Referrer-Policy strict-origin-when-cross-origin`, `-Server` (strip).
  See issue #87/M22.
- Container runs as non-root user `agnes` — `USER` directive added to
  Dockerfile with `useradd` + `chown`. See issue #87/C13.
- Docker resource limits: `mem_limit: 4g`, `mem_reservation: 1g`,
  `cpus: 2.0` on `app`; `mem_limit: 2g`, `cpus: 1.0` on `scheduler`.
  See issue #87/M21.
- Startup warning when no user has `password_hash` — alerts operators that
  `/auth/bootstrap` is reachable. See issue #82/C8.
- Audit logging for failed web form login attempts (`/auth/password/login/web`)
  — mirrors the existing `/auth/token` audit trail. See issue #82/M9.
- `/api/health/detailed` endpoint (authenticated) — returns full diagnostics
  (version, schema, sync state, user count). Minimal `/api/health` (unauth)
  returns only `{"status": "ok"}` for load balancers. See issue #87/M17.
- Health endpoint monitoring guide in `docs/DEPLOYMENT.md` — documents both
  endpoints and how to wire external monitoring tools (Datadog, Prometheus,
  UptimeRobot) to `/api/health/detailed` with a PAT.
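
The SSRF entry in this list describes resolving the hostname and checking each resulting address with the `ipaddress` module. A minimal sketch of that approach follows. The function names, the injectable `resolver` parameter, and the exact set of `ipaddress` predicates are assumptions made for illustration, not the project's actual code.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_address(ip_text: str) -> bool:
    """True if the address is in a range an outbound request should avoid."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private        # 10/8, 172.16/12, 192.168/16, fc00::/7, ...
        or ip.is_loopback    # 127/8, ::1
        or ip.is_link_local  # 169.254/16, fe80::/10
        or ip.is_reserved
        or ip.is_multicast   # 224/4, ff00::/8
    )


def validate_outbound_url(url: str, resolver=socket.getaddrinfo) -> None:
    """Raise ValueError if any address the hostname resolves to is blocked.

    `resolver` is a hypothetical injection point so tests need no real DNS.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("expected an http(s) URL with a hostname")
    try:
        infos = resolver(parsed.hostname, parsed.port or 443)
    except socket.gaierror as exc:
        raise ValueError(f"cannot resolve {parsed.hostname!r}") from exc
    for info in infos:
        # info[4] is the sockaddr tuple; its first item is the IP string.
        address = info[4][0].split("%", 1)[0]  # strip an IPv6 zone id if present
        if is_blocked_address(address):
            raise ValueError(f"{parsed.hostname!r} resolves to blocked address {address}")
```

Parsing the resolved address instead of pattern-matching the hostname is what lets abbreviated IPv6 forms such as `fe80::1` be classified correctly. A check of this kind validates the address at check time only; a later connection that re-resolves the name could see a different address.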

### Changed

- **BREAKING** `docker-compose.override.yml` renamed to `docker-compose.dev.yml`.
  Docker Compose auto-merges `docker-compose.override.yml` on every host with
  the repo, silently enabling dev mode (source mount + `--reload`) on
  production. The new name requires explicit `-f docker-compose.dev.yml`,
  eliminating the foot-gun. Update any scripts or workflows that relied on
  auto-merge. `scripts/run-local-dev.sh` and `Makefile` updated accordingly.
  See issue #87/M23.
- **BREAKING** `/api/health` now returns a minimal `{"status": "ok"}` payload
  (unauthenticated, for load balancers). Full diagnostics moved to
  `/api/health/detailed` (requires authentication). Scripts that parsed
  `/api/health` for version, sync state, or user count must switch to
  `/api/health/detailed` with an `Authorization` header. CLI commands
  (`da setup test-connection`, `da setup verify`, `da diagnose`, `da status`)
  updated to call `/api/health/detailed` for service-level checks, with
  graceful fallback to the minimal endpoint when auth is not configured.
  See issue #87/M17.
- `release.yml` CI workflow: `build-and-push` job now only runs on `main`
  pushes or manual `workflow_dispatch` triggers. Non-main branch pushes run
  tests only. Added `paths-ignore` for `docs/**`, `*.md`, `LICENSE`.
  See issue #87/M26.
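
For scripts affected by the `/api/health` split, the migration might look like the sketch below: call the detailed endpoint when a token is available and fall back to the minimal one otherwise. The function names, the `http_get` injection point, and the `Bearer` scheme are assumptions; the changelog only states that an `Authorization` header is required.

```python
import json
import urllib.request
from typing import Callable, Optional


def _http_get_json(url: str, headers: dict) -> dict:
    """Default transport: GET `url` and decode the JSON response body."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_health(
    base_url: str,
    token: Optional[str] = None,
    http_get: Callable[[str, dict], dict] = _http_get_json,
) -> dict:
    """Fetch detailed diagnostics when a token is available, else the minimal payload."""
    base = base_url.rstrip("/")
    if token:
        # Assumption: the PAT is sent as a bearer token.
        headers = {"Authorization": f"Bearer {token}"}
        return http_get(f"{base}/api/health/detailed", headers)
    # No credentials configured: only the minimal status payload is available.
    return http_get(f"{base}/api/health", {})
```

Callers that previously read version or sync state from `/api/health` would need to pass a token; without one, only the `status` field can be relied on.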

### Fixed

- **Pre-migration snapshot integrity** — the snapshot file written
  before a v(N-k)→vN migration now captures the true on-disk state
  *before* any DDL runs, instead of the post-self-heal state the
  0.12.0 hoist (#106) introduced. With the unconditional
  `conn.execute(_SYSTEM_SCHEMA)` at the top of `_ensure_schema`, the
  full set of modern-binary tables (`view_ownership`,
  `marketplace_registry`, `user_groups`, `resource_grants`, etc.) was
  materialized first, then `CHECKPOINT` flushed them to disk, and
  `shutil.copy2` copied the already-modified DB as the
  "pre-migration" snapshot — so an operator inspecting the snapshot
  for rollback debugging saw the binary's full table set instead of
  the old schema. Functionally rollback still worked (extra empty
  tables are harmless and re-running migration is idempotent), but
  the snapshot was misleading. Fix: gate the self-heal call on
  `current >= SCHEMA_VERSION`. The split-brain
  (`current > SCHEMA_VERSION`) and same-version safety-net
  (`current == SCHEMA_VERSION`) paths still self-heal as before; the
  migration path (`current < SCHEMA_VERSION`) takes its snapshot first
  and then runs `_SYSTEM_SCHEMA` from inside the existing migration
  block.
- `reset_token` no longer leaks in the JSON response body of
  `POST /api/users/{id}/reset-password`. The `reset_url` still contains the
  token (as intended), but the raw secret is no longer exposed to DevTools,
  proxy logs, or CLI stdout. CLI `admin reset-password` now prints the URL
  instead of the bare token. See issue #82/C5.
- `/api/memory/stats` no longer blocks the async event loop — replaced
  `repo.list_items(limit=10000)` + Python loop with a single SQL
  `GROUP BY` aggregation. See issue #90.
- Magic-link token consumption is now atomic — compare-and-swap pattern
  with a unique `CONSUMED:` marker prevents two concurrent verifies from
  both succeeding. DuckDB concurrent-write conflicts are caught and
  converted to 401. See issue #82/M10.
- Password reset confirm (`POST /auth/password/reset/confirm`) now uses
  the same compare-and-swap pattern as the magic-link flow — closes the
  remaining asymmetry on `users.reset_token` consumption. Lower severity
  than the magic-link race because the reset flow ends with a new
  password (an attacker would need the reset token *and* to race the
  legitimate user) but the consistency closes a polish gap. New
  regression `test_concurrent_reset_only_one_wins` in
  `tests/test_password_flows.py::TestResetConfirm`.
- Upload endpoints (`/sessions`, `/artifacts`) now stream to a temp file with
  cumulative size check instead of buffering the entire body in memory before
  the size cap — prevents OOM from oversized uploads. Temp file handle is
  properly closed before `shutil.move` to avoid FD leaks. See issue #85/M4.
- `/api/upload/local-md` uses a SHA-256 hashed filename instead of raw
  `user_email` — stable per user, no charset surprises from email addresses.
  See issue #85/M4.
- `/auth/bootstrap` 403 message no longer leaks user count. See issue #82/n1.
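
The compare-and-swap consumption used for magic-link and reset tokens in this list can be illustrated with a small sketch. It uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the table layout, function name, and marker length are assumptions; token expiry checks are left out.

```python
import secrets
import sqlite3
from typing import Optional

CONSUMED_PREFIX = "CONSUMED:"


def consume_reset_token(conn: sqlite3.Connection, token: str) -> Optional[int]:
    """Claim `token` atomically; return the user id if this caller won, else None."""
    if not token or token.startswith(CONSUMED_PREFIX):
        return None  # never let a marker value be presented as a token
    marker = CONSUMED_PREFIX + secrets.token_hex(16)
    # Step 1: compare-and-swap. Only a row still holding the original token matches.
    with conn:
        conn.execute(
            "UPDATE users SET reset_token = ? WHERE reset_token = ?", (marker, token)
        )
    # Step 2: confirm that our unique marker is the one that landed.
    row = conn.execute(
        "SELECT id FROM users WHERE reset_token = ?", (marker,)
    ).fetchone()
    if row is None:
        return None  # unknown token, or a concurrent request claimed it first
    # Step 3: the winner clears the marker; the new password would be applied here.
    with conn:
        conn.execute("UPDATE users SET reset_token = NULL WHERE id = ?", (row[0],))
    return row[0]
```

Because the marker is random per request, a caller that loses the race cannot find its own marker in step 2 and therefore never reaches the password change.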

### Internal

- New regression `test_split_brain_future_version_with_missing_tables_self_heals`
  in `tests/test_db.py::TestMigrationSafety` — synthesizes a v99 DB whose only
  table is `schema_version`, runs `_ensure_schema`, asserts that the v13-era
  core tables (`users`, `user_groups`, `user_group_members`, `resource_grants`)
  get materialized *and* that `schema_version` stays at 99 (self-heal without
  falsely advertising a downgrade).
- New regression `test_pre_migration_snapshot_excludes_post_self_heal_tables`
  pins the snapshot-integrity contract: a v2→vN migration's snapshot must not
  contain any post-v2 table from the modern binary. Sanity-checked against the
  pre-fix unconditional hoist — fails with 6 leaked tables.
- `test_future_version_is_noop` docstring updated to reflect that the
  self-heal pass *does* run on a future-version DB, just doesn't touch the
  version row. The test still passes unchanged — its only assertion was the
  version-row contract, which holds.
- `test_no_override_file` regression test asserts `docker-compose.override.yml`
  does not exist post-rename. See issue #87/M23.
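
The ordering contract that the schema regressions above pin can be summarized as a small decision function. This is a toy model only: the function and step names are hypothetical, and fresh-install handling is not covered because the changelog does not describe it.

```python
def ensure_schema_plan(current: int, schema_version: int) -> list:
    """Ordered steps implied by the described gating, for a DB at version `current`."""
    if current >= schema_version:
        # Same-version safety net and future-version (split-brain) DBs:
        # materialize missing tables, leave the schema_version row untouched.
        return ["self_heal"]
    # Migration path: copy the untouched file first, only then run any DDL.
    return ["snapshot", "apply_system_schema", "run_migrations", "write_version"]
```

The point of the gate is that no table-creating statement appears before `snapshot` on the migration path, so the snapshot reflects the old on-disk schema.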

## [0.12.0] — 2026-04-28

### Changed

- `/admin/access` resource tree now visually separates the three-level hierarchy (resource type → block/bucket → item). Each resource-type section gets a colored left stripe and a faint tinted banner; sections are separated by an 8px neutral gap. Stripe colors cycle 4-wide via `nth-child` so adding new resource types to `app/resource_types.py` works without touching CSS. The first-position color is the project primary blue (`#0073D1`), avoiding the violet (`#6366f1`) reserved for granted items.

### Added

- `ResourceType.TABLE` — admins can grant table-level access per `user_group` via the `/admin/access` page. Tables registered in `table_registry` are listed grouped by `bucket`, with the existing per-block "Grant all" / "Revoke all" bulk actions. Listing and grant storage only — runtime enforcement still flows through legacy `dataset_permissions`; the migration plan lives in `docs/TODO-rbac-data-enforcement.md`.
- `AGNES_ENABLE_TABLE_GRANTS` env var (default off) gates the half-built `ResourceType.TABLE` chip. While disabled the chip is hidden from `/admin/access` and `POST /api/admin/grants` returns **422** with the env-var name in `detail` on a TABLE grant attempt. Existing TABLE rows in `resource_grants` stay listable and deletable — the flag controls UI exposure and new-grant acceptance only, never blocks cleanup.
- `da admin break-glass <user>` CLI — recovery path when the operator has locked themselves out of `/admin/access`. Adds the user to the Admin user_group with `source='system_seed'` regardless of RBAC state. Bypasses authentication; relies on filesystem access to `${DATA_DIR}/state/system.duckdb` implying host-level trust. Document this in deployment runbooks alongside `SEED_ADMIN_EMAIL`.

### Internal

- `scripts/seed_dummy_tables.py` — populates `table_registry` with 12 dummy tables across 3 buckets (`in.c-finance`, `in.c-marketing`, `in.c-product`), each with `is_public=False`, for exercising the new `/admin/access` Tables section without a configured data source.
- `/marketplace.zip` short-circuits to `304` before any file IO or ZIP compression on a matching `If-None-Match`. Hot path on every Claude Code SessionStart hook. Backed by an in-process `cachetools.TTLCache` over the resolved-plugins → ETag map (default 120s, env-tunable via `AGNES_MARKETPLACE_ETAG_TTL`, set `0` to disable). `invalidate_etag_cache()` is called by marketplace sync after refresh so the next request re-hashes against new on-disk content instead of waiting for TTL expiry. New explicit dependency: `cachetools>=5.3.0`.
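
The `304` short-circuit for `/marketplace.zip` can be sketched as follows. To stay dependency-free, the sketch uses a small standard-library stand-in instead of `cachetools.TTLCache`, and it derives the ETag by hashing the built ZIP, which is an assumption; the class, function, and parameter names are hypothetical.

```python
import hashlib
import time


class TTLMap:
    """Small stdlib stand-in for cachetools.TTLCache (entries expire after `ttl` seconds)."""

    def __init__(self, ttl: float, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or self.clock() >= entry[1]:
            self._entries.pop(key, None)
            return None
        return entry[0]

    def put(self, key, value):
        if self.ttl > 0:  # a TTL of 0 disables caching
            self._entries[key] = (value, self.clock() + self.ttl)

    def clear(self):
        """Counterpart of the invalidation hook called after a marketplace sync."""
        self._entries.clear()


def serve_marketplace(plugins: tuple, if_none_match, cache: TTLMap, build_zip):
    """Return (status, etag, body); a cached matching ETag answers 304 with no ZIP work."""
    etag = cache.get(plugins)
    if etag is not None and etag == if_none_match:
        return 304, etag, b""
    body = build_zip(plugins)  # expensive path: file IO plus compression
    etag = hashlib.sha256(body).hexdigest()
    cache.put(plugins, etag)
    if etag == if_none_match:
        return 304, etag, b""
    return 200, etag, body
```

Clearing the map after a sync forces the next request to recompute the ETag from fresh content instead of waiting for the TTL to lapse.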

### Fixed

- `/admin/access` group sidebar grant-count badges no longer revert to a stale value when switching between groups. The badge was reading `state.groups[i].grant_count`, a snapshot populated once at `/access-overview` load; toggling a grant only updated the DOM (via `refreshCounts`), not that field, so the next `renderGroups` call (triggered by `selectGroup`) would clobber the live count with the original snapshot. `renderGroups` now derives the count live from `state.grants`, the array that `toggleGrant`/`bulkSet` keep in sync. Server data was always correct — only the in-page badge drifted until refresh.
- `/catalog`, `/admin/tables`, and `/admin/permissions` pages now render the shared top header correctly. The pages include `_app_header.html` (which uses `.app-*` CSS classes) but were not linking `style-custom.css` where those classes are defined; only `dashboard.html` and `base.html` did. Without the stylesheet the nav links, dropdowns, and user menu rendered as unstyled inline text. Added the missing `<link>` to all three templates.
- `PATCH /api/admin/groups/{id}` on a system group now correctly accepts description-only updates while still rejecting renames. The endpoint guard previously short-circuited with `409 "System groups are immutable"` for any mutation, which contradicted the repository layer's narrowed contract (rename-only rejection) — a description-only payload like `{"description": "..."}` would hit the endpoint short-circuit and never reach the repo. The endpoint now 409s only when `payload.name` differs from the existing name; a no-op rename (same name in payload) is dropped from the update before reaching the repo.
- Google OAuth callback no longer wipes a user's `google_sync` group memberships on a transient Workspace API failure. `fetch_user_groups` is fail-soft and returns `[]` for both "no groups" and "API error" — the callback used to feed that empty list into `replace_google_sync_groups`, which deletes all `source='google_sync'` rows for the user and then inserts zero. A login during a transient Cloud Identity hiccup would silently drop every Workspace-synced membership the user had built up. Admin-added memberships (`source='admin'`) were already protected. The callback now skips `replace_google_sync_groups` when the fetch returns empty and logs "preserving existing memberships" instead. Trade-off: a user whose Workspace groups were genuinely cleared keeps stale memberships until the next non-empty sync — accepted until `fetch_user_groups` learns to distinguish empty-success from empty-failure.
- `docker-compose.host-mount.yml` now uses `o: bind,rbind` instead of `o: bind` for the `data` volume. With a plain bind, sub-mounts under `/data` on the host (e.g. the dual-disk layout where sdc is mounted on `/data/state`) are silently shadowed inside the container by an empty subdirectory on the parent disk. The container then writes `system.duckdb` and other state to the wrong disk; the dedicated state disk receives no writes and accumulates only the snapshot left by the migration script. Recursive bind propagates existing sub-mounts at container start, so the container sees the same filesystem the host does. Operators on dual-disk VMs need to copy the live DB from `/var/lib/docker/volumes/agnes_data/_data/state/` (sdb's empty subdir) onto `/data/state/` (sdc) **before** redeploying with the fix, or the next start will surface the stale snapshot.
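
The narrowed guard on `PATCH /api/admin/groups/{id}` described in this list can be sketched as a pure function over the stored group and the request payload. The `is_system` field, the `Conflict` exception, and the function name are hypothetical stand-ins for whatever the endpoint actually uses.

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 response."""


def plan_group_update(existing: dict, payload: dict) -> dict:
    """Return the fields to write, or raise Conflict for a system-group rename."""
    updates = {key: value for key, value in payload.items() if value is not None}
    if existing.get("is_system"):
        if "name" in updates and updates["name"] != existing["name"]:
            raise Conflict("System groups cannot be renamed")
        # A no-op rename (same name) is dropped before it reaches the repository.
        updates.pop("name", None)
    return updates
```

With this shape, a description-only payload on a system group passes through untouched, while only a genuine rename is rejected.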

### Changed

- **BREAKING** Marketplace endpoint (`/marketplace.zip`, `/marketplace.git/*`) no longer god-modes for Admin members. `src.marketplace_filter.resolve_allowed_plugins` now filters every caller — admins included — through `resource_grants`. Admins curate their own marketplace view by granting plugins to the Admin group (or any group they belong to). Existing installs where the only membership on Admin is the admin themselves will see an empty marketplace until grants are added in `/admin/access`. App-level authorization (`require_admin`, `can_access` for non-marketplace types) is unaffected — Admin is still god mode there.
- **BREAKING** RBAC redesigned around two layers: app-level access via the `Admin` user-group (god mode short-circuit) and resource-level access via a generic `(group, resource_type, resource_id)` grant model. The four-value `core.viewer/analyst/km_admin/admin` hierarchy with `implies` BFS expansion is gone — every protected endpoint now uses either `require_admin` or `require_resource_access(ResourceType.X, "{path}")` from the new `app.auth.access` module. Authorization is decided per-request via a single DB lookup; no session cache, no dual-path resolver, no `_hydrate_legacy_role` shim. See `docs/RBAC.md`.
- **BREAKING** `internal_roles`, `group_mappings`, `user_role_grants`, and `plugin_access` tables removed. Replaced by `user_group_members` (binds users to user_groups with a `source` enum: `admin` / `google_sync` / `system_seed`) and `resource_grants` (group → `(resource_type, resource_id)`). Schema v13; the migration backfills from v12 atomically — `users.groups` JSON is converted into `user_group_members` rows with `source='google_sync'`, `core.admin` grants become Admin-group memberships with `source='system_seed'`, and `plugin_access` rows become `resource_grants` of type `marketplace_plugin`. The `users.groups` JSON column is dropped; the deprecated `users.role` column is kept NULL as a legacy artifact.
- **BREAKING** Schema v14 — `user_group_members` and `resource_grants` now declare DuckDB foreign-key constraints on `group_id` (referencing `user_groups.id`). Cascade deletes can no longer leave orphaned member / grant rows pointing at a deleted group. Migration is RENAME → CREATE-with-FK → INSERT → DROP, wrapped in `BEGIN TRANSACTION` so a partial failure rolls back without leaving the DB at a half-applied schema. Forks that touched these tables outside the documented repository APIs need to verify the FK direction matches their writes.
- **BREAKING** Admin REST surface unified under `/api/admin/groups`, `/api/admin/groups/{id}/members`, `/api/admin/grants`, `/api/admin/resource-types`. `app.api.role_management` and `app.api.plugin_access` removed. The web UI routes `/admin/role-mapping` and `/admin/plugin-access` are replaced by a single `/admin/access` page; the `_app_header.html` link is renamed to "Access".
- **BREAKING** CLI subcommands `da admin role *`, `da admin mapping *`, `da admin grant-role`, `da admin revoke-role`, `da admin effective-roles` removed. New subcommands: `da admin group {list,create,delete,members,add-member,remove-member}` and `da admin grant {list,create,delete,resource-types}`. `da admin set-role <user> admin` still works as a thin wrapper that toggles Admin-group membership.
- Module authors no longer call `register_internal_role(...)`. Resource types are an `app.resource_types.ResourceType` `StrEnum` paired with a `ResourceTypeSpec` registered in `RESOURCE_TYPES`; adding a new resource type means adding one enum member, one `list_blocks(conn)` projection delegate, and one spec entry — all in `app/resource_types.py`. The registry drives both `/api/admin/resource-types` and `/api/admin/access-overview`, so there's no second wiring step. No DB migration, no startup hook.
- Google OAuth callback writes Cloud Identity group memberships into `user_group_members` (source='google_sync') instead of `users.groups` JSON. Manual admin-added memberships (source='admin') survive subsequent logins.
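
The two-layer model in this section (Admin-group short-circuit at the app level, grant lookups at the resource level, and no short-circuit for the marketplace) can be sketched with in-memory sets. The function signatures, the `"table"` enum value, and the representation of grants as tuples are assumptions for illustration; the real helpers in `app.auth.access` work against the database.

```python
from enum import Enum


class ResourceType(str, Enum):
    # The changelog describes a StrEnum; (str, Enum) is used here for older Pythons.
    MARKETPLACE_PLUGIN = "marketplace_plugin"
    TABLE = "table"  # value assumed for illustration


ADMIN_GROUP = "Admin"


def can_access(user_groups: set, grants: set, rtype: ResourceType, resource_id: str) -> bool:
    """App-level check: Admin short-circuits for non-marketplace types, else a grant must match."""
    admin_bypass = rtype is not ResourceType.MARKETPLACE_PLUGIN
    if admin_bypass and ADMIN_GROUP in user_groups:
        return True
    return any((group, rtype, resource_id) in grants for group in user_groups)


def resolve_allowed_plugins(user_groups: set, grants: set) -> set:
    """Marketplace view: every caller, admins included, is filtered through grants."""
    return {
        resource_id
        for (group, rtype, resource_id) in grants
        if rtype is ResourceType.MARKETPLACE_PLUGIN and group in user_groups
    }
```

Under this model an admin with no marketplace grants sees an empty plugin set, which matches the upgrade note about granting plugins to the Admin group.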

### Removed

- `app/auth/role_resolver.py`, `app/api/role_management.py`, `app/api/plugin_access.py`.
- `src/repositories/internal_roles.py`, `src/repositories/group_mappings.py`, `src/repositories/user_role_grants.py`.
- `app/web/templates/admin_role_mapping.html`, `app/web/templates/admin_plugin_access.html`.
- `Role` enum + `has_role`, `is_admin`, `is_km_admin`, `is_analyst`, `_is_admin_user_dict`, `set_user_role`, `get_user_role` from `src/rbac.py`. Dataset-access helpers (`can_access_table`, `get_accessible_tables`, `has_dataset_access`) preserved.
- Test files: `test_role_resolver.py`, `test_api_role_management.py`, `test_admin_role_mapping_ui.py`, `test_cli_admin_role.py`, `test_schema_v9_migration.py`, `test_plugin_access_api.py`.

### Internal

- `src/db.py` schema bumped to v13. New helpers `_seed_system_groups` (idempotent Admin/Everyone seed, runs on every connect) and `_v12_to_v13_finalize` (one-shot backfill + DROP cascade) replace `_seed_core_roles` and `_backfill_users_role_to_grants`.
- `app.auth.access` is the new authorization vocabulary: `_user_group_ids`, `is_user_admin`, `can_access`, `require_admin`, `require_resource_access`. Lives in its own module to avoid the circular import that would happen if it sat in `app.auth.dependencies` (the dependency factory needs `get_current_user` from there).
- New `tests/helpers/auth.py::grant_admin(conn, user_id)` — adds a user to the Admin system group so `require_admin` resolves to True. Updated test fixtures across `test_admin_tokens_ui`, `test_password_flows`, `test_pat`, `test_api`, `test_api_complete`, `test_api_scripts`, `test_web_ui` to call it after `UserRepository.create(role="admin")`. The legacy `users.role` column alone is no longer the admin marker.
- Skipped at module level (rewrite required for v13): `test_admin_user_capabilities_ui` (asserts on the removed v9 capabilities UI), `test_marketplace_server_zip` and `test_marketplace_server_git` (depend on the removed `PluginAccessRepository`).
- Skipped individually as v13 behavior changes: `TestScriptRBAC` in `test_security` (scripts are now any-signed-in-user, not analyst+), profile-page tests in `test_web_ui` that asserted `core.analyst` / `Direct grants` / `Effective roles` markers from the dropped role hierarchy.

### Added

- **Schema v10** introduces `view_ownership` to detect cross-connector
  view-name collisions in the master analytics DB (issue #81 Group C).
  When two connectors register the same `_meta.table_name`, the
  orchestrator now refuses to silently overwrite the prior owner's view —
  it logs a `view_ownership collision` ERROR identifying both sources
  and the colliding name, and the second source's view is NOT created.
  Previously this was last-write-wins, which depended on directory
  iteration order and could change deployment-to-deployment. Operators
  resolve a collision by renaming `name` in `table_registry` on one side
  (registry-side aliasing — `source_table` stays unchanged, only the
  view name changes). The orchestrator pre-scans every connector's
  `_meta` at the start of each rebuild and releases stale ownerships
  immediately (when ALL pre-scans succeed; if any fail, reconcile is
  skipped to avoid silently stealing a transient-IO source's name),
  so a renamed table frees its name in the SAME rebuild that introduces
  the rename — no two-step waits needed. New module
  `src/repositories/view_ownership.py` exposes the repository.
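
The ownership rules described for `view_ownership` (first owner keeps the name, stale ownerships are released only when every pre-scan succeeded) can be modeled in a few lines. This is a sketch over plain dictionaries with a hypothetical function name; the actual repository persists ownership in the database.

```python
def assign_views(prescans: dict, owners: dict) -> tuple:
    """Apply the ownership rules for one rebuild.

    prescans: source -> list of view names, or None if that source's pre-scan failed.
    owners:   current view name -> owning source.
    Returns (new_owners, collisions), where each collision is (name, holder, loser).
    """
    owners = dict(owners)
    if all(names is not None for names in prescans.values()):
        # Reconcile: release names whose owner no longer declares them.
        for name, source in list(owners.items()):
            if name not in (prescans.get(source) or []):
                del owners[name]
    collisions = []
    for source, names in prescans.items():
        for name in names or []:
            holder = owners.setdefault(name, source)
            if holder != source:
                # The second source's view is not created; the prior owner keeps the name.
                collisions.append((name, holder, source))
    return owners, collisions
```

For example, if source `a` renames `orders` to `orders_a` while source `b` declares `orders`, the reconcile step frees `orders` and `b` acquires it in the same pass. If the pre-scan of `a` fails instead, the reconcile step is skipped and `b` is reported as colliding.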

### Changed

- **BREAKING (ops)**: Keboola extractor now exits with three distinct
  codes instead of two (issue #81 Group B / M14): `0` = full success,
  `1` = full failure, `2` = **partial** failure (some tables succeeded,
  some failed). Previously `exit(0)` fired even when 9 of 10 tables
  failed, masking partial failures from the sync API and any operator
  alerting hooked to non-zero exit codes. The sync API
  (`POST /api/sync/trigger`) now logs `PARTIAL FAILURE (exit 2)` as a
  data-quality alert (distinct from `FAILED (exit 1)`) and continues to
  the orchestrator rebuild step — successful tables from this run plus
  unchanged tables from previous runs stay queryable. Operators whose
  alerting treated any non-zero exit as a hard error must teach it that
  exit 2 is a partial-failure signal, not a deploy failure.
- **BREAKING (security)**: The entire Script API is now **admin-only** (issue #44).
  `GET /api/scripts`, `POST /api/scripts/deploy`, `POST /api/scripts/run`, and
  `POST /api/scripts/{id}/run` all require the admin role; previously the list
  endpoint was open to any authenticated user and deploy/run were analyst-accessible.
  Two reasons: (1) the AST + string-blocklist sandbox in `_execute_script` is
  defense-in-depth and known to be bypassable through introspection chains
  (`__class__.__base__.__subclasses__()`, `__globals__['__builtins__']`,
  `__mro__` traversal — the dunder pattern list was tightened in this PR but
  the policy is "the role gate is the trust boundary, not the blocklist");
  (2) gating only `/run` left a planted-script attack open — an analyst could
  deploy a malicious script and wait for an admin to run it. Operators who
  need scripted workflows for non-admin users should run them on the user's
  behalf or expose the relevant data via the read-only `/api/data` surface
  instead. **Migration for cron / scheduler PATs:** if a non-admin PAT is
  wired into a scheduler that hits `/api/scripts/{id}/run` or
  `/api/scripts/run`, the request now returns 403. Add the PAT user to the
  Admin group via `/admin/access` or
  `da admin group add-member Admin <pat-user-email>`. PATs themselves do not
  need re-issuing — group membership is read at request time.
- **BREAKING (ops)**: Generic ops scripts moved out of the customer-named
  `scripts/grpn/` directory into `scripts/ops/` as part of the OSS
  vendor-neutralization (issue #88):
  - `scripts/grpn/agnes-tls-rotate.sh` → `scripts/ops/agnes-tls-rotate.sh`
  - `scripts/grpn/agnes-auto-upgrade.sh` → `scripts/ops/agnes-auto-upgrade.sh`

  Downstream consumer infra repos that copy these scripts onto VMs (e.g. via
  their own `startup.sh`) must update the source path. The OSS-shipped
  `infra/modules/customer-instance/` Terraform module is unaffected — it
  embeds equivalent logic inline via heredoc and does not source-by-path
  from `scripts/`. Script behaviour and env vars are unchanged. Cross-refs
  in `README.md`, `CLAUDE.md`, `docs/DEPLOYMENT.md`, `Caddyfile`, and
  `docker-compose.yml` were updated.
- **OSS neutralization (wave 2 — code, tests, planning docs)**. Customer
  identifiers replaced with placeholders across the codebase to ready the
  repo for public release (issue #88):

  - **Code docstrings**: `connectors/openmetadata/{client,transformer,enricher}.py`,
    `src/catalog_export.py`, `scripts/duckdb_manager.py` — `prj-grp-…` →
    `my-bq-project` / `prj-example-1234`, `AIAgent.FoundryAI` →
    `AIAgent.MyAgent` (in docstrings) / `AIAgent.Example` (in test fixtures),
    `FoundryAIDataModel` → `AnalyticsDataModel`.
  - **Test fixtures** in `tests/test_openmetadata_enricher.py`,
    `tests/test_duckdb_manager.py`, `tests/test_catalog_export.py`,
    `tests/test_openmetadata_transformer.py` — same set of replacements,
    behaviour-preserving (157 tests still green).
  - **Terraform module** `infra/modules/customer-instance/variables.tf`:
    `customer_name` description rewritten in English, examples switched
    from `keboola, grpn` to `acme, example`.
  - **Workflow** `.github/workflows/keboola-deploy.yml`: comment "Groupon-side
    dev VMs" → generic "per-developer dev VMs".
  - **Caddyfile**: TLS-rotation cross-ref updated to `scripts/ops/…` and
    Keboola-specific aside removed.
  - **Auth docs** `docs/auth-groups.md` and the OAuth probe in
    `scripts/debug/probe_google_groups.py`: GCP project name `kids-ai-data-analysis`
    replaced with placeholder `acme-internal-prod`.
  - **Planning docs** under `docs/superpowers/plans/` and `…/specs/`: the
    five hackathon-era documents (`2026-04-21-deployment-log.md`,
    `…-multi-customer-deployment.md`, `…-issues-14-and-10.md`,
    `…-hackathon-dry-run.md`, the spec) had `34.77.94.14` / `34.77.102.61`
    replaced with `<dev-vm-ip>` / `<prod-vm-ip>`, `Groupon`/`GRPN`/`grpn`
    with `Acme`/`another-customer`, and `prj-grp-…` with `prj-example-…`.

### Fixed

- **BREAKING (security CRITICAL)**: Jira webhook handler is now
  fail-closed (issue #83). Previously, if `JIRA_WEBHOOK_SECRET` was
  unset, `_verify_signature` returned `True` and any unauthenticated
  POST to `/webhooks/jira` could trigger the full ingest pipeline. The
  handler now returns **503** when the secret is missing
  (operator-misconfiguration signal, distinct from 401 wrong-signature).
  Operators relying on the no-secret = accept-everything mode (don't —
  it was never documented) must set `JIRA_WEBHOOK_SECRET` before this
  merges.
- **Security (CRITICAL)**: Jira issue keys arriving via webhooks are now
  validated against the canonical `^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}\Z` format
  (`[0-9]` not `\d` to refuse non-ASCII Unicode digits, `\Z` not `$` to
  refuse trailing newlines that `$` would tolerate)
  before any filesystem operation (issue #83). Previously, `issue_key` flowed
  unsanitized into `connectors/jira/service.py` (`save_issue`,
  `download_attachment`, `_handle_deletion`, `process_webhook_event`) and
  `connectors/jira/incremental_transform.py`, enabling path traversal
  (`../../etc/passwd` style writes outside the Jira data dir). New module
  `connectors/jira/validation.py` provides `is_valid_issue_key` (regex
  whitelist; underscore deliberately excluded — Atlassian rejects underscores
  in real project keys) and `safe_join_under` (`Path.resolve()` containment
  check). Both are enforced at every filesystem boundary, defense-in-depth.
- **Security (CRITICAL)**: `webhookEvent` (the second attacker-controlled field
  in Jira webhook payloads) was used as a filename component in
  `_log_webhook_event` without sanitization (issue #83 reviewer follow-up).
  A payload with `webhookEvent: "../../tmp/pwn"` could write a JSON dump
  outside `WEBHOOK_LOG_DIR`. The handler now strips everything that isn't
  `[A-Za-z0-9_-]` (dot deliberately excluded to defeat `..` survival),
  clips length to 64 chars, and routes the final filename through
  `safe_join_under`.
- **Security (CRITICAL)**: hardened the connector → orchestrator trust
  boundary on BOTH the rebuild path
  (`src/orchestrator.py::_attach_remote_extensions`) AND the read-only
  query path (`src/db.py::_reattach_remote_extensions`, called by
  `get_analytics_db_readonly()` on every request) — issue #81 Group A.
  Three fixes: (1) DuckDB extensions referenced by `_remote_attach` are
  matched against a hard allowlist (default: `keboola, bigquery`;
  override via `AGNES_REMOTE_ATTACH_EXTENSIONS`). Install path splits
  built-in (LOAD only) from community (`INSTALL FROM community; LOAD`
  on rebuild path; LOAD only on the read-only query path which must
  not touch the network). (2) `token_env` names are matched against a
  hard allowlist (default: `KBC_TOKEN`, `KBC_STORAGE_TOKEN`,
  `KEBOOLA_STORAGE_TOKEN`, `GOOGLE_APPLICATION_CREDENTIALS`; override
  via `AGNES_REMOTE_ATTACH_TOKEN_ENVS`). Names must additionally match
  `^[A-Z][A-Z0-9_]{0,63}$`. A malicious connector cannot ask the
  orchestrator to read `JWT_SECRET_KEY` / `SESSION_SECRET` /
  `OPENAI_API_KEY` and exfiltrate them via `ATTACH ... TOKEN`.
  (3) The URL passed to `ATTACH` is now single-quote-escaped on both
  paths. Also fixed a `table_schema` vs `table_catalog` mismatch that
  silently no-op'd `_attach_remote_extensions` for every connector
  (the rebuild-path hardening would have been moot in production
  without this fix). New module `src/orchestrator_security.py`
  centralises the policy and exposes `log_effective_policy()`, called
  from app startup so an operator's typo in
  `AGNES_REMOTE_ATTACH_EXTENSIONS` (which **replaces** the default,
  not extends it — a setting of `httpfs` would silently lock out
  `keboola, bigquery`) is visible at boot rather than at the next
  failed attach. See
  `docs/superpowers/plans/2026-04-27-issue-81-trust-boundary.md`.
- **Security (MEDIUM)**: extractor-side identifier validation (issue
  #81 Group D / M15). The Keboola and BigQuery extractors interpolate
  `table_name`, `bucket` / `dataset`, and `source_table` from
  `table_registry` directly into `CREATE OR REPLACE VIEW`,
  `INSERT INTO _meta`, and `COPY ... TO` SQL. Anyone with write access
  to `table_registry` (admin, registry-write API) could inject SQL via
  these identifiers. New shared module `src/identifier_validation.py`
  exposes a strict `validate_identifier` (for our own view names —
  `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`, used for `table_name` so it matches
  the orchestrator's rebuild-time check and dashed names fail fast at
  extraction rather than being silently dropped at rebuild) and a
  relaxed `validate_quoted_identifier` (for upstream-typed names like
  Keboola `in.c-foo` / BigQuery `my-dataset`:
  `[a-zA-Z0-9_][a-zA-Z0-9_.\-]*`, refusing any character that could
  close a `"..."` identifier literal). The orchestrator's existing
  `_validate_identifier` was lifted into the new module so both layers
  share a single source of truth; both extractors skip-and-continue on
  unsafe rows (logged + counted in failure stats; the rest of the
  registry still processes).
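
The Jira path-traversal entries in this list can be illustrated with a minimal Python sketch. The names `is_valid_issue_key` and `safe_join_under` and the issue-key pattern are taken from those entries; the function bodies, the `sanitize_event_name` helper, and the example directory are assumptions for illustration and are not the code shipped in `connectors/jira/validation.py`.

```python
import re
from pathlib import Path

# Pattern quoted in the changelog entry: [0-9] rejects non-ASCII digits,
# and \Z rejects a trailing newline that $ would tolerate.
ISSUE_KEY_RE = re.compile(r"^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}\Z")


def is_valid_issue_key(key: object) -> bool:
    """Return True only for strings shaped like PROJ-123."""
    return isinstance(key, str) and ISSUE_KEY_RE.match(key) is not None


def safe_join_under(base: Path, *parts: str) -> Path:
    """Join parts under base and refuse any result that resolves outside it."""
    root = base.resolve()
    candidate = root.joinpath(*parts).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"path escapes {root}: {candidate}")
    return candidate


def sanitize_event_name(raw: str, max_len: int = 64) -> str:
    """Hypothetical helper: drop everything outside [A-Za-z0-9_-], then clip."""
    return re.sub(r"[^A-Za-z0-9_-]", "", raw)[:max_len]
```

In this sketch `is_valid_issue_key("PROJ-123\n")` is rejected because of `\Z`, and `safe_join_under(Path("/tmp/jira-data"), "..", "evil")` raises `ValueError`, so a caller that applies both checks keeps two independent barriers in front of each write.
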
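
The `token_env` allowlist from the trust-boundary entry can be sketched as follows. The default names, the `^[A-Z][A-Z0-9_]{0,63}$` pattern, the `AGNES_REMOTE_ATTACH_TOKEN_ENVS` variable, and the replace-not-extend behaviour come from that entry (which states it for the extensions variable; applying it to token envs is an assumption). The comma-separated override format and every function name here are also assumptions and do not describe the actual `src/orchestrator_security.py` API.

```python
import os
import re

DEFAULT_TOKEN_ENVS = frozenset({
    "KBC_TOKEN",
    "KBC_STORAGE_TOKEN",
    "KEBOOLA_STORAGE_TOKEN",
    "GOOGLE_APPLICATION_CREDENTIALS",
})
ENV_NAME_RE = re.compile(r"^[A-Z][A-Z0-9_]{0,63}$")


def allowed_token_envs(environ=os.environ) -> frozenset:
    """Return the effective allowlist; an override replaces the default set."""
    raw = environ.get("AGNES_REMOTE_ATTACH_TOKEN_ENVS", "").strip()
    if not raw:
        return DEFAULT_TOKEN_ENVS
    # Assumed format: comma-separated names.
    return frozenset(name.strip() for name in raw.split(",") if name.strip())


def is_token_env_allowed(name: str, environ=os.environ) -> bool:
    """A name must match the shape check and be on the effective allowlist."""
    return bool(ENV_NAME_RE.match(name)) and name in allowed_token_envs(environ)


def escape_single_quotes(value: str) -> str:
    """Double single quotes so the value can sit inside a '...' SQL literal."""
    return value.replace("'", "''")
```

With an empty environment, `is_token_env_allowed("JWT_SECRET_KEY", {})` is false while `KBC_TOKEN` passes; once an override is set, the defaults stop passing, which is the lock-out that logging the effective policy at boot is meant to make visible.
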
### Removed
- Customer-specific manual-deploy helper `scripts/grpn/Makefile` and its
  README, plus the corresponding hackathon deploy log under
  `docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md`. These
  documented one operator's hand-rolled stopgap for an org-policy-blocked
  Terraform flow and do not belong in vendor-neutral OSS.
- `scripts/switch-dev-vm.sh` — hackathon-era helper hardcoded to a specific
  shared dev VM. Per-developer dev VMs are the supported pattern now;
  operators who need an equivalent should use
  `gcloud compute ssh <vm> --command "sed -i …/.env && sudo /usr/local/bin/agnes-auto-upgrade.sh"`
  with their own VM details.

### Internal

- Sandbox blocklist now flags introspection-chain dunders explicitly:
  `__subclasses__`, `__globals__`, `__class__`, `__base__`, `__bases__`,
  `__mro__`, `__dict__`, `__code__`, `__builtins__`. `__init__` and
  `__getattribute__` are intentionally **not** in the list — substring match
  would flag every legitimate `def __init__(self):`. The chain breaks at
  the next link anyway.
- New regression test `test_run_pwn_payload_blocked` parametrized over the
  exact PoC from issue #44 plus two equivalent variants (lambda+`__globals__`,
  `__mro__` traversal). If the dunder list is silently weakened in a future
  refactor, the test fails. New `test_*_requires_admin` tests parametrized
  over all three non-admin core roles (analyst, viewer, km_admin).
- `tests/conftest.py::seeded_app` extended with `viewer_token` and
  `km_admin_token` so role-gating tests cover all four core roles.
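
The substring-match behaviour described for the sandbox blocklist can be illustrated with a small sketch. The dunder names are the ones listed in the entry; the function name `find_blocked_dunders` and the sample payloads are hypothetical, and the real sandbox may match differently.

```python
BLOCKED_DUNDERS = (
    "__subclasses__", "__globals__", "__class__", "__base__", "__bases__",
    "__mro__", "__dict__", "__code__", "__builtins__",
)


def find_blocked_dunders(source: str) -> list[str]:
    """Return the blocked names that appear anywhere in the script text."""
    # Plain substring search: this is why __init__ cannot be on the list,
    # since every ordinary class definition would be flagged.
    return [name for name in BLOCKED_DUNDERS if name in source]


ordinary = "class Job:\n    def __init__(self):\n        self.done = False\n"
traversal = "().__class__.__mro__[1].__subclasses__()"

print(find_blocked_dunders(ordinary))   # nothing flagged
print(find_blocked_dunders(traversal))  # several chain links flagged
```

An introspection chain has to pass through at least one listed name, so leaving `__init__` off the list costs little, while the role gate remains the actual trust boundary.
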

### Migrated

- **Schema bumped from v9 to v10**. Auto-migration applies on next start
  (creates the `view_ownership` table; data on disk is unaffected). The
  pre-migration snapshot machinery (added at v8→v9) covers v9→v10 too —
  if anything goes wrong during the migration, the snapshot at
  `<DATA_DIR>/state/system.duckdb.pre-migrate` lets you roll back.

## [0.11.5] — 2026-04-27

Follow-up release for PR #73: addresses four rounds of Devin AI review on the role-management-complete branch. No new public-API surface; the user-visible payoff is that v8→v9-migrated installations now work end-to-end (login flows, user list, admin nav, privilege revocation), and `make local-dev` startup is finally quiet.

### Fixed

- **Privilege retention after grant revocation via the new REST API** (Devin review #73). `_hydrate_legacy_role` previously short-circuited on a truthy `user.get("role")`. The role-management endpoints (`POST/DELETE /api/admin/users/{id}/role-grants`, plus the `changeCoreRole` UI flow) only mutate `user_role_grants` — they don't touch the legacy `users.role` column. After a downgrade-via-API, the stale legacy value would keep `user["role"] = "admin"` in memory; `_is_admin_user_dict` and the catalog/sync admin-bypass short-circuits then silently retained elevated table access even though `require_internal_role` correctly denied the API gates. Fix: always re-resolve from `user_role_grants` regardless of the legacy column, making the grants table the single source of truth on every authenticated request. Cost: one DB round-trip per request (same as the existing PAT-aware fallback).
- **Dev-bypass + OAuth callback dropped direct grants from the session cache** (Devin review #73). Both call sites passed `external_groups` only to `resolve_internal_roles`, never the user's id — so `user_role_grants` rows were resolved on the per-request DB-fallback path inside `require_internal_role` instead of the cache. Functionally correct, but every admin-gated request paid a DB round-trip and the dev-bypass log line read "resolved 0 internal role(s)" for an obviously-admin user, which was confusing during debugging. Fix: pass `user_id` so the cache reflects the union at sign-in.
- `GET /api/users` returned **HTTP 500** for any v8→v9-migrated installation. The migration NULL-s legacy `users.role` (kept as a deprecated artifact because DuckDB FK blocks DROP COLUMN), but `UserResponse.role` is a required `str` Pydantic field — every user listing failed validation. `/admin/users` showed only "Failed to load users" and the new `/admin/users/{id}` Detail link was unreachable. Fix: route every user dict returned by the API through `_hydrate_legacy_role` (same shim already used by `get_current_user`), which derives the legacy enum value from `user_role_grants` for migrated users. Also fixes a quieter dual of the same bug — `target["role"] == "admin"` short-circuits in `update_user`/`delete_user` would silently no-op on migrated admins, letting the operator demote/delete the last admin against the documented protection.
- **Scheduler log-noise**: every cron tick produced a `POST /auth/token 401 Unauthorized` access-log line because the scheduler's auto-fetch fallback was always broken — it called `/auth/token` with just an email, but the endpoint requires email + password. Fix: removed the auto-fetch path entirely. Operators set `SCHEDULER_API_TOKEN` (a long-lived PAT) in production; in `LOCAL_DEV_MODE` the dev-bypass auto-authenticates the un-tokenized request, so jobs continue to work.
- **HTTP 500 on `POST /auth/token` for v8-migrated users** (Devin review #73 round 3). `TokenResponse.role` is a required `str` Pydantic field, but the v8→v9 migration NULL-s the legacy `users.role` column for every existing user. The login endpoint passed the raw NULL through to Pydantic, raising `ValidationError` → 500. Same root cause produced semantically wrong (but non-crashing) JWTs from Google OAuth, password, and email-magic-link flows — they wrote `role: null` into the issued token; downstream `_hydrate_legacy_role` in `get_current_user` would correct the per-request view, but the token payload itself stayed misleading. Fix: hydrate inline in each login flow before reading `user["role"]` — `app/auth/router.py` (`POST /auth/token`), `app/auth/providers/google.py` (OAuth callback), `app/auth/providers/password.py` (5 flows: JSON login, web login, JSON setup, web reset, web setup), and `app/auth/providers/email.py` (centralized in `_consume_token`, covers both magic-link `/verify` endpoints). New regression class `TestAuthLoginFlowsPostMigration` in `tests/test_schema_v9_migration.py` pins both the no-crash and the correct-role contracts for all four legacy levels (viewer/analyst/km_admin/admin).
- **`docs/RBAC.md` documented an `implies=[…]` keyword on `register_internal_role()` that the function doesn't accept** (Devin review #73 round 3). A module author copying the example would hit `TypeError: got an unexpected keyword argument 'implies'` at import time. Reality: `implies` is currently seeded only for the `core.*` hierarchy via `_seed_core_roles` in `src/db.py` — the registry-side write path doesn't exist yet. Rewrote the *Implies hierarchy* and *Module-author workflow* sections to document what's actually supported in 0.11.4 and what a future change would need to add.
- **`_seed_core_roles` was advertised as a per-connect safety net but only ran during fresh installs and the v8→v9 migration** (Devin review #73 round 4). The docstring promised "called from `_ensure_schema` on every connect" so an accidental `DELETE FROM internal_roles WHERE key = 'core.admin'` (or a doc-tweak release that updated `_CORE_ROLES_SEED` without bumping the schema version) would self-heal on the next process start. In reality both call sites lived inside `if current < SCHEMA_VERSION:` — once the DB was on v9, the seed function never ran again, leaving any deletion permanent and any in-code `display_name`/`description`/`implies` change requiring a manual SQL deploy. Fix: added an unconditional tail call to `_seed_core_roles(conn)` at the bottom of `_ensure_schema`, gated only by `current <= SCHEMA_VERSION` so the future-version-rollback contract still holds. New regression class `TestSeedCoreRolesSafetyNet` in `tests/test_schema_v9_migration.py` pins all three contracts (deleted row re-seeds, mutated `display_name` re-syncs from code, `applied_at` doesn't churn on already-current DBs).
- **`make local-dev` startup spammed an `AuthlibDeprecationWarning` from upstream's own `_joserfc_helpers.py`** every time `app/auth/providers/google.py` triggered the `from authlib.integrations.starlette_client import OAuth` import chain. The warning is upstream-internal — authlib telling itself to migrate from `authlib.jose` to `joserfc` before its 2.0 cut — and isn't actionable on our side until either authlib ships the fix or we rewrite OAuth on top of `joserfc` directly. Filtered the specific warning class at the top of `app/main.py` (with a message-based fallback if the class moves in a future authlib release) so the warning no longer pollutes operator-facing stdout. Other `DeprecationWarning`s remain visible.
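
The class-first, message-fallback filter described in the last entry can be sketched with the standard `warnings` module. The helper name `ignore_warning`, the dotted-path lookup, and the message pattern are assumptions for illustration. The real filter in `app/main.py` may be written differently, and the dotted path of authlib's warning class is not given in this changelog.

```python
import importlib
import warnings


def ignore_warning(dotted_class: str, message_pattern: str) -> str:
    """Ignore one warning class, or fall back to a message match.

    Returns "class" or "message" to show which filter was installed.
    """
    module_name, _, class_name = dotted_class.rpartition(".")
    try:
        category = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError, ValueError):
        # The class moved or the package is absent: match on the text instead.
        warnings.filterwarnings(
            "ignore", message=message_pattern, category=DeprecationWarning
        )
        return "message"
    warnings.filterwarnings("ignore", category=category)
    return "class"
```

Both branches install a narrow filter, so other `DeprecationWarning`s still reach the operator.
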

### Added

- **`/profile` now self-services every user's role situation.** Three new sections rendered server-side for *all* signed-in users (not just admins): *Effective roles* (the full resolver output as chip cloud — direct grants ∪ group-derived ∪ implies-expanded), *Direct grants* (rows in `user_role_grants` with source label: `auto-seed` from v8 backfill vs. `direct` admin grant), and *Roles via groups* (which Cloud Identity / dev group grants which role for the current user). Non-admins finally see *why* a particular feature is or isn't accessible without asking an admin to read the DB. Admins additionally see a deep-link to `/admin/users/{id}` for editing their own grants in place.
- **`/admin/role-mapping` group ID picker.** A new "Known groups" panel above the create-mapping form surfaces clickable chips of group IDs known to the system: the calling admin's own `session.google_groups` (with human-readable names + a "your group" tag) merged with distinct `external_group_id`s already used in existing mappings (tagged "already mapped"). Click a chip → fills the form's external-group-id input and focuses the role select. Empty-state copy points the operator at `LOCAL_DEV_GROUPS` / Google sign-in when the picker is empty, instead of leaving them to guess Cloud Identity opaque IDs from memory.

### Changed

- Renamed `docs/internal-roles.md` → **`docs/RBAC.md`**. Standard industry term, more discoverable for engineers grepping for "RBAC" in a new repo. Added Quickstart-by-role sections (operator / end-user / module author) and a step-by-step *Module-author workflow* with code examples for registering a key, gating endpoints, declaring implies hierarchies, and writing a contract test against the gate. Cross-references in code (`app/api/admin.py`, `tests/test_role_resolver.py`) updated. `CLAUDE.md` now points contributors at the new doc from the *Extensibility → RBAC* section. Historical CHANGELOG entries (`[0.11.3]` / `[0.11.4]` body) keep the original `internal-roles.md` filename — they describe what shipped at that version and aren't retro-edited.

## [0.11.4] — 2026-04-27

Role-management complete release. Unifies the legacy `users.role` enum (viewer/analyst/km_admin/admin) with the v8 internal-roles foundation under one model with implies hierarchy, ships admin UI + REST API + CLI for managing both group mappings and direct user grants, and wires `require_internal_role` for PAT-aware resolution so admin endpoints work uniformly across OAuth and headless callers.

### Added

- **Schema v9 — unified role model.** New `user_role_grants(user_id, internal_role_id, granted_by, source)` table for direct user→role assignments (complementary to `group_mappings` which assigns via Cloud Identity group). Two new columns on `internal_roles`: `implies` (JSON array of role keys this role transitively grants) and `is_core` (BOOL, distinguishes seeded core.* hierarchy from module-registered roles). Migration v8→v9 seeds four `core.*` rows (`core.viewer/analyst/km_admin/admin`) with the legacy hierarchy as `implies` (`core.admin → core.km_admin → core.analyst → core.viewer`), backfills one `user_role_grants` row per existing user mirroring their pre-v9 `users.role` value (`source='auto-seed'`), and NULLs the legacy column.
- **PAT-aware `require_internal_role`.** Two-path resolution: session cache first (OAuth flow), DB-backed `user_role_grants` fallback (PAT/headless flow). Admin CLI scripts now hit gated endpoints uniformly without an OAuth round-trip. The PAT-specific 403 message from 0.11.3 is removed — PAT now legitimately resolves through direct grants.
- **Implies expansion at resolve time.** New `expand_implies(role_keys, conn)` helper in `app.auth.role_resolver` does BFS over the `implies` graph; `resolve_internal_roles` calls it at the end so a single `core.admin` grant expands to the full four-level hierarchy automatically.
- **Dotted role-key namespace.** Regex extended to allow `core.admin`, `context_engineering.admin`, `corporate_memory.curator` style keys (max 64 chars, lower-snake-case segments separated by dots). The owner_module column should match the prefix before the first dot.
- **REST API for role management.** New router `app/api/role_management.py` under `/api/admin`: `GET/POST/DELETE` on `group-mappings`, `users/{id}/role-grants`, plus `GET internal-roles` and `GET users/{id}/effective-roles` (debug). All gated by `require_internal_role("core.admin")` — works for both OAuth admins (cookie) and admin PATs.
- **Admin UI `/admin/role-mapping`.** Browse internal roles, manage Cloud Identity group → role mappings (table view + create/delete forms). User detail page extended with three sections: *Core role* (single-select for `core.*`), *Additional capabilities* (multi-checkbox for module roles), *Effective roles* (debug view of direct + group-derived + expanded set).
- **`da admin` CLI subcommands.** `role list`, `role show <key>`, `mapping list/create/delete`, `grant-role <email> <key>`, `revoke-role <email> <key>`, `effective-roles <email>`. All run over PAT — use them in CI scripts to grant/revoke roles without going through the browser.
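
The implies expansion described in this list is a breadth-first walk over a small graph. A minimal sketch follows, assuming an in-memory dict in place of the `implies` column that the real `expand_implies(role_keys, conn)` reads through a DuckDB connection; the dict, the second parameter, and the return type are illustrative only.

```python
from collections import deque

# Hypothetical stand-in for the `implies` column of `internal_roles`,
# seeded with the core hierarchy named in the changelog.
IMPLIES = {
    "core.admin": ["core.km_admin"],
    "core.km_admin": ["core.analyst"],
    "core.analyst": ["core.viewer"],
    "core.viewer": [],
}


def expand_implies(role_keys, implies=IMPLIES):
    """Return role_keys plus every role reachable through the implies graph."""
    seen = set(role_keys)
    queue = deque(role_keys)
    while queue:
        current = queue.popleft()
        for implied in implies.get(current, []):
            if implied not in seen:  # the seen set also stops cycles
                seen.add(implied)
                queue.append(implied)
    return seen


print(sorted(expand_implies(["core.admin"])))
```

A single `core.admin` grant expands to all four core levels, while a key with no `implies` entry, such as a module role, comes back unchanged.
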

### Changed

- **BREAKING (semantics, not API).** `users.role` column NULL-ed during v8→v9 migration. Reads via `UserRepository.get_by_*` still return the column but the value is always NULL after upgrade — code reading `user["role"]` directly in business logic gets `None`. The legacy `Role` enum (`Role.VIEWER/ANALYST/KM_ADMIN/ADMIN`) and convenience helpers (`is_admin`, `has_role`, etc. in `src/rbac.py`) continue to work — they now read from `user_role_grants` via the resolver. Sweeping `user.get("role") == "admin"` checks were rewritten to the new helper. The column itself is preserved physically because DuckDB rejects DROP COLUMN while a FK references the table; physical drop is deferred to a future schema-rebuild migration.
- `require_role(Role.X)` and `require_admin` are now thin wrappers over `require_internal_role(f"core.{role}")`. Behavior identical for OAuth users (admin role from group_mappings); PAT users now succeed when they hold a direct `core.admin` grant.
- `UserRepository.create()` and `update()` mirror role changes into `user_role_grants` automatically (`_grant_core_role` helper); existing setup code keeps working without changes.
- `UserRepository.delete()` pre-deletes `user_role_grants` rows (DuckDB FK doesn't auto-cascade).
- `UserRepository.count_admins()` reads `user_role_grants ⨝ internal_roles WHERE key='core.admin'` — the legacy `users.role = 'admin'` count would always return 0 after backfill.
- `app/api/admin.py` module-level docstring documents the v9 pattern for module authors who want to add their own capability gates.
- `docs/internal-roles.md` rewritten to remove the v8 "no UI yet" caveat, document the implies hierarchy, the dual session/DB resolution pathway, and the dotted-namespace key convention.

### Removed

- `require_internal_role`'s session-only enforcement (the v8 *"This endpoint needs an interactive (OAuth) session — Bearer/PAT tokens do not carry session-resolved roles"* error message). PAT clients with a matching `user_role_grants` row now pass the gate uniformly.

### Internal

- New `UserRoleGrantsRepository` in `src/repositories/user_role_grants.py` mirrors the style of `GroupMappingsRepository` (list/get/create/delete + per-user / per-role indices).
- INFO-level audit log on grant + mapping mutations (action strings: `role_mapping.created/deleted`, `role_grant.created/deleted`, resource `mapping:<id>` / `grant:<id>`).
- "Last admin protection" on `DELETE /api/admin/users/{id}/role-grants/{grant_id}`: refuses to delete the final `core.admin` grant in the system (mirrors existing `count_admins` protection on user deletion / deactivation).

## [0.11.3] — 2026-04-26

Authorization-foundation release — adds the internal-roles layer between Cloud Identity groups and per-module capability checks. Schema v8 migration; no admin UI yet (follow-up).

### Added

- **Internal roles + group mapping (foundation).** Schema v8 adds two tables: `internal_roles` (app-defined capabilities like `context_admin`, `agent_operator`, registered by Agnes modules at import time) and `group_mappings` (many-to-many bindings of Cloud Identity group IDs to internal role keys, managed by admins). New `app.auth.role_resolver` module exposes `register_internal_role(...)` for module authors, `sync_registered_roles_to_db(...)` (run once at startup, idempotent), `resolve_internal_roles(external_groups, conn)` (called at sign-in, writes resolved keys into `session["internal_roles"]`), and a `require_internal_role("…")` FastAPI dependency factory for permission checks. Resolution runs at sign-in (Google OAuth callback + dev-bypass — populates on first request and whenever external groups change, mirroring the OAuth callback's always-write semantics). No DB hit per request. Refresh requires re-login, same semantics as `session.google_groups`. **No admin UI yet** — mapping rows must be created via the repository directly until the management UI ships in a follow-up. PAT/headless clients carry no session and therefore cannot pass `require_internal_role` gates by design — `require_internal_role` distinguishes "signed-in but missing role" from "no session at all" and surfaces a PAT-specific 403 detail in the second case so an API consumer hitting the wall sees what to fix. See `docs/internal-roles.md` → *PAT and headless requests*.

### Changed

- `docs/internal-roles.md` documents `Admin → Users → deactivate then reactivate` as the supported "force re-resolve now" lever for users you can't get to log out (long-lived sessions, automated clients) — invalidates the existing session and forces a fresh sign-in on the next request.

### Internal

- INFO-level audit log on every successful resolve (OAuth callback + dev-bypass) so a "wrong role" complaint is debuggable from the log alone — admin can correlate "user X claims they lost access" with the resolver output without replaying the request.
- Startup warning when `SESSION_SECRET` is shorter than 32 chars, matching the existing `JWT_SECRET_KEY` gate. Both HMAC surfaces sign trust-laden state (`session.internal_roles`, `session.google_groups`, JWTs) — keeping the two gates consistent so a weak secret gets surfaced at boot, not after a quiet downgrade.
- `_clear_registry_for_tests()` now refuses to run unless `TESTING=1` so a stray import path in production can't drop the registered capabilities.

## [0.11.2] — 2026-04-26

Dev-experience patch release — make `LOCAL_DEV_MODE` realistic enough to actually exercise group-aware code paths on `localhost`, and consolidate scattered dev-onboarding instructions into a single `docs/local-development.md`.

### Added

- **`LOCAL_DEV_GROUPS` env var** mocks `session.google_groups` for the auto-logged-in dev user when `LOCAL_DEV_MODE=1`. JSON array matching the production shape (`[{"id":"…","name":"…"}]`) so group-aware UI and access-control code paths can be exercised on `localhost` without a Google OAuth round-trip. Honored only under `LOCAL_DEV_MODE=1`. The startup banner reports the parsed group IDs (or warns loudly when the value is set but malformed), so a typo gets surfaced at boot rather than silently on the first authenticated request. Session injection mirrors the production OAuth callback's "always-write" semantics — including clearing stale groups when the operator unsets `LOCAL_DEV_GROUPS` mid-session. See `docs/auth-groups.md` → *Local-dev mock*.
- **`make local-dev` now seeds two default mocked groups** (`Local Dev Engineers` + `Local Dev Admins` on `example.com`) via `scripts/run-local-dev.sh`, so first-boot `/profile` is non-empty out of the box. Override with `LOCAL_DEV_GROUPS='[…]' make local-dev`; disable with `LOCAL_DEV_GROUPS= make local-dev`.
- **`docs/local-development.md`** — single onboarding doc for working on Agnes locally: TL;DR, what `LOCAL_DEV_MODE` actually bypasses, group mocking, what isn't mocked, and the security-rails reminder that dev mode must never reach a production deploy.
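
The `LOCAL_DEV_GROUPS` handling described in this list could look roughly like the sketch below. The JSON shape (`[{"id": "…", "name": "…"}]`) is taken from the entry; the function name `parse_local_dev_groups`, the logger name, and the choice to fall back to an empty list on malformed input are assumptions, not the shipped behaviour.

```python
import json
import logging

logger = logging.getLogger("local_dev")


def parse_local_dev_groups(raw):
    """Parse LOCAL_DEV_GROUPS; return [] (with a warning) when it is malformed."""
    if not raw or not raw.strip():
        return []  # unset or empty: no mocked groups
    try:
        groups = json.loads(raw)
    except json.JSONDecodeError:
        logger.warning("LOCAL_DEV_GROUPS is set but is not valid JSON; ignoring it")
        return []
    well_formed = isinstance(groups, list) and all(
        isinstance(g, dict)
        and isinstance(g.get("id"), str)
        and isinstance(g.get("name"), str)
        for g in groups
    )
    if not well_formed:
        logger.warning("LOCAL_DEV_GROUPS must be a JSON array of {id, name} objects")
        return []
    return groups
```

Parsing once at startup and logging the result matches the stated goal of surfacing a typo at boot rather than on the first authenticated request.
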

### Internal

- Fix nightly `docker-e2e` CI failures: refresh two stale assertions that had drifted from the live API. `tests/test_docker_full.py::test_app_returns_html_on_root` now expects the auth-aware `302 → /login` (root has redirected since the auth middleware landed); `tests/test_e2e_docker.py::TestDockerHealth::test_health_has_duckdb` now reads `services["duckdb_state"]` (current health-payload shape, already validated by `tests/test_api.py`). No application behavior change — these only ran in the scheduled nightly job, so the drift went unnoticed for several PRs.

## [0.11.1] — 2026-04-26

Patch release — hotfix the missed Caddy env passthrough that should have shipped with 0.11.0, plus codify changelog discipline so this kind of drift gets caught at PR review time next time.

### Fixed

- `docker-compose.yml` caddy service now passes `CADDY_TLS` through to the container (`- CADDY_TLS` bare-form passthrough). Without it the `Caddyfile` `{$CADDY_TLS:default}` substitution always falls back to cert-file mode regardless of what the operator wrote into `.env`, and Caddy crash-loops on Let's Encrypt / internal-CA deployments. Should have shipped with #52; first attempt was #55, accidentally closed before merging.

### Internal

- `CLAUDE.md` — non-negotiable changelog discipline: every PR touching user-visible behavior must update `CHANGELOG.md` under `## [Unreleased]` in the same PR.

## [0.11.0] — 2026-04-26

First tagged semver release. The `version = "2.x"` strings that appeared in earlier `pyproject.toml` snapshots were arbitrary placeholders from the initial scaffold and never reflected actual API maturity — resetting to pre-1.0 to signal that things may still shift.

### Added — Auth

- **Google Workspace groups on `/profile`.** OAuth callback fetches the signed-in user's group memberships via Cloud Identity (`searchTransitiveGroups` with the `security` label — see `docs/auth-groups.md` for the GCP setup checklist and the `security`-vs-`discussion_forum` gotcha). Profile link added to the user dropdown.
- **Password reset + invite flows** for web and admin (`/auth/password/reset`, `/admin/users/invite`).
- **Personal access tokens (PAT)** with separate `:typ=pat` JWT claim, per-token revoke, last-used IP tracking, "My tokens" + admin "All tokens" UI.
- **Email magic-link provider** (itsdangerous-signed token).
- **Optional `SEED_ADMIN_PASSWORD`** to pre-hash the seed admin (dev convenience).

### Added — Deploy

- **`keboola-deploy.yml` workflow.** Tag-triggered alternative to `release.yml` for shared dev VMs that want explicit "deploy when I tag" semantics. Publishes immutable `:keboola-deploy-<tag>` + floating `:keboola-deploy-latest` alias.
- **Caddy + Let's Encrypt + corporate-CA TLS.** `Caddyfile` parametrized via `$CADDY_TLS` env var so a single file serves three regimes: cert-file (corp PKI), Let's Encrypt auto-issue, Caddy-internal-CA. URL-driven cert rotation with self-signed fallback (`scripts/grpn/agnes-tls-rotate.sh`). `docker-compose.tls.yml` overlay closes host `:8000` when Caddy fronts.
- **`dev_instances` schema in `customer-instance` Terraform module** gains optional `tls_mode` + `domain` (mirrors `prod_instance`). `infra-v1.6.0` tag.
- **Optional Google OAuth credentials from Secret Manager.** Module reads `google-oauth-client-{id,secret}` at boot if present; graceful fallback so non-Google deployments aren't affected.
- **`LOCAL_DEV_MODE` + `make local-dev-up` / `local-dev-down`** for one-keystroke local stack with magic-link auth pre-wired.
- **Per-developer `dev-<prefix>-latest` GHCR alias** for branches matching `<prefix>/<branch>` — push-to-deploy on personal dev VMs.
- **`/setup` web wizard** for first-time instance setup, plus headless `POST /api/admin/configure` and `POST /api/admin/discover-and-register`.
- **Smoke-test job in CI** (Docker-in-CI after every release) + `scripts/smoke-test.sh` for post-deploy verification.

### Added — CLI

- **Wheel distribution** + auto-update check on startup.
- `--version` flag, `--dry-run` + `X/N` progress on `da sync`, durable sync (atomic writes + manifest hash + retry on transient errors).
- gzip on JSON/HTML responses (server-side).
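
The "durable sync" entry above (atomic writes + manifest hash + retry on transient errors) could look roughly like the sketch below. This is an illustration only, not the actual `da sync` code: `atomic_write`, `with_retry`, the `.sync-` temp prefix, and the choice of SHA-256 and exponential backoff are all assumptions.

```python
import hashlib
import os
import tempfile
import time


def atomic_write(path: str, data: bytes) -> str:
    """Write data to path atomically and return its SHA-256 hex digest.

    Hypothetical helper: the digest is what a manifest could record.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so os.replace stays
    # on one filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".sync-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a partial temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return hashlib.sha256(data).hexdigest()


def with_retry(fetch, attempts: int = 3, base_delay: float = 0.5):
    """Call fetch(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

A reader of the target file sees either the old content or the new content, never a half-written file, and the returned digest can be compared against a manifest entry after download.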

### Added — Data

- **Remote query engine.** Two-phase BigQuery + DuckDB engine for tables too large to sync locally (`--register-bq` flag).
- **Business metrics.** Standardized `metric_definitions` table in DuckDB with starter pack importer (`da metrics import`).
- **`/api/health`** returns `version`, `channel`, `commit_sha`, `image_tag`, `schema_version`.
- **Custom connector mount support** (`connectors/custom/`).
- **OpenAPI snapshot test** for breaking-change detection.

### Added — Docs / tooling

- `docs/auth-groups.md`, `docs/DEPLOYMENT.md`, `docs/HACKATHON.md`, `docs/ONBOARDING.md` runbooks.
- `scripts/debug/probe_google_groups.py` — stdlib-only probe for diagnosing Cloud Identity API issues without a deploy cycle.
- Schema migration safety tests (idempotency, data preservation, snapshot).
- Pre-migration snapshot of `system.duckdb` before schema upgrades.
- Auto-generated JWT and session secrets with file persistence (`/data/state/.jwt_secret`).
- Startup banner logging version, channel, and schema version.
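
For the auto-generated secrets with file persistence listed above, a minimal sketch of the load-or-create pattern follows. The function name `load_or_create_secret`, the 32-byte length, and the `0o600` mode are assumptions for illustration; only the `/data/state/.jwt_secret` location comes from the entry itself.

```python
import os
import secrets
from pathlib import Path


def load_or_create_secret(path: Path, nbytes: int = 32) -> str:
    """Return the secret stored at path, generating it on first use.

    Hypothetical helper; the real startup code may differ.
    """
    try:
        existing = path.read_text(encoding="utf-8").strip()
        if existing:
            return existing
    except FileNotFoundError:
        pass
    path.parent.mkdir(parents=True, exist_ok=True)
    secret = secrets.token_urlsafe(nbytes)
    # Create the file owner-readable only, so the secret is not exposed
    # to other local users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(secret)
    return secret
```

Persisting the value means tokens and sessions signed before a restart stay valid afterwards, which would not hold if a fresh secret were generated on every boot.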

### Changed

- **BREAKING (deployment)** — Caddy compose profile renamed `production` → `tls`. Existing `docker compose --profile production up -d` invocations need to switch.
- **BREAKING (deployment)** — Default `Caddyfile` mode is now cert-file (`tls /certs/fullchain.pem /certs/privkey.pem`); for the previous Let's Encrypt auto-issue behaviour set `CADDY_TLS=tls <ops-email>` in `.env`. See `docs/auth-groups.md` and `Caddyfile` inline docs.
- Schema migration v5→v6→v7: adds `users.active`, `personal_access_tokens` table, `personal_access_tokens.last_used_ip`. Auto-applied at boot.
- Image-level `AGNES_VERSION` now sourced from `pyproject.toml` at build time (no more drift between `da --version` and the package metadata).
- **Vendor-agnostic OSS rule** codified in `CLAUDE.md` — customer-specific names, hostnames, project IDs belong in consumer infra repos, not in this OSS distribution.

### Fixed — Security

- Open-redirect guard for backslash in `safe_next_path`.
- `SessionMiddleware` now sets `max_age=3600` and `https_only` (previously the session cookie had no expiry beyond the browser session and was accepted over plain HTTP).
- Timezone-aware datetimes in Keboola metadata cache.
- Atomic magic-link token consumption (closes double-use race under concurrent clicks).
- Bootstrap backdoor closed when passwordless seed admin exists.
- urllib3 1.26→2.6.3 (resolves 4 Dependabot security alerts).
- argon2-cffi adopted for password hashing.
- See [docs/security-audit-2026-04.md](docs/security-audit-2026-04.md) for the full audit (renamed from `docs/padak-security.md` in #94).
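
To illustrate the backslash open-redirect guard listed above: browsers normalise `\` to `/` in URLs, so a `next` value such as `/\evil.example` behaves like the protocol-relative `//evil.example`. The sketch below shows one way such a check can be written; the signature and the exact set of rejected inputs are assumptions and may not match the project's real `safe_next_path`.

```python
from urllib.parse import urlsplit


def safe_next_path(candidate, default="/"):
    """Return candidate only if it is a same-site absolute path.

    Illustrative re-implementation; not the project's actual function.
    """
    if not candidate or not candidate.startswith("/"):
        return default
    # "//host" is protocol-relative, and browsers treat "\" like "/",
    # so "/\host" would leave the site as well.
    if candidate.startswith("//") or "\\" in candidate:
        return default
    # Control characters can be stripped by URL parsers and hide a "//".
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in candidate):
        return default
    parts = urlsplit(candidate)
    if parts.scheme or parts.netloc:
        return default
    return candidate
```

Falling back to a fixed default rather than raising keeps the login flow working when a tampered `next` parameter arrives.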

### Fixed — Other

- `uvicorn --proxy-headers --forwarded-allow-ips='*'` so OAuth callbacks resolve to https when behind a TLS terminator.
- `scripts/grpn/agnes-tls-rotate.sh` hardened: `--max-redirs 0` + `--proto '=https'` on cert fetch, post-fetch PEM validation (rejects HTML error pages from corp portals), `ulimit -c 0` to suppress coredumps that could leak the unencrypted privkey, empty-array-safe `${arr[@]+"${arr[@]}"}` expansion (the idiom avoids an unbound-variable error under `set -u` on older bash).
- `scripts/tls-fetch.sh` — generic URL fetcher (`sm://`, `gs://`, `https://`, `file://`) with redirect refusal + PEM validation.
- `kbcstorage` moved to optional dep — unblocks urllib3 security updates; primary Keboola path now uses the DuckDB Keboola extension.
- Dependencies consolidated into `pyproject.toml` (no more `requirements.txt`).
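
The post-fetch PEM validation mentioned for the TLS scripts above is implemented in shell. The same idea is sketched here in Python for illustration only: `looks_like_pem` and its regex are assumptions, and the check is purely structural (armor lines plus a valid base64 body), so it does not parse the certificate itself.

```python
import base64
import re

# One PEM block: matching BEGIN/END labels around a body.
_PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\s+(.+?)\s+-----END \1-----",
    re.DOTALL,
)


def looks_like_pem(text: str, expected_label: str = "CERTIFICATE") -> bool:
    """Return True if text contains at least one well-formed PEM block."""
    for match in _PEM_BLOCK.finditer(text):
        label, body = match.group(1), match.group(2)
        if label != expected_label:
            continue
        try:
            # validate=True rejects any non-base64 character.
            decoded = base64.b64decode("".join(body.split()), validate=True)
        except ValueError:
            continue
        if decoded:
            return True
    return False
```

A check like this is enough to reject an HTML login or error page returned with a `200` status, which is the failure mode the changelog entry describes.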

### Internal

- Test suite expanded to 1357+ tests (4 layers: unit, integration, web smoke, journey).

[0.11.3]: https://github.com/keboola/agnes-the-ai-analyst/releases/tag/v0.11.3
[0.11.2]: https://github.com/keboola/agnes-the-ai-analyst/releases/tag/v0.11.2
[0.11.1]: https://github.com/keboola/agnes-the-ai-analyst/releases/tag/v0.11.1
[0.11.0]: https://github.com/keboola/agnes-the-ai-analyst/releases/tag/v0.11.0