agnes-the-ai-analyst/docs/internal-roles.md
Petr Simecek 6c36b26979
release(0.11.3): internal roles + external→internal group mapping (foundation) (#71)
* feat(auth): internal roles + external→internal group mapping (foundation)

Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.

Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
  — app-defined capabilities like 'context_admin'. Modules self-register at
  import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
  — admin-managed bindings, UNIQUE(external_group_id, internal_role_id).

app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
  Module-author entry point. lower_snake_case key, immutable, validated.
  Same key + same fields = no-op (re-import safe); same key + different
  fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
  keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
  Sorted, deduplicated role-key list. Plugged into google_callback +
  dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
  session.internal_roles; 403 with explicit message when missing.

Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.

21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)

* fix(auth): polish review feedback — first-request dev populate + PAT doc

Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:

- Dev-bypass populates session["internal_roles"] on the first request
  after sign-in, not just when external groups change. The previous
  guard only resolved when groups_changed=True, which left a hole for
  the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
  current=None, neither write branch fires, internal_roles stays
  unset, and require_internal_role then 403s with no roles to check
  against. The OAuth callback writes session["internal_roles"]
  unconditionally on sign-in (even []); dev-bypass now matches that
  semantics. Adds a single-pass populate gated on the key being
  absent from the session, so subsequent same-state requests still
  no-op (cheap session lookup, no resolver call).

- Document that internal roles are session-scoped and PAT/headless
  clients will get 403 from any require_internal_role(...) endpoint.
  Same constraint already applies to session.google_groups (PAT JWTs
  deliberately don't snapshot group memberships — they could change
  after issuance with no way to re-sign), but the doc didn't surface
  this — an operator pointing a CLI at a role-gated endpoint would
  see 403 with no clue why. New "PAT and headless requests" section
  spells out the constraint, the rationale, and the three escape
  valves (use users.role for the gate; route through OAuth; wait for
  the planned `da admin grant-role` CLI helper).

54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).

* release(0.11.3): cut release for the internal-roles foundation

Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.

The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.

* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening

Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.

User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
  from "signed in but missing role" and surfaces a PAT-specific 403 detail
  in the first case ("This endpoint needs an interactive (OAuth) session
  — Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
  "force re-resolve now" lever for users that can't be made to log out.

Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
  dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
  the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
  state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
  stray import path in production can't drop the registered capabilities.

Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
  contract after a mid-session mapping revoke (pin the documented
  limitation), PAT 403 detail wording, OAuth pipeline data flow from
  external groups to internal_roles, and the dev-bypass empty-list
  fallback when the resolver raises.

CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 23:49:10 +02:00


Internal roles + external group mapping

Two-layer authorization model for Agnes:

  • External groups — Cloud Identity / Google Workspace groups, pulled at sign-in into session.google_groups. Owned by the organization; Agnes only reads them. See docs/auth-groups.md.
  • Internal roles — Agnes-defined capabilities (e.g. context_admin, agent_operator, dataset_finance_reader). Owned by Agnes. Registered in code by module authors, persisted in the internal_roles table.
  • Group mappings — admin-managed many-to-many table binding external group IDs to internal role keys. The resolver joins this table at sign-in and writes the resolved role keys into session["internal_roles"].

Permission checks read off the session — no DB hit per request.
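The join performed at sign-in can be sketched as follows. This is a minimal illustration rather than the project's resolver: it uses the standard-library sqlite3 module as a stand-in for the system database, trims both tables to the columns the join needs, and takes plain group-ID strings. The function name resolve_role_keys and the sample rows are assumptions made for the example.

```python
import sqlite3


def resolve_role_keys(external_group_ids, conn):
    """Return the sorted, de-duplicated role keys mapped to the given groups."""
    # Ignore malformed entries instead of failing the sign-in.
    ids = [g for g in external_group_ids if isinstance(g, str) and g]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"""
        SELECT DISTINCT r.key
        FROM group_mappings AS m
        JOIN internal_roles AS r ON r.id = m.internal_role_id
        WHERE m.external_group_id IN ({placeholders})
        ORDER BY r.key
        """,
        ids,
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical sample data: two roles, three mappings across two groups.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE internal_roles (id TEXT PRIMARY KEY, key TEXT UNIQUE NOT NULL);
    CREATE TABLE group_mappings (
        id TEXT PRIMARY KEY,
        external_group_id TEXT NOT NULL,
        internal_role_id TEXT NOT NULL REFERENCES internal_roles(id),
        UNIQUE (external_group_id, internal_role_id)
    );
    INSERT INTO internal_roles VALUES ('r1', 'context_admin'), ('r2', 'agent_operator');
    INSERT INTO group_mappings VALUES
        ('m1', 'engineering@example.com', 'r1'),
        ('m2', 'engineering@example.com', 'r2'),
        ('m3', 'ops@example.com', 'r2');
    """
)

# At sign-in the resolved keys are written into the session once.
session = {
    "internal_roles": resolve_role_keys(
        ["engineering@example.com", "ops@example.com"], conn
    )
}
```

Both groups map to agent_operator, yet the key appears once in the result, which is the many-to-many behaviour described above.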

When to use which

| You want to gate on … | Use … |
| --- | --- |
| "Is this user signed in at all?" | Depends(get_current_user) |
| "Coarse global role" (admin / analyst / viewer) | Depends(require_admin) / Depends(require_role(Role.ANALYST)) (users.role column) |
| "Specific module capability" | Depends(require_internal_role("context_admin")) (this doc) |

users.role stays the coarse gate for "may enter the building"; internal roles are the fine-grained per-module capabilities layered on top.

Module-author workflow (registering a role)

In your module's import path (e.g. services/context_engineering/__init__.py):

from app.auth.role_resolver import register_internal_role

register_internal_role(
    "context_admin",
    display_name="Context Engineering Admin",
    description="Manages prompt templates and retrieval settings.",
    owner_module="context_engineering",
)

Constraints on key:

  • lower_snake_case, starts with a letter, ≤ 64 chars (^[a-z][a-z0-9_]{0,63}$)
  • immutable — referenced from code; renaming would silently break every existing mapping. Pick carefully.
  • registering the same key twice with the same fields is a no-op (re-import safe); registering with different fields raises ValueError. If two modules collide, one of them must rename.

register_internal_role only populates the in-process registry. The startup hook in app/main.py calls sync_registered_roles_to_db(conn) to reconcile the registry into the internal_roles table:

  • Inserts keys that don't exist yet
  • Updates display_name / description / owner_module when they've drifted from code
  • Never deletes — a role disappearing from code (module unloaded) keeps its DB row and any mappings until an admin explicitly removes it
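The reconciliation rules above (insert, update on drift, never delete) might look roughly like this. The sketch assumes sqlite3 as a stand-in database, a plain dict as the registry, and the column list from the schema description; sync_roles is a hypothetical name, not the project's sync_registered_roles_to_db.

```python
import sqlite3
import uuid


def sync_roles(registry, conn):
    """Insert missing keys, update drifted metadata, never delete rows."""
    for key, spec in registry.items():
        wanted = (spec["display_name"], spec["description"], spec["owner_module"])
        row = conn.execute(
            "SELECT display_name, description, owner_module "
            "FROM internal_roles WHERE key = ?",
            (key,),
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO internal_roles "
                "(id, key, display_name, description, owner_module) "
                "VALUES (?, ?, ?, ?, ?)",
                (str(uuid.uuid4()), key, *wanted),
            )
        elif tuple(row) != wanted:
            conn.execute(
                "UPDATE internal_roles "
                "SET display_name = ?, description = ?, owner_module = ? "
                "WHERE key = ?",
                (*wanted, key),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE internal_roles ("
    "id TEXT PRIMARY KEY, key TEXT UNIQUE NOT NULL, "
    "display_name TEXT, description TEXT, owner_module TEXT)"
)
# Pre-existing rows: one with stale metadata, one whose module is no longer loaded.
conn.executemany(
    "INSERT INTO internal_roles VALUES (?, ?, ?, ?, ?)",
    [
        ("r1", "context_admin", "Old name", "Old text", "context_engineering"),
        ("r2", "legacy_role", "Legacy", "No longer registered", "removed_module"),
    ],
)

registry = {
    "context_admin": {
        "display_name": "Context Engineering Admin",
        "description": "Manages prompt templates and retrieval settings.",
        "owner_module": "context_engineering",
    },
    "agent_operator": {
        "display_name": "Agent Operator",
        "description": "Hypothetical second role for the example.",
        "owner_module": "agents",
    },
}
sync_roles(registry, conn)
sync_roles(registry, conn)  # running again changes nothing

rows = conn.execute(
    "SELECT key, display_name FROM internal_roles ORDER BY key"
).fetchall()
```

The update path keeps the existing row id, so any mapping that references the role survives a metadata change, and legacy_role stays in place because nothing in the function issues a DELETE.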

Admin workflow (mapping external → internal)

Until the management UI ships, mappings are created through the repository layer directly:

import uuid

from src.db import get_system_db
from src.repositories.group_mappings import GroupMappingsRepository
from src.repositories.internal_roles import InternalRolesRepository

conn = get_system_db()
try:
    # Look up the role row by its immutable key to obtain its primary key.
    role = InternalRolesRepository(conn).get_by_key("context_admin")
    # Bind the external group to the role; each (group, role) pair is unique.
    GroupMappingsRepository(conn).create(
        id=str(uuid.uuid4()),
        external_group_id="engineering@example.com",  # Cloud Identity group ID
        internal_role_id=role["id"],
        assigned_by="admin@example.com",
    )
finally:
    # Release the connection even if the lookup or the insert fails.
    conn.close()

After the mapping is created, affected users must sign out and back in for the resolver to pick it up. These are the same refresh semantics as session.google_groups.

If you can't get the user to log out (long-lived session, automated client), Admin → Users → deactivate then reactivate invalidates the existing session and forces a fresh sign-in on the next request. That is the supported "force re-resolve now" lever — there's no per-user role-cache invalidation API today.

Permission check (callsite)

from fastapi import Depends
from app.auth.role_resolver import require_internal_role

@router.post("/context/templates")
async def update_template(
    body: TemplateUpdate,
    user: dict = Depends(require_internal_role("context_admin")),
):
    ...

The dependency reads session["internal_roles"] (populated at sign-in); a missing role returns 403 with the detail "Requires internal role 'context_admin'". Unauthenticated requests still get 401 from the upstream get_current_user dependency.
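The factory pattern behind this dependency can be shown without FastAPI. The sketch below is framework-free and makes several assumptions: HTTPError stands in for the framework's HTTP exception, the session is a plain dict, and require_role is a hypothetical name. Only the 403 detail wording is taken from the text.

```python
class HTTPError(Exception):
    """Framework-free stand-in for an HTTP error response."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def require_role(key):
    """Build a checker that only passes sessions holding `key`."""

    def check(session):
        # Roles were resolved at sign-in; this is a pure in-memory lookup.
        roles = session.get("internal_roles") or []
        if key not in roles:
            raise HTTPError(403, f"Requires internal role '{key}'")
        return session

    return check


check_context_admin = require_role("context_admin")

# A session that holds the role passes through unchanged.
allowed = check_context_admin(
    {"internal_roles": ["agent_operator", "context_admin"]}
)

# A session without the role is rejected with a 403.
try:
    check_context_admin({"internal_roles": ["agent_operator"]})
    denied = None
except HTTPError as exc:
    denied = (exc.status_code, exc.detail)
```

Building one checker per role key keeps each call site declarative: the endpoint names the capability it needs and nothing else.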

Local development

LOCAL_DEV_GROUPS mocks session.google_groups (see the "Local-dev mock" section of docs/auth-groups.md). The dev-bypass branch in app/auth/dependencies.py runs the resolver on the first request of a session and again every time the mocked groups change, so editing LOCAL_DEV_GROUPS and hitting any auth-required endpoint refreshes session["internal_roles"] on the next request, with no need to bounce the app.

Typical dev setup:

export LOCAL_DEV_MODE=1
export LOCAL_DEV_GROUPS='[{"id":"engineering@example.com","name":"Engineering"}]'
# Register the role + create the mapping (one-shot script or manual SQL),
# then hit any protected endpoint — dev user now holds context_admin.
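The dev-bypass refresh rule (resolve when the mocked groups change, or when the session has no roles yet) can be sketched as below. This simplifies the real branch in app/auth/dependencies.py: refresh_dev_session and fake_resolve are hypothetical names, the session is a plain dict, and the empty-list fallback on resolver failure is modelled with a broad except.

```python
def refresh_dev_session(session, target_groups, resolve):
    """Update mocked groups and roles in `session`; return True if it resolved."""
    current = session.get("google_groups") or []
    groups_changed = current != target_groups
    if groups_changed:
        session["google_groups"] = target_groups
    if not groups_changed and "internal_roles" in session:
        return False  # same state as the previous request: cheap no-op
    try:
        session["internal_roles"] = resolve(target_groups)
    except Exception:
        # Fall back to "no roles" rather than failing the request.
        session["internal_roles"] = []
    return True


def fake_resolve(groups):
    # Stand-in resolver: the engineering group maps to context_admin.
    ids = {g["id"] for g in groups}
    return ["context_admin"] if "engineering@example.com" in ids else []


engineering = [{"id": "engineering@example.com", "name": "Engineering"}]

session = {}
first = refresh_dev_session(session, engineering, fake_resolve)   # resolves
repeat = refresh_dev_session(session, engineering, fake_resolve)  # no-op
edited = refresh_dev_session(session, [], fake_resolve)           # groups changed

# Explicitly empty LOCAL_DEV_GROUPS on a fresh session: the groups did not
# change, but the roles key is absent, so it is still populated (with []).
empty_session = {}
explicit_empty = refresh_dev_session(empty_session, [], fake_resolve)
```

Gating the second branch on the key being absent means a signed-in dev session always carries internal_roles, even when the list is empty.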

PAT and headless requests

Internal roles are session-scoped only. Personal Access Tokens (PAT) and other Bearer-token clients carry a JWT that proves identity but not a signed session cookie, so session["internal_roles"] is never populated for them. Concretely: any endpoint protected by Depends(require_internal_role(…)) will return 403 for a PAT client even when the corresponding user's external groups would map to that role through a browser sign-in.

This is intentional, not a bug — the same constraint already applies to session.google_groups, and PAT-issued JWTs deliberately don't snapshot that list (the user's group memberships can change after the token was issued without any way to re-sign the token). Two practical implications:

  • Don't gate PAT-callable endpoints with require_internal_role. Use users.role (require_admin / require_role(Role.ANALYST)) for the coarse check, or check the JWT claims directly. Internal roles fit OAuth-flow consumers (the web UI) and the dev bypass.
  • If you need a CLI/script to act with elevated capability, the current escape valves are: (a) issue the PAT to a user whose users.role already covers it, (b) call the endpoint through the OAuth flow from a browser session, or (c) wait for the planned da admin grant-role CLI helper (see Future work) which will store an explicit per-user grant outside the group-mapping flow.
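One way a role check could tell a PAT caller apart from a signed-in user who lacks the role is to look at whether the session carries the roles key at all, since sign-in writes it even when the list is empty. The sketch below is an assumption about the mechanism, not the project's implementation; denial_reason is a hypothetical helper and the first message is placeholder wording.

```python
def denial_reason(session, key):
    """Return None when access is allowed, else a 403 detail string."""
    if "internal_roles" not in session:
        # Bearer/PAT caller: authenticated, but no sign-in populated any roles.
        return (
            "Interactive (OAuth) session required; "
            "Bearer/PAT tokens carry no session-resolved roles"
        )
    if key not in session["internal_roles"]:
        # Signed in through the browser, but no mapping grants this role.
        return f"Requires internal role '{key}'"
    return None


pat_caller = denial_reason({}, "context_admin")
missing_role = denial_reason({"internal_roles": []}, "context_admin")
granted = denial_reason({"internal_roles": ["context_admin"]}, "context_admin")
```

Separating the two messages lets an operator who points a CLI at a role-gated endpoint see that the token type, not the group mapping, is the cause.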

Resolution timing

The resolver runs at sign-in only:

  • Google OAuth callback (app/auth/providers/google.py): after _fetch_google_groups, before issuing the JWT
  • Dev-bypass branch (app/auth/dependencies.py): on the first request of a session and whenever the LOCAL_DEV_GROUPS value changes

Per-request checks read session["internal_roles"] only, with no DB hit. The trade-off: a user with a stale session keeps stale roles until they log out and back in, exactly as with session.google_groups. This is cheaper than a per-request DB lookup and matches the existing mental model.
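The stale-session trade-off can be made concrete with a toy model. Everything here is an assumption for illustration: mappings is an in-memory dict standing in for the group_mappings table, and sign_in and has_role are hypothetical helpers.

```python
# Hypothetical mapping store: external group ID -> set of role keys.
mappings = {"engineering@example.com": {"context_admin"}}


def sign_in(group_ids):
    """Snapshot the roles for these groups into a new session dict."""
    roles = sorted({role for g in group_ids for role in mappings.get(g, ())})
    return {"google_groups": list(group_ids), "internal_roles": roles}


def has_role(session, key):
    # Per-request check: reads the session snapshot, never the mapping store.
    return key in session["internal_roles"]


session = sign_in(["engineering@example.com"])

# An admin revokes the mapping while the user is still signed in.
mappings["engineering@example.com"].discard("context_admin")
stale = has_role(session, "context_admin")  # still granted from the snapshot

# Only a fresh sign-in picks up the revocation.
session = sign_in(["engineering@example.com"])
fresh = has_role(session, "context_admin")
```

Here stale is True and fresh is False: the revocation takes effect at the next sign-in, not at the next request.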

Future work (not in this PR)

  • Admin UI under /admin/role-mapping — list registered roles + their current mappings, add/remove mappings, surface drift between registry and DB.
  • Audit-log entries for mapping create/delete (write into audit_log with action="role_mapping.created" / "role_mapping.deleted", resource=f"mapping:{id}").
  • Optional CLI helper da admin grant-role <user-email> <role-key> for ad-hoc grants without going through external groups.