* feat(auth): internal roles + external→internal group mapping (foundation)
Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.
Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
— app-defined capabilities like 'context_admin'. Modules self-register at
import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
— admin-managed bindings, UNIQUE(external_group_id, internal_role_id).
app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
Module-author entry point. lower_snake_case key, immutable, validated.
Same key + same fields = no-op (re-import safe); same key + different
fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
Sorted, deduplicated role-key list. Plugged into google_callback +
dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
session.internal_roles; 403 with explicit message when missing.
Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.
21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)
* fix(auth): polish review feedback — first-request dev populate + PAT doc
Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:
- Dev-bypass populates session["internal_roles"] on the first request
after sign-in, not just when external groups change. The previous
guard only resolved when groups_changed=True, which left a hole for
the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
current=None, neither write branch fires, internal_roles stays
unset, and require_internal_role then 403s with no roles to check
against. The OAuth callback writes session["internal_roles"]
unconditionally on sign-in (even []); dev-bypass now matches that
semantics. Adds a single-pass populate gated on the key being
absent from the session, so subsequent same-state requests still
no-op (cheap session lookup, no resolver call).
- Document that internal roles are session-scoped and PAT/headless
clients will get 403 from any require_internal_role(...) endpoint.
Same constraint already applies to session.google_groups (PAT JWTs
deliberately don't snapshot group memberships — they could change
after issuance with no way to re-sign), but the doc didn't surface
this — an operator pointing a CLI at a role-gated endpoint would
see 403 with no clue why. New "PAT and headless requests" section
spells out the constraint, the rationale, and the three escape
valves (use users.role for the gate; route through OAuth; wait for
the planned `da admin grant-role` CLI helper).
54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).
* release(0.11.3): cut release for the internal-roles foundation
Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.
The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.
* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening
Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.
User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
from "signed in but missing role" and surfaces a PAT-specific 403 detail
in the first case ("This endpoint needs an interactive (OAuth) session
— Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
"force re-resolve now" lever for users that can't be made to log out.
Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
stray import path in production can't drop the registered capabilities.
Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
contract after a mid-session mapping revoke (pin the documented
limitation), PAT 403 detail wording, OAuth pipeline data flow from
external groups to internal_roles, and the dev-bypass empty-list
fallback when the resolver raises.
CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.
---------
Co-authored-by: Claude <noreply@anthropic.com>
|
||
|---|---|---|
| .github/workflows | ||
| app | ||
| cli | ||
| config | ||
| connectors | ||
| dev_docs | ||
| docs | ||
| infra | ||
| scripts | ||
| services | ||
| src | ||
| tests | ||
| .dockerignore | ||
| .gitignore | ||
| ARCHITECTURE.md | ||
| Caddyfile | ||
| CHANGELOG.md | ||
| CLAUDE.md | ||
| docker-compose.ci.yml | ||
| docker-compose.host-mount.yml | ||
| docker-compose.local-dev.yml | ||
| docker-compose.override.yml | ||
| docker-compose.prod.yml | ||
| docker-compose.test.yml | ||
| docker-compose.tls.yml | ||
| docker-compose.yml | ||
| Dockerfile | ||
| LICENSE | ||
| Makefile | ||
| pyproject.toml | ||
| pytest.ini | ||
| README.md | ||
| uv.lock | ||
Agnes — AI Data Analyst
Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.
Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.
Architecture: extract.duckdb Contract
Every connector produces the same output structure:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Keboola │ │ BigQuery │ │ Jira │
│ extractor │ │ extractor │ │ webhooks │
│ (DuckDB ext) │ │ (remote BQ) │ │ (incremental)│
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
extract.duckdb extract.duckdb extract.duckdb
+ data/*.parquet (views → BQ) + data/*.parquet
│ │ │
└─────────────────┼─────────────────┘
▼
SyncOrchestrator.rebuild()
ATTACH → master views in analytics.duckdb
│
┌──────────┼──────────┐
▼ ▼ ▼
FastAPI CLI
(serve) (da sync)
Supported Data Sources
| Source | Mode | Description |
|---|---|---|
| Keboola | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule |
| BigQuery | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download |
| Jira | Real-time push | Webhook receiver updates Parquet files incrementally |
Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
Quick Start with Docker
# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst
# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment
# Start the app and scheduler
docker compose up
# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up
# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
--profile tls up -d
Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/grpn/agnes-tls-rotate.sh. Trigger a manual sync:
curl -X POST http://localhost:8000/api/sync/trigger
Development Setup
# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate
# Install dependencies
uv pip install ".[dev]"
# Run FastAPI locally with hot reload
uvicorn app.main:app --reload
# Run the test suite
pytest tests/ -v
Project Structure
├── src/ # Core engine
│ ├── db.py # DuckDB schema (system.duckdb, analytics.duckdb)
│ ├── orchestrator.py # SyncOrchestrator — ATTACHes extract.duckdb files
│ ├── repositories/ # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│ ├── profiler.py # Data profiling
│ └── catalog_export.py # OpenMetadata catalog export
├── app/ # FastAPI application
│ ├── main.py # App setup, router registration
│ ├── api/ # REST API (sync, data, catalog, admin, auth)
│ ├── auth/ # Auth providers (Google OAuth, email magic link, desktop JWT)
│ └── web/ # HTML dashboard routes
├── connectors/ # Data source connectors (extract.duckdb contract)
│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
├── cli/ # CLI tool (`da sync`, `da query`, `da admin`)
├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/ # Utility + migration scripts
├── config/ # Configuration templates (instance.yaml.example)
├── docs/ # Documentation + metric YAML definitions
└── tests/ # Test suite (633 tests)
Configuration
| File | Purpose |
|---|---|
config/instance.yaml |
Instance-specific settings: branding, data source type, auth provider, Google domain |
.env |
Secrets and environment variables — never committed |
system.duckdb table_registry table |
Table definitions managed via POST /api/admin/tables/{id} or the web UI |
Copy the example to get started:
cp config/instance.yaml.example config/instance.yaml
See config/instance.yaml.example for all available options.
Documentation
- Hackathon TL;DR — condensed deploy + dev playbooks (for both humans and AI agents)
- Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
- Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
- Configuration Reference —
instance.yaml, env vars, per-instance options - Architecture — orchestrator, extractors, DB layout
- Quickstart — local development
Contributing
- Fork the repository and create a feature branch.
- Run
pytest tests/ -vto verify all tests pass before opening a pull request. - Keep commits focused and messages concise.
- Open a pull request against
mainwith a clear description of the change.
For bugs and feature requests, open a GitHub issue.
License
This project is licensed under the MIT License.