AI-Cognitive-Leap/agnes-the-ai-analyst

Fork 0

Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Find a file

Petr Simecek 6c36b26979 release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 ) * feat(auth): internal roles + external→internal group mapping (foundation) Two-layer authorization model: external Cloud Identity groups (org-managed) get mapped onto internal Agnes-defined capabilities (app-managed) via an admin-curated many-to-many table. Per-request permission checks read off the session — no DB hit. Refresh requires re-login. Schema v8 — new tables: - internal_roles (id, key UNIQUE, display_name, description, owner_module, …) — app-defined capabilities like 'context_admin'. Modules self-register at import; the startup hook syncs the registry into this table (idempotent). - group_mappings (id, external_group_id, internal_role_id FK, …) — admin-managed bindings, UNIQUE(external_group_id, internal_role_id). app/auth/role_resolver.py — new module: - register_internal_role(key, display_name, description, owner_module) Module-author entry point. lower_snake_case key, immutable, validated. Same key + same fields = no-op (re-import safe); same key + different fields = ValueError so two modules can't silently overwrite each other. - sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new keys, updates drifted metadata, never deletes (preserves mappings). - resolve_internal_roles(external_groups, conn) — joins group_mappings. Sorted, deduplicated role-key list. Plugged into google_callback + dev-bypass branch in get_current_user. - require_internal_role('key') — FastAPI dependency factory; reads session.internal_roles; 403 with explicit message when missing. Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change in dev-bypass) — same semantics as session.google_groups. No admin UI yet; mappings created via repository directly until follow-up PR ships UI. 21 new tests in tests/test_role_resolver.py: register/list, idempotency, collision detection, key-format validation; sync insert/update/no-delete; resolve empty/single/many-to-many/malformed-input; e2e via LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie inspection. Full sweep: 178/178 passed across auth + db + repo tests. (Two pre-existing test_catalog_export.py failures verified unrelated.) * fix(auth): polish review feedback — first-request dev populate + PAT doc Two follow-ups from a code-reviewer pass on the foundation commit before opening the PR: - Dev-bypass populates session["internal_roles"] on the first request after sign-in, not just when external groups change. The previous guard only resolved when groups_changed=True, which left a hole for the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[], current=None, neither write branch fires, internal_roles stays unset, and require_internal_role then 403s with no roles to check against. The OAuth callback writes session["internal_roles"] unconditionally on sign-in (even []); dev-bypass now matches that semantics. Adds a single-pass populate gated on the key being absent from the session, so subsequent same-state requests still no-op (cheap session lookup, no resolver call). - Document that internal roles are session-scoped and PAT/headless clients will get 403 from any require_internal_role(...) endpoint. Same constraint already applies to session.google_groups (PAT JWTs deliberately don't snapshot group memberships — they could change after issuance with no way to re-sign), but the doc didn't surface this — an operator pointing a CLI at a role-gated endpoint would see 403 with no clue why. New "PAT and headless requests" section spells out the constraint, the rationale, and the three escape valves (use users.role for the gate; route through OAuth; wait for the planned `da admin grant-role` CLI helper). 54 auth tests still pass locally (21 role-resolver + 33 existing auth-provider). * release(0.11.3): cut release for the internal-roles foundation Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's [Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh empty [Unreleased] skeleton appended). Adds the matching [0.11.3] link reference at the bottom of CHANGELOG so the section heading renders as a hyperlink to the GitHub release page once the tag lands. The bullet itself is unchanged content; the rephrasing of "dev-bypass when external groups change" → "dev-bypass — populates on first request and whenever external groups change, mirroring the OAuth callback's always-write semantics" reflects the polish committed in d590579, plus the appended PAT/headless caveat pointing at the doc section that landed in the same polish pass. * fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening Round-2 polish over the internal-roles foundation, addressing Pavel's review on PR #71. No behavior change for the happy path; tightens the safety rails and makes the failure modes self-explanatory. User-visible: - require_internal_role now distinguishes "no session" (Bearer/PAT caller) from "signed in but missing role" and surfaces a PAT-specific 403 detail in the first case ("This endpoint needs an interactive (OAuth) session — Bearer/PAT tokens do not carry session-resolved roles by design"). - docs/internal-roles.md documents deactivate+reactivate as the supported "force re-resolve now" lever for users that can't be made to log out. Internal hardening: - INFO-level audit log on every successful resolve (OAuth callback + dev-bypass) so a wrong-role complaint is debuggable from the log alone. - Startup warning when SESSION_SECRET is shorter than 32 chars, matching the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden state (session.internal_roles, session.google_groups, JWTs). - _clear_registry_for_tests() now refuses to run unless TESTING=1 so a stray import path in production can't drop the registered capabilities. Tests: - 4 new tests in tests/test_role_resolver.py covering: stale-session contract after a mid-session mapping revoke (pin the documented limitation), PAT 403 detail wording, OAuth pipeline data flow from external groups to internal_roles, and the dev-bypass empty-list fallback when the resolver raises. CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal). CLAUDE.md schema doc bumped from v7 to v8. --------- Co-authored-by: Claude <noreply@anthropic.com>		2026-04-26 23:49:10 +02:00
.github/workflows	feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52 )	2026-04-25 23:19:00 +02:00
app	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
cli	release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43 )	2026-04-22 21:18:18 +02:00
config	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51 )	2026-04-25 19:51:25 +00:00
connectors	fix: strip HTML from table and column descriptions in OpenMetadata enricher	2026-04-09 18:42:37 +02:00
dev_docs	docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture	2026-04-09 18:44:25 +02:00
docs	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
infra	feat(auth): Google Workspace groups on /profile + tag-triggered Keboola deploy workflow (#56 )	2026-04-26 00:56:44 +02:00
scripts	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70 )	2026-04-26 16:48:55 +02:00
services	fix: make bot.py FileHandler resilient to missing log directory	2026-04-13 13:28:59 +02:00
src	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
tests	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
.dockerignore	refactor: consolidate deps into pyproject.toml, remove requirements.txt	2026-04-09 13:17:59 +02:00
.gitignore	infra: add bootstrap-gcp.sh for per-customer GCP setup	2026-04-21 16:18:35 +02:00
ARCHITECTURE.md	Update docs for modular architecture (auth/, services/, scripts/)	2026-03-09 13:11:40 +01:00
Caddyfile	feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52 )	2026-04-25 23:19:00 +02:00
CHANGELOG.md	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
CLAUDE.md	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
docker-compose.ci.yml	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
docker-compose.host-mount.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.local-dev.yml	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70 )	2026-04-26 16:48:55 +02:00
docker-compose.override.yml	chore(deploy): trust proxy headers + document HTTPS env vars (#48 )	2026-04-24 08:52:53 +02:00
docker-compose.prod.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.test.yml	chore(deploy): trust proxy headers + document HTTPS env vars (#48 )	2026-04-24 08:52:53 +02:00
docker-compose.tls.yml	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51 )	2026-04-25 19:51:25 +00:00
docker-compose.yml	fix(deploy): pass CADDY_TLS through to caddy container (#55 )	2026-04-26 01:46:42 +02:00
Dockerfile	chore(deploy): trust proxy headers + document HTTPS env vars (#48 )	2026-04-24 08:52:53 +02:00
LICENSE	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
Makefile	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70 )	2026-04-26 16:48:55 +02:00
pyproject.toml	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 )	2026-04-26 23:49:10 +02:00
pytest.ini	test: add shared test infrastructure (fixtures, factories, assertions, mocks)	2026-04-12 11:05:35 +02:00
README.md	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51 )	2026-04-25 19:51:25 +00:00
uv.lock	chore(deps): bump python-multipart from 0.0.24 to 0.0.26	2026-04-21 13:26:19 +00:00

README.md

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)

Supported Data Sources

Source	Mode	Description
Keboola	Batch pull	DuckDB Keboola extension downloads tables to Parquet on a schedule
BigQuery	Remote attach	DuckDB BQ extension; queries execute in BigQuery, no local download
Jira	Real-time push	Webhook receiver updates Parquet files incrementally

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/grpn/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File	Purpose
`config/instance.yaml`	Instance-specific settings: branding, data source type, auth provider, Google domain
`.env`	Secrets and environment variables — never committed
`system.duckdb` `table_registry` table	Table definitions managed via `POST /api/admin/tables/{id}` or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Hackathon TL;DR — condensed deploy + dev playbooks (for both humans and AI agents)
Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
Configuration Reference — instance.yaml, env vars, per-instance options
Architecture — orchestrator, extractors, DB layout
Quickstart — local development

Contributing

Fork the repository and create a feature branch.
Run pytest tests/ -v to verify all tests pass before opening a pull request.
Keep commits focused and messages concise.
Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.