agnes-the-ai-analyst/CLAUDE.md
Petr Simecek 6c36b26979
release(0.11.3): internal roles + external→internal group mapping (foundation) (#71)
* feat(auth): internal roles + external→internal group mapping (foundation)

Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.

Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
  — app-defined capabilities like 'context_admin'. Modules self-register at
  import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
  — admin-managed bindings, UNIQUE(external_group_id, internal_role_id).

app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
  Module-author entry point. lower_snake_case key, immutable, validated.
  Same key + same fields = no-op (re-import safe); same key + different
  fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
  keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
  Sorted, deduplicated role-key list. Plugged into google_callback +
  dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
  session.internal_roles; 403 with explicit message when missing.

Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.

21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)

* fix(auth): polish review feedback — first-request dev populate + PAT doc

Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:

- Dev-bypass populates session["internal_roles"] on the first request
  after sign-in, not just when external groups change. The previous
  guard only resolved when groups_changed=True, which left a hole for
  the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
  current=None, neither write branch fires, internal_roles stays
  unset, and require_internal_role then 403s with no roles to check
  against. The OAuth callback writes session["internal_roles"]
  unconditionally on sign-in (even []); dev-bypass now matches that
  semantics. Adds a single-pass populate gated on the key being
  absent from the session, so subsequent same-state requests still
  no-op (cheap session lookup, no resolver call).

- Document that internal roles are session-scoped and PAT/headless
  clients will get 403 from any require_internal_role(...) endpoint.
  Same constraint already applies to session.google_groups (PAT JWTs
  deliberately don't snapshot group memberships — they could change
  after issuance with no way to re-sign), but the doc didn't surface
  this — an operator pointing a CLI at a role-gated endpoint would
  see 403 with no clue why. New "PAT and headless requests" section
  spells out the constraint, the rationale, and the three escape
  valves (use users.role for the gate; route through OAuth; wait for
  the planned `da admin grant-role` CLI helper).

54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).

* release(0.11.3): cut release for the internal-roles foundation

Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.

The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.

* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening

Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.

User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
  from "signed in but missing role" and surfaces a PAT-specific 403 detail
  in the first case ("This endpoint needs an interactive (OAuth) session
  — Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
  "force re-resolve now" lever for users that can't be made to log out.

Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
  dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
  the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
  state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
  stray import path in production can't drop the registered capabilities.

Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
  contract after a mid-session mapping revoke (pin the documented
  limitation), PAT 403 detail wording, OAuth pipeline data flow from
  external groups to internal_roles, and the dev-bypass empty-list
  fallback when the resolver raises.

CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 23:49:10 +02:00

298 lines
16 KiB
Markdown

# AI Data Analyst
Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.
## First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
### Step 1: Gather Information
Ask the user for:
1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / bigquery / csv
3. Instance name (e.g., "Acme Data Analyst")
### Step 2: Generate Configuration
1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`
### Step 3: Register Tables
1. Use the FastAPI admin API (`POST /api/admin/tables/{id}`) or webapp UI to register tables
2. Tables are stored in DuckDB `table_registry` with source_type, bucket, source_table, query_mode
3. For migration from old format: `python scripts/migrate_registry_to_duckdb.py`
### Step 4: Docker Deployment
```bash
docker compose up # Start app + scheduler
docker compose --profile full up # Include telegram bot
# HTTPS mode — Caddy + corporate-CA certs at /data/state/certs
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
--profile tls up -d
```
See `docs/DEPLOYMENT.md`**TLS** for cert provisioning + `scripts/grpn/agnes-tls-rotate.sh` (daily refetch from `TLS_FULLCHAIN_URL`, `SIGUSR1` reload on diff, no-op when unchanged). The infra repo's `startup.sh` installs this as a systemd timer automatically.
## Project Structure
```
├── src/ # Core engine
│ ├── db.py # DuckDB schema (system.duckdb, analytics.duckdb)
│ ├── orchestrator.py # SyncOrchestrator — ATTACHes extract.duckdb files
│ ├── repositories/ # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│ ├── profiler.py # Data profiling
│ └── catalog_export.py # OpenMetadata catalog export
├── app/ # FastAPI application
│ ├── main.py # App setup, router registration
│ ├── api/ # REST API (sync, data, catalog, admin, auth)
│ └── web/ # HTML dashboard routes
├── connectors/ # Data source connectors (extract.duckdb contract)
│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
├── cli/ # CLI tool (`da sync`, `da query`, `da admin`)
├── app/auth/ # Authentication (FastAPI-based providers)
├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/ # Legacy deployment infrastructure
├── scripts/ # Utility + migration scripts
├── config/ # Configuration templates (instance.yaml.example)
├── docs/ # Documentation + metric YAML definitions
└── tests/ # Test suite (633 tests)
```
## Architecture: extract.duckdb Contract
Every data source produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
```
### Remote table support (`_remote_attach`)
Extractors with remote/passthrough tables (query_mode='remote') include a `_remote_attach` table
in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:
```sql
CREATE TABLE _remote_attach (
alias VARCHAR, -- DuckDB alias used in views, e.g. 'kbc'
extension VARCHAR, -- Extension name, e.g. 'keboola'
url VARCHAR, -- Connection URL
token_env VARCHAR -- Env-var name holding the auth token (NOT the token itself)
);
```
The orchestrator reads this table, installs/loads the extension, reads the token from the
environment, and ATTACHes the external source. Views referencing `kbc."bucket"."table"` then
resolve correctly. This mechanism is generic — any connector can use it.
The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into master `analytics.duckdb`, and creates views.
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Keboola │ │ BigQuery │ │ Jira │
│ extractor │ │ extractor │ │ webhooks │
│ (DuckDB ext) │ │ (remote BQ) │ │ (incremental)│
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
▼ ▼ ▼
extract.duckdb extract.duckdb extract.duckdb
+ data/*.parquet (views → BQ) + data/*.parquet
│ │ │
└─────────────────┼─────────────────┘
SyncOrchestrator.rebuild()
ATTACH → master views in analytics.duckdb
┌──────────┼──────────┐
▼ ▼ ▼
FastAPI CLI
(serve) (da sync)
```
Three source types:
- **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
- **Real-time push** (Jira): Webhooks update parquets incrementally
## Configuration
Instance-specific config: `config/instance.yaml` (see example).
Environment variables: `.env` (never committed).
Table definitions: DuckDB `table_registry` table in `system.duckdb`.
## Development
```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"
# Run FastAPI locally
uvicorn app.main:app --reload
# Run tests
pytest tests/ -v
# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger
# Docker
docker compose up
```
## Business Metrics
Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:
```bash
da metrics import docs/metrics/
```
### For AI agents analyzing data:
Before computing any business metric, look up the canonical definition:
1. `da metrics list` — find the relevant metric
2. `da metrics show revenue/mrr` — read the SQL and business rules
3. Use the SQL from the metric definition, adapt to the specific question
Never invent metric calculations — always use the canonical definitions.
## Hybrid Queries (BigQuery + Local)
For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:
```bash
da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
--register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
```
The `--register-bq` flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple `--register-bq` flags can be used for multiple BQ sources.
For complex SQL, use stdin mode:
```bash
echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | da query --stdin
```
## Extensibility
### Data Sources (extract.duckdb contract)
New connector = `connectors/<name>/extractor.py` producing `extract.duckdb + data/`.
Must create `_meta` table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode.
Orchestrator ATTACHes it automatically.
### Authentication
Auth providers in `app/auth/` (FastAPI-based):
- **Google**: OAuth via Google (Workspace group memberships pulled at sign-in — see `docs/auth-groups.md` for the GCP setup checklist + the `security` label gotcha)
- **Email**: Email magic link (itsdangerous token)
- **Desktop**: JWT for API
## Release & deploy workflows
Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.
### `release.yml` — auto-build on every push
Runs on **every** push to **every** branch.
- Push to `main``:stable`, `:stable-YYYY.MM.N` (CalVer).
- Push to non-main `<prefix>/<branch>``:dev`, `:dev-YYYY.MM.N`, `:dev-<branch-slug>`, and (when prefix isn't a Git Flow convention) `:dev-<prefix>-latest` alias.
VMs that pin to a floating tag (`:dev`, `:dev-<prefix>-latest`) auto-upgrade within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins, regardless of who).
### `keboola-deploy.yml` — tag-triggered, explicit deploy only
Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
- `:keboola-deploy-<git-tag-suffix>` — immutable, tied to the exact commit
- `:keboola-deploy-latest` — floating alias the consumer pins to
**Operator workflow:**
```bash
git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM
```
Use this when the consumer (e.g. a customer dev VM) needs **deploy-when-I-decide** semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins `image_tag = "keboola-deploy-latest"` on the relevant VM.
### Module versioning
The customer-instance Terraform module under `infra/modules/customer-instance/` is published as `infra-vMAJOR.MINOR.PATCH` git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their `source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y"`.
After merging a module change to `main`:
```bash
git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z
```
### Replacing a VM after a startup-script change
Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `google_compute_instance.vm` so normal `terraform apply` doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address — typical workflow_dispatch input is `recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'`.
## Key Implementation Details
### DuckDB Schema (src/db.py)
- Schema v8 with auto-migration v1→…→v8 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8 adds `internal_roles` + `group_mappings`)
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- `sync_state`, `sync_history`: track extraction progress
- `users`, `dataset_permissions`, `audit_log`: auth + RBAC
- System DB at `{DATA_DIR}/state/system.duckdb`
- Analytics DB at `{DATA_DIR}/analytics/server.duckdb`
### SyncOrchestrator (src/orchestrator.py)
- `rebuild()`: scans extracts dir, ATTACHes all, creates master views, updates sync_state
- `rebuild_source(name)`: single source (used after Jira webhooks)
- Thread-safe via `_rebuild_lock`
### Connector Pattern
- **Keboola**: `connectors/keboola/extractor.py` uses DuckDB Keboola extension, fallback to `client.py`
- **BigQuery**: `connectors/bigquery/extractor.py` uses DuckDB BQ extension (remote-only, no download)
- **Jira**: `connectors/jira/webhook.py``incremental_transform.py``extract_init.py` updates `_meta`
- `connectors/keboola/client.py`: legacy Keboola Storage API wrapper (kept as fallback)
### Config Loading
1. `config/loader.py` loads `instance.yaml`
2. `app/instance_config.py` exposes `get_data_source_type()`, `get_value()`
3. Table config lives in DuckDB `table_registry` (not markdown files)
### Files NOT to modify (stable infrastructure)
- `connectors/jira/file_lock.py` - Advisory file locking
- `connectors/jira/transform.py` - Core Jira transform logic
- `services/ws_gateway/` - WebSocket notification gateway
## Vendor-agnostic OSS — no customer-specific content
This repo is the public OSS distribution. **Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies.** That includes:
- Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
- Cloud project IDs, internal hostnames, runbook paths from a particular install (`/opt/<deployment>`, `<host>.<internal-domain>`, `prj-<org>-…`, internal SA emails).
- Cross-references to private repos (`<private-org>/<private-repo>#NN`). Describe the integration in generic terms or link to public examples instead.
When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (`example.com`, `<your-host>`, `<install-dir>`). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples — not hard-coded assumptions.
Customer-specific automation, hostnames, and identities live in private infra repos that *consume* this OSS. The OSS describes capabilities, defaults, and configuration knobs — not how a specific operator wired them up.
## Changelog discipline — non-negotiable
**Every PR that adds, removes, or changes user-visible behavior MUST update `CHANGELOG.md` in the same PR.** No exceptions, no follow-ups, no "I'll do it after merge". User-visible = anything an operator, end-user, or downstream integrator can observe: CLI flags / output / exit codes, REST endpoints / payloads / status codes, web UI, `instance.yaml` schema, env vars, `extract.duckdb` contract, Docker / compose / Caddyfile knobs, default behaviors, breaking changes, security fixes.
**How:**
- Add a bullet under the topmost `## [Unreleased]` heading (create one if missing — it sits above the latest released version).
- Group by `### Added` / `### Changed` / `### Fixed` / `### Removed` / `### Internal` (Keep-a-Changelog sections).
- Mark breaking changes with `**BREAKING**` at the start of the bullet — operators grep for that string before bumping the pin.
- Reference the relevant doc/runbook if one exists (e.g. `see docs/auth-groups.md`), don't restate it.
- Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under `### Internal` — still log them, just keep them terse.
**When you cut a release:**
- Rename `## [Unreleased]``## [X.Y.Z] — YYYY-MM-DD`.
- Append a new empty `## [Unreleased]` section at the top so the next PR has somewhere to land.
- Bump `version` in `pyproject.toml` to match `X.Y.Z`.
- Tag the merge commit as `vX.Y.Z` and push the tag.
**If you find yourself opening a PR without a CHANGELOG entry, stop and add one before requesting review.** Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.
## Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs
- Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (`grep -niE '<token1>|<token2>|...'`). If anything matches, generalize or remove it.