AI-Cognitive-Leap/agnes-the-ai-analyst

Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Petr Simecek 1c18cdf15f release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)		2026-04-26 16:48:55 +02:00

* feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS

LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups empty, so group-aware UI/code paths can't be exercised on localhost without a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array matching the production {id, name} shape) populates the session on every dev-bypass request — same structure the OAuth callback writes, so mock and prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise on PAT/CLI requests; malformed input falls back to [] with a WARNING so the dev mock never breaks the dev flow.

* refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate

Three small follow-ups on the same dev-mock vector before merge:

- Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot instead of silently logging on the first authenticated request, where it's easy to miss.
- Cache the parsed result single-slot, keyed by the raw env-string. Avoids re-parsing JSON on every authenticated request without test-isolation surprises — when the env value changes, the key changes and the cache transparently rebuilds.
- Stop mutating the parsed-input dicts (item.setdefault → spread-merge) so the cached list stays a fresh value on every rebuild.
- Replace the try/except guard around request.session with hasattr — SessionMiddleware is always registered, the silent except was paranoid.

Tests grow by a direct session-cookie inspection (decoupled from the profile template) and three startup-banner log assertions.

* fix(auth): drop fragile session-decoder test + actually skip empty-target write

Two follow-ups on the LOCAL_DEV_GROUPS feature before merge:

- Drop test_session_holds_mocked_groups_directly. It manually decoded the signed session cookie via TimestampSigner + base64, hardcoding both the Starlette session-cookie format and the 14-day max_age. Starlette has changed its session encoding before (URLSafeTimedSerializer pre-0.20) and would do so again silently — the test would fail with a cryptic BadSignature, not a clear "mock is broken" signal. The remaining test_dev_user_sees_mocked_groups_on_profile already covers the same observable signal (mocked groups in /profile body) without coupling to Starlette internals.
- Actually skip the session write when target_groups is empty. The previous comment claimed compare-then-write avoided spurious Set-Cookie noise on PAT/CLI requests, but on those requests session.get("google_groups") is None and target is [], so None != [] always evaluates True and the write fired anyway, marking the session dirty and re-issuing Set-Cookie on every request. Adding `target_groups and ...` to the guard makes the comment honest: empty mock now genuinely no-ops, stable browser sessions still skip via value-equality, and the only remaining write is the one that actually changes state.

33 auth tests still pass locally.

* fix(auth): match production's always-write semantics for stale dev groups

Devin code-review finding on PR #70: my earlier `target_groups and ...` short-circuit silently diverged from the production OAuth callback. In app/auth/providers/google.py:189-194 the callback always writes session.google_groups on each login — including [] on failure or empty token — so the session always reflects authoritative current state. The mock should match.

Failure mode the previous guard left open: a developer sets LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed cookie, then the developer unsets the env var and reloads. target → [], session.get → [{...}], `if target_groups and ...` is False, no write, stale groups stay in the browser session indefinitely. Mock now lies about state until logout.

Fix splits the guard:

- target_groups truthy + value-changed → write the new mock (existing path)
- target_groups falsy + non-empty stored → write [] to clear stale state
- otherwise no-op (target [] + stored None/[]: no transition to record)

PAT/CLI requests with no prior session still take the no-op path (target=[], session.get → None which is falsy), so the original goal of suppressing spurious Set-Cookie noise on token traffic is preserved. Tests already cover the populated and unset paths; the new clear-stale branch is correct by construction (production has the same shape) and the rare manual reset workflow.

* release(0.11.2): default mocked groups in make local-dev + docs/local-development.md

Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience follow-up: every `make local-dev` now boots with two sensible default mocked groups (Local Dev Engineers + Local Dev Admins on example.com), so /profile and group-aware code paths render something realistic without the operator having to discover and set LOCAL_DEV_GROUPS.

Layered so the default lives in the workflow, not the contract:

- scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":=" syntax — only sets the var when the operator hasn't already. Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable: LOCAL_DEV_GROUPS= make local-dev.
- docker-compose.local-dev.yml swaps the commented JSON example for a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the shell, the compose file just propagates it. Operators running `docker compose up` directly without the wrapper script get an empty mock (correct: they didn't opt into the make-driven defaults).
- Makefile help line mentions the mocked groups so the behavior is visible without grepping.

New docs/local-development.md consolidates dev-onboarding instructions that were previously scattered across docker-compose.local-dev.yml inline comments, docs/auth-groups.md "Local-dev mock" section, the Makefile help text, and CLAUDE.md "First-Time Setup". Single page now covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking controls + verification, what is not mocked (Cloud Identity, real OAuth, admin Workspace permissions), and the safety rails that keep the dev shortcuts off production.

Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts [Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased] skeleton.

* fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion

Reported by an operator running `make local-dev` against the freshly released 0.11.2 — the LOCAL_DEV_MODE banner showed:

    LOCAL_DEV_GROUPS is not valid JSON, ignoring: Expecting ',' delimiter: line 1 column 70 (char 69)
    LOCAL_DEV_GROUPS is set but produced no valid groups — check the WARNING above for the parse error.

Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter expansion. Bash matches `}` to close the expansion at the first `}` encountered in the body, regardless of context — even one inside a nested JSON object literal. The two-element JSON array was therefore truncated to the first group's closing brace, leaving an unparseable fragment:

    [{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers"

There is no escaping syntax for `}` inside parameter expansion (the backslash escapes I had only escaped the quotes — `}` reaches bash literally).

Fix: hold the default in a single-quoted variable and reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`. The variable's value is opaque to the expansion — no `}` matching inside it — so the JSON survives intact.

Verified with `python -m json`:

    parsed OK: 2 groups: ['local-dev-engineers@example.com', 'local-dev-admins@example.com']

Operators on a running 0.11.2 stack: `make local-dev-down && make local-dev` to pick up the corrected default.

* fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link

Two follow-ups from a Devin code-review pass on PR #70:

- run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to ${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form substitutes the default when the variable is unset OR set-but-empty, silently overwriting the documented disable knob. Three places promise this works — docs/local-development.md, the CHANGELOG entry, and the script's own comment — so the bug was an operator-facing lie, not just an implementation detail. The bare - form only substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now reaches the Python parser as "" and short-circuits to []. Verified with both empty and unset shells.
- CHANGELOG.md: add the [0.11.2] link reference at the bottom. Keep-a-Changelog convention is to mirror every version heading with a release-tag link in the footer; the 0.11.2 heading was missing its counterpart, breaking the Markdown link rendering on GitHub.

---------

Co-authored-by: Claude <noreply@anthropic.com>

.github/workflows	feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52)	2026-04-25 23:19:00 +02:00
app	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
cli	release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43)	2026-04-22 21:18:18 +02:00
config	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
connectors	fix: strip HTML from table and column descriptions in OpenMetadata enricher	2026-04-09 18:42:37 +02:00
dev_docs	docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture	2026-04-09 18:44:25 +02:00
docs	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
infra	feat(auth): Google Workspace groups on /profile + tag-triggered Keboola deploy workflow (#56)	2026-04-26 00:56:44 +02:00
scripts	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
services	fix: make bot.py FileHandler resilient to missing log directory	2026-04-13 13:28:59 +02:00
src	User management + PAT + CLI distribution + HTML auth redirect (#9 #10 #11 #12) (#28)	2026-04-22 14:24:28 +02:00
tests	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
.dockerignore	refactor: consolidate deps into pyproject.toml, remove requirements.txt	2026-04-09 13:17:59 +02:00
.gitignore	infra: add bootstrap-gcp.sh for per-customer GCP setup	2026-04-21 16:18:35 +02:00
ARCHITECTURE.md	Update docs for modular architecture (auth/, services/, scripts/)	2026-03-09 13:11:40 +01:00
Caddyfile	feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52)	2026-04-25 23:19:00 +02:00
CHANGELOG.md	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
CLAUDE.md	docs(claude): non-negotiable CHANGELOG.md update rule + [Unreleased] skeleton (#59)	2026-04-26 01:10:32 +02:00
docker-compose.ci.yml	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
docker-compose.host-mount.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.local-dev.yml	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
docker-compose.override.yml	chore(deploy): trust proxy headers + document HTTPS env vars (#48)	2026-04-24 08:52:53 +02:00
docker-compose.prod.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.test.yml	chore(deploy): trust proxy headers + document HTTPS env vars (#48)	2026-04-24 08:52:53 +02:00
docker-compose.tls.yml	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
docker-compose.yml	fix(deploy): pass CADDY_TLS through to caddy container (#55)	2026-04-26 01:46:42 +02:00
Dockerfile	chore(deploy): trust proxy headers + document HTTPS env vars (#48)	2026-04-24 08:52:53 +02:00
LICENSE	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
Makefile	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
pyproject.toml	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
pytest.ini	test: add shared test infrastructure (fixtures, factories, assertions, mocks)	2026-04-12 11:05:35 +02:00
README.md	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
uv.lock	chore(deps): bump python-multipart from 0.0.24 to 0.0.26	2026-04-21 13:26:19 +00:00
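
The LOCAL_DEV_GROUPS behavior described in the latest commit message (parse with a fallback to [], a single-slot cache keyed by the raw env string, and the split session-write guard) could look roughly like the sketch below. It is an illustration only, not the code under app/auth: the names parse_local_dev_groups and sync_session_groups are hypothetical, a plain dict stands in for request.session, and dropping non-object array items is an assumption.

```python
import json
import logging
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger("local_dev_groups_sketch")

# Single-slot cache: (raw env string, parsed groups). Hypothetical name.
_cache: Optional[Tuple[str, List[Dict[str, Any]]]] = None


def parse_local_dev_groups(raw: str) -> List[Dict[str, Any]]:
    """Parse a JSON array of {id, name} objects; return [] on empty or bad input."""
    global _cache
    if _cache is not None and _cache[0] == raw:
        return _cache[1]  # same raw string as last time: reuse the parsed list
    groups: List[Dict[str, Any]] = []
    if raw:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            logger.warning("LOCAL_DEV_GROUPS is not valid JSON, ignoring: %s", exc)
            parsed = []
        if isinstance(parsed, list):
            # Copy each item instead of mutating the parsed input
            # (skipping non-dict items is an assumption of this sketch).
            groups = [{**item} for item in parsed if isinstance(item, dict)]
    _cache = (raw, groups)
    return groups


def sync_session_groups(session: Dict[str, Any], target: List[Dict[str, Any]]) -> bool:
    """Apply the split guard; return True only when the session was written."""
    stored = session.get("google_groups")
    if target and stored != target:
        session["google_groups"] = target  # new or changed mock
        return True
    if not target and stored:
        session["google_groups"] = []  # clear stale groups after the env var is unset
        return True
    return False  # nothing to record (target [] and stored None/[]), or unchanged
```

With this shape, a PAT/CLI request that has no prior session and an empty mock never touches the session, while unsetting the env var in a browser session that still holds groups triggers exactly one clearing write.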

README.md

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)
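
The discovery step in the flow above (scan /data/extracts/*/extract.duckdb, then ATTACH each file) can be sketched with the standard library alone. This is not the SyncOrchestrator implementation: discover_extracts and attach_statements are hypothetical helpers, and using the source directory name as the alias and attaching with READ_ONLY are assumptions.

```python
from pathlib import Path
from typing import Dict, List


def discover_extracts(extracts_root: Path) -> Dict[str, Path]:
    """Map each source name (its directory name) to its extract.duckdb file."""
    return {
        db.parent.name: db
        for db in sorted(extracts_root.glob("*/extract.duckdb"))
    }


def attach_statements(extracts: Dict[str, Path]) -> List[str]:
    """Build one DuckDB ATTACH statement per discovered extract.

    READ_ONLY and the alias choice are assumptions of this sketch.
    """
    return [
        f"ATTACH '{path.as_posix()}' AS \"{name}\" (READ_ONLY)"
        for name, path in extracts.items()
    ]


if __name__ == "__main__":
    # Print the statements an orchestrator-like process might run.
    for statement in attach_statements(discover_extracts(Path("/data/extracts"))):
        print(statement)
```

Directories without an extract.duckdb file are skipped by the glob, which matches the idea that a connector only becomes visible once it has produced its extract.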

Supported Data Sources

Source	Mode	Description
Keboola	Batch pull	DuckDB Keboola extension downloads tables to Parquet on a schedule
BigQuery	Remote attach	DuckDB BQ extension; queries execute in BigQuery, no local download
Jira	Real-time push	Webhook receiver updates Parquet files incrementally

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
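
As a rough illustration of the _meta contract, the sketch below models one _meta row and builds a parameterized INSERT for it. The column names come from the list above; the Python types, the example query_mode value, and the MetaRow and meta_insert names are assumptions, not the project's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Tuple

# Column order as listed in the README.
META_COLUMNS = (
    "table_name",
    "description",
    "rows",
    "size_bytes",
    "extracted_at",
    "query_mode",
)


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (field types are assumed)."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str  # allowed values are not specified here; "local" below is a guess


def meta_insert(row: MetaRow) -> Tuple[str, Tuple[Any, ...]]:
    """Return a parameterized INSERT statement and its values for one row."""
    placeholders = ", ".join("?" for _ in META_COLUMNS)
    sql = f"INSERT INTO _meta ({', '.join(META_COLUMNS)}) VALUES ({placeholders})"
    return sql, tuple(getattr(row, column) for column in META_COLUMNS)


example = MetaRow(
    table_name="orders",
    description="Example table",
    rows=100,
    size_bytes=2048,
    extracted_at=datetime(2026, 4, 26, tzinfo=timezone.utc),
    query_mode="local",
)
sql, values = meta_insert(example)
```

A new extractor would run a statement like this against its own extract.duckdb after writing its data, once per table it exposes.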

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for certificate provisioning and automatic rotation via scripts/grpn/agnes-tls-rotate.sh. To trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger
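
The same trigger can be issued from Python using only the standard library. This is a minimal sketch: build_sync_trigger_request and trigger_sync are hypothetical helper names, and the request carries no credentials, so add whatever authentication your instance requires.

```python
import urllib.request


def build_sync_trigger_request(
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    url = base_url.rstrip("/") + "/api/sync/trigger"
    return urllib.request.Request(url, method="POST")


def trigger_sync(base_url: str = "http://localhost:8000", timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    Raises urllib.error.URLError if the app is not reachable.
    """
    request = build_sync_trigger_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    print(trigger_sync())
```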

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File	Purpose
`config/instance.yaml`	Instance-specific settings: branding, data source type, auth provider, Google domain
`.env`	Secrets and environment variables — never committed
`system.duckdb` `table_registry` table	Table definitions managed via `POST /api/admin/tables/{id}` or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Hackathon TL;DR — condensed deploy + dev playbooks (for both humans and AI agents)
Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
Configuration Reference — instance.yaml, env vars, per-instance options
Architecture — orchestrator, extractors, DB layout
Quickstart — local development

Contributing

Fork the repository and create a feature branch.
Run pytest tests/ -v to verify all tests pass before opening a pull request.
Keep commits focused and messages concise.
Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.