Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)
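
The discovery step in the diagram can be sketched as follows. This is a minimal illustration, not the actual SyncOrchestrator code: the function name build_attach_statements and the choice to alias each attached database after its source directory are assumptions. It only generates DuckDB ATTACH statements as strings, so it runs without a database.

```python
from pathlib import Path


def build_attach_statements(extracts_root: Path) -> list[str]:
    """Return one ATTACH statement per discovered extract.duckdb file.

    Hypothetical helper: the real orchestrator may name aliases differently.
    """
    statements = []
    # Each source lives in its own directory: <root>/<source_name>/extract.duckdb
    for db_path in sorted(extracts_root.glob("*/extract.duckdb")):
        source_name = db_path.parent.name
        # Assumption: the alias is the source directory name, attached read-only.
        statements.append(
            f"ATTACH '{db_path.as_posix()}' AS \"{source_name}\" (READ_ONLY)"
        )
    return statements
```

Directories without an extract.duckdb file are simply skipped, which matches the idea that a source only becomes visible once its extractor has produced output.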

Supported Data Sources

| Mode | Distribution | Sources | Use when |
|---|---|---|---|
| Batch pull (local) | Parquet on disk, scheduled | Keboola | Source has a native bulk-export and the table fits on disk |
| Materialized SQL (materialized) | Parquet on disk, scheduled query | BigQuery, Keboola | Source table is too large to mirror as-is; you want a curated subset / aggregate on disk |
| Remote attach (remote) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
| Real-time push | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |

The first three modes are what da sync distributes to analysts. The fourth is server-side only — analysts query Jira data through the same da sync-distributed parquets.

Admins manage per-source registrations through the /admin/tables UI (per-connector tabs for BigQuery / Keboola / Jira) or the da admin register-table CLI; per-row "Manage access" deep-links to /admin/access for granting tables to user groups via resource_grants(group, ResourceType.TABLE, table_id).

Analysts get a closed loop with Claude Code: da analyst setup writes <workspace>/.claude/settings.json with SessionStart (da sync --quiet) and SessionEnd (da sync --upload-only --quiet) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
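
As a rough illustration of the _meta contract, the sketch below models one _meta row and builds a parameterized INSERT for it. The column names come from the list above; the MetaRow class, the meta_insert helper, the column types, and the set of accepted query_mode values are assumptions made for this example, not the project's actual code.

```python
from dataclasses import astuple, dataclass
from datetime import datetime, timezone

# Column order follows the _meta contract described above.
META_COLUMNS = (
    "table_name", "description", "rows", "size_bytes", "extracted_at", "query_mode",
)

# Assumption: these are the accepted query_mode values (modes named in this README).
KNOWN_QUERY_MODES = {"local", "materialized", "remote"}


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (types are assumed, not confirmed)."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str

    def __post_init__(self) -> None:
        if self.query_mode not in KNOWN_QUERY_MODES:
            raise ValueError(f"unknown query_mode: {self.query_mode!r}")
        if self.rows < 0 or self.size_bytes < 0:
            raise ValueError("rows and size_bytes must be non-negative")


def meta_insert(row: MetaRow) -> tuple[str, tuple]:
    """Build a parameterized INSERT statement and its parameters."""
    placeholders = ", ".join("?" for _ in META_COLUMNS)
    sql = f"INSERT INTO _meta ({', '.join(META_COLUMNS)}) VALUES ({placeholders})"
    return sql, astuple(row)


# Example usage with made-up values.
row = MetaRow("orders", "Example table", 1200, 48_000, datetime.now(timezone.utc), "local")
sql, params = meta_insert(row)
```

The returned pair can be passed to any DB-API style execute(sql, params) call; validating the row before writing keeps a malformed extract from reaching the orchestrator.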

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/ops/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger

Local sync & auto-update

Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. da sync is the distribution path:

da sync             # delta-pull: manifest → MD5 compare → download changed → rebuild views
da sync --quiet     # same, no progress output (for hooks/cron)
da sync --upload-only  # push session jsonl + CLAUDE.local.md back to the server
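
The "MD5 compare" step of the delta-pull can be sketched like this. It assumes the manifest is a mapping of file name to MD5 hex digest, which is a guess at the shape rather than the documented format; md5_of and files_to_download are hypothetical names.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets are not loaded into memory."""
    # MD5 is used here as a change-detection checksum, not for security.
    digest = hashlib.md5(usedforsecurity=False)
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_download(manifest: dict[str, str], local_dir: Path) -> list[str]:
    """Return manifest entries that are missing locally or whose MD5 differs."""
    changed = []
    for name, remote_md5 in sorted(manifest.items()):
        local_path = local_dir / name
        if not local_path.is_file() or md5_of(local_path) != remote_md5:
            changed.append(name)
    return changed
```

Only the names returned by files_to_download would need to be fetched; unchanged parquets stay on disk, which is what makes the pull a delta rather than a full copy.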

da analyst setup writes Claude Code lifecycle hooks into <workspace>/.claude/settings.json:

  • SessionStart runs da sync --quiet: fresh data on every session
  • SessionEnd runs da sync --upload-only --quiet: uploads notes and session log

Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
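
For orientation, the sketch below merges the two hook commands into a settings dictionary. The nested layout (an event name mapping to a list of entries, each with a "hooks" list of command objects) follows Claude Code's hooks schema as generally documented; the exact file that da analyst setup writes may differ, and merge_session_hooks is a hypothetical helper.

```python
import copy
import json


def merge_session_hooks(settings: dict) -> dict:
    """Return a copy of settings with the two da sync hooks added once."""
    merged = copy.deepcopy(settings)
    hooks = merged.setdefault("hooks", {})
    commands = {
        "SessionStart": "da sync --quiet",
        "SessionEnd": "da sync --upload-only --quiet",
    }
    for event, command in commands.items():
        entries = hooks.setdefault(event, [])
        # Skip the command if an earlier run already registered it.
        already_present = any(
            hook.get("command") == command
            for entry in entries
            for hook in entry.get("hooks", [])
        )
        if not already_present:
            entries.append({"hooks": [{"type": "command", "command": command}]})
    return merged


# Example: what the merged settings could look like for an empty workspace.
print(json.dumps(merge_session_hooks({}), indent=2))
```

Merging rather than overwriting keeps any other workspace settings intact, and the presence check makes repeated setup runs idempotent.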

Admin: which tables auto-sync to whom

The auto-sync set per analyst is the intersection of:

  1. Tables with query_mode IN ('local', 'materialized') — these have parquets on disk and end up in the manifest
  2. Tables granted to one of the analyst's groups via resource_grants(group, ResourceType.TABLE, table_id) (see docs/RBAC.md)

To enroll a new table for auto-sync, register it (or update its query_mode) and grant it to the relevant groups in /admin/access. New analysts get the same set on their next da sync.
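
The intersection rule can be written out as a small set computation. The data shapes here are assumptions for illustration: registry maps table id to query_mode, and grants maps group name to the table ids granted to it. The real logic lives in the server's RBAC layer and is not shown in this README.

```python
def auto_sync_tables(
    registry: dict[str, str],
    grants: dict[str, set[str]],
    user_groups: set[str],
) -> set[str]:
    """Tables that have parquets on disk AND are granted to one of the user's groups."""
    # Condition 1: only local and materialized tables produce parquets.
    on_disk = {
        table for table, mode in registry.items() if mode in ("local", "materialized")
    }
    # Condition 2: union of everything granted to any of the analyst's groups.
    granted: set[str] = set()
    for group in user_groups:
        granted |= grants.get(group, set())
    return on_disk & granted


# Example with made-up tables and groups.
registry = {"orders": "local", "orders_90d": "materialized", "events": "remote"}
grants = {"sales": {"orders", "events"}, "finance": {"orders_90d"}}
print(sorted(auto_sync_tables(registry, grants, {"sales"})))  # → ['orders']
```

In this example the remote table events is granted to the sales group but is excluded, because remote tables have no parquet to distribute.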

For BigQuery, register a query_mode='materialized' table with a SQL body:

da admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next da sync. Cost guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB); a query whose BQ dry-run estimate exceeds this limit is skipped.
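
The cost guardrail amounts to a single comparison, sketched below. The function name should_materialize is hypothetical, and the estimated byte count is assumed to come from a BigQuery dry run performed elsewhere; only the 10 GiB default is taken from the text above.

```python
GIB = 1024 ** 3

# Default taken from data_source.bigquery.max_bytes_per_materialize (10 GiB).
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB


def should_materialize(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> bool:
    """Return False when the dry-run estimate exceeds the configured limit."""
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    return estimated_bytes <= max_bytes


# Example: a 2 GiB scan is allowed, a 12 GiB scan would be skipped.
print(should_materialize(2 * GIB))   # → True
print(should_materialize(12 * GIB))  # → False
```

An estimate exactly equal to the limit is allowed in this sketch, since only queries exceeding it are described as skipped.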

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

| File | Purpose |
|---|---|
| config/instance.yaml | Instance-specific settings: branding, data source type, auth provider, Google domain |
| .env | Secrets and environment variables (never committed) |
| system.duckdb table_registry table | Table definitions managed via POST /api/admin/register-table (or PUT /api/admin/registry/{id} to update) or the web UI |

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Contributing

  1. Fork the repository and create a feature branch.
  2. Run pytest tests/ -v to verify all tests pass before opening a pull request.
  3. Keep commits focused and messages concise.
  4. Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.