AI-Cognitive-Leap/agnes-the-ai-analyst

Fork of keboola/agnes-the-ai-analyst (via the manana2520 GitHub fork). Develop here, then push to the GitHub fork to open upstream PRs.

ZdenekSrotyr 1ca5295d54 docs: add HACKATHON.md — condensed deploy + dev playbooks (#21) Written for both humans and AI agents: explicit commands, expected outputs, troubleshooting tables, 'safe to run anytime' vs 'requires thought' sections, pitfalls checklist. Three parts: 1. Deploy for a new customer (45 min target, 7 steps) 2. Develop against Agnes (branch → image → dev VM loop, common tasks) 3. AI agent checklist (guardrails, verification, common pitfalls) Complements the deep docs (ONBOARDING.md, DEPLOYMENT.md, architecture.md) with a practical quick-reference for hackathon-style deploys.	2026-04-21 21:33:06 +02:00
.github/workflows	ci: propagate infra-v* tags to template repo + auto-merge rules (#17)	2026-04-21 21:32:58 +02:00
app	feat(ui): version badge as shared partial, injected into every full-page template	2026-04-21 20:51:55 +02:00
cli	fix: table_catalog in re-attach query, --limit in hybrid CLI	2026-04-11 20:13:35 +02:00
config	feat: add CLAUDE.md template for analyst bootstrap	2026-04-10 19:41:07 +02:00
connectors	fix: strip HTML from table and column descriptions in OpenMetadata enricher	2026-04-09 18:42:37 +02:00
dev_docs	docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture	2026-04-09 18:44:25 +02:00
docs	docs: add HACKATHON.md — condensed deploy + dev playbooks (#21)	2026-04-21 21:33:06 +02:00
infra	fix(version): bake AGNES_VERSION/CHANNEL/COMMIT_SHA into image ENV	2026-04-21 21:00:04 +02:00
scripts	chore: add switch-dev-vm.sh helper for hackathon (#20)	2026-04-21 21:33:02 +02:00
services	fix: make bot.py FileHandler resilient to missing log directory	2026-04-13 13:28:59 +02:00
src	fix: BQ COUNT subquery alias, wrap ImportError in RemoteQueryError	2026-04-11 20:29:03 +02:00
tests	fix(auth): /auth/bootstrap activates seed users, disabled only by real password	2026-04-21 20:01:20 +02:00
.dockerignore	refactor: consolidate deps into pyproject.toml, remove requirements.txt	2026-04-09 13:17:59 +02:00
.gitignore	infra: add bootstrap-gcp.sh for per-customer GCP setup	2026-04-21 16:18:35 +02:00
ARCHITECTURE.md	Update docs for modular architecture (auth/, services/, scripts/)	2026-03-09 13:11:40 +01:00
Caddyfile	feat: add Caddy HTTPS reverse proxy and production compose override	2026-04-09 16:39:23 +02:00
CHANGELOG.md	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
CLAUDE.md	docs: add hybrid query usage instructions to CLAUDE.md	2026-04-11 11:11:10 +02:00
docker-compose.ci.yml	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
docker-compose.host-mount.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.override.yml	chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs	2026-04-08 12:10:47 +02:00
docker-compose.prod.yml	fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test	2026-04-21 16:54:18 +02:00
docker-compose.test.yml	feat: add SEED_ADMIN_EMAIL for Docker test environments	2026-03-31 09:48:12 +02:00
docker-compose.yml	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
Dockerfile	fix(image): add AGNES_COMMIT_SHA build-arg to Dockerfile + release.yml	2026-04-21 21:00:30 +02:00
LICENSE	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
Makefile	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
pyproject.toml	chore(deps): bump python-multipart from 0.0.24 to 0.0.26	2026-04-21 13:26:19 +00:00
pytest.ini	test: add shared test infrastructure (fixtures, factories, assertions, mocks)	2026-04-12 11:05:35 +02:00
README.md	docs: add HACKATHON.md — condensed deploy + dev playbooks (#21)	2026-04-21 21:33:06 +02:00
uv.lock	chore(deps): bump python-multipart from 0.0.24 to 0.0.26	2026-04-21 13:26:19 +00:00

README.md

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┤
              ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)
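
The rebuild step in the diagram can be sketched in Python as follows. This is a minimal illustration, not the actual SyncOrchestrator code: the function names `find_extracts` and `rebuild_statements` are hypothetical, the table names are passed in rather than read from each extract's _meta table, and the sketch only builds SQL strings instead of executing them against analytics.duckdb.

```python
from pathlib import Path


def find_extracts(extracts_root: Path) -> dict[str, Path]:
    """Map source name -> extract.duckdb path for every source directory found."""
    return {
        db.parent.name: db
        for db in sorted(extracts_root.glob("*/extract.duckdb"))
    }


def rebuild_statements(extracts_root: Path, tables: dict[str, list[str]]) -> list[str]:
    """Build the SQL a rebuild could run against analytics.duckdb.

    `tables` maps source name -> table names. In a real run these would
    presumably come from each extract's _meta table (assumption: not read here).
    Paths and names are interpolated without escaping, so this is only
    suitable for trusted input.
    """
    statements = []
    for source, db_path in find_extracts(extracts_root).items():
        # Attach read-only so the master database never modifies an extract.
        statements.append(f"ATTACH '{db_path.as_posix()}' AS \"{source}\" (READ_ONLY)")
        for table in tables.get(source, []):
            # One master view per table, pointing into the attached extract.
            statements.append(
                f'CREATE OR REPLACE VIEW "{table}" AS SELECT * FROM "{source}"."{table}"'
            )
    return statements
```

Each returned string could then be passed to a DuckDB connection opened on analytics.duckdb. Because the views only reference the attached extracts, no table data is copied.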

Supported Data Sources

Source	Mode	Description
Keboola	Batch pull	DuckDB Keboola extension downloads tables to Parquet on a schedule
BigQuery	Remote attach	DuckDB BQ extension; queries execute in BigQuery, no local download
Jira	Real-time push	Webhook receiver updates Parquet files incrementally

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
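
As a rough sketch of the _meta contract, the six columns listed above can be modelled as a dataclass with a helper that reports which required columns an extract is missing. The Python types, the `MetaRow` and `missing_meta_columns` names, and the example `query_mode` values are assumptions for illustration; the README only specifies the column names.

```python
from dataclasses import dataclass, fields
from datetime import datetime


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (column names from the README, types assumed)."""
    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str  # assumed values, e.g. "local" or "remote"


# Column names in contract order, derived from the dataclass definition.
REQUIRED_META_COLUMNS = tuple(f.name for f in fields(MetaRow))


def missing_meta_columns(actual_columns: list[str]) -> list[str]:
    """Return the required _meta columns that are absent from `actual_columns`."""
    actual = set(actual_columns)
    return [name for name in REQUIRED_META_COLUMNS if name not in actual]
```

A new connector's tests could call `missing_meta_columns` on the column list of its generated _meta table and assert that the result is empty.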

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

Once the containers are up, the FastAPI app is available at http://localhost:8000. To trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger
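
The same request can be issued from Python using only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the default base URL matches the local Docker setup above, and a deployed instance may require authentication headers that are not shown here.

```python
import urllib.request


def build_sync_trigger_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the manual sync endpoint (empty body, like the curl call)."""
    url = base_url.rstrip("/") + "/api/sync/trigger"
    return urllib.request.Request(url, data=b"", method="POST")


def trigger_sync(base_url: str = "http://localhost:8000", timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    with urllib.request.urlopen(build_sync_trigger_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling `trigger_sync()` against a running instance returns the response status; the shape of the response body is not documented in this README, so it is not parsed here.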

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File	Purpose
`config/instance.yaml`	Instance-specific settings: branding, data source type, auth provider, Google domain
`.env`	Secrets and environment variables — never committed
`system.duckdb` `table_registry` table	Table definitions managed via `POST /api/admin/tables/{id}` or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Hackathon TL;DR — condensed deploy + dev playbooks (for both humans and AI agents)
Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
Configuration Reference — instance.yaml, env vars, per-instance options
Architecture — orchestrator, extractors, DB layout
Quickstart — local development

Contributing

1. Fork the repository and create a feature branch.
2. Run pytest tests/ -v to verify all tests pass before opening a pull request.
3. Keep commits focused and messages concise.
4. Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.