# Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing `extract.duckdb` file. The `SyncOrchestrator` attaches all extract databases into a master `analytics.duckdb`, making every table available through a unified view layer without copying data unnecessarily.

## Architecture: extract.duckdb Contract

Every connector produces the same output structure:

```
/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)
```

The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `analytics.duckdb`, and creates master views.

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
extract.duckdb    extract.duckdb    extract.duckdb
+ data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
            SyncOrchestrator.rebuild()
     ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI
           (serve)  (agnes pull)
```
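
The scan-and-attach step in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the actual `SyncOrchestrator` code: the helper names `build_attach_statements` and `build_view_statement`, the alias scheme (source directory name), and the exact SQL are assumptions. Only the `/data/extracts/*/extract.duckdb` layout comes from the description above. The sketch only builds SQL strings, so it runs without DuckDB installed.

```python
from pathlib import Path


def build_attach_statements(extracts_root: Path) -> list[str]:
    """Return one ATTACH statement per extract.duckdb found under extracts_root.

    Hypothetical helper: the real orchestrator may name aliases differently.
    """
    statements = []
    # Each source lives in its own directory: <root>/<source_name>/extract.duckdb
    for db_path in sorted(extracts_root.glob("*/extract.duckdb")):
        alias = db_path.parent.name  # assumed: directory name doubles as the alias
        statements.append(f"ATTACH '{db_path.as_posix()}' AS \"{alias}\" (READ_ONLY)")
    return statements


def build_view_statement(alias: str, table_name: str) -> str:
    """Return SQL for a master view that points at a table in an attached extract."""
    return (
        f'CREATE OR REPLACE VIEW "{table_name}" AS '
        f'SELECT * FROM "{alias}"."{table_name}"'
    )
```

Because the master views only select from the attached databases, a rebuild of this kind would not copy any table data, which matches the "without copying data unnecessarily" goal stated earlier.
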

## Supported Data Sources

| Mode | Distribution | Sources | Use when |
|------|--------------|---------|----------|
| **Batch pull** (`local`) | Parquet on disk, scheduled | Keboola | Source has a native bulk-export and the table fits on disk |
| **Materialized SQL** (`materialized`) | Parquet on disk, scheduled query | BigQuery, Keboola | Source table is too large to mirror as-is; you want a curated subset / aggregate on disk |
| **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
| **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |

The first three modes are what `agnes pull` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `agnes pull`-distributed parquets.

Admins manage per-source registrations through the `/admin/tables` UI (per-connector tabs for BigQuery / Keboola / Jira) or the `agnes admin register-table` CLI; per-row "Manage access" deep-links to `/admin/access` for granting tables to user groups via `resource_grants(group, ResourceType.TABLE, table_id)`.

Analysts get a closed loop with Claude Code: `agnes init` writes `<workspace>/.claude/settings.json` with SessionStart (`agnes pull --quiet`) and SessionEnd (`agnes push --quiet`) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.

Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
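
As a rough illustration of that contract, a new connector could sanity-check its `_meta` rows before writing them. The column names come from the list above. The `MetaRow` dataclass, the field types, the `validate_meta_row` helper, and the set of accepted `query_mode` values (taken from the modes table) are hypothetical and may not match the Agnes codebase.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone

# Assumed: the three mode identifiers shown in the modes table; the real set may differ.
KNOWN_QUERY_MODES = {"local", "materialized", "remote"}


@dataclass
class MetaRow:
    """One row of the _meta table (field types are assumed, not confirmed)."""
    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str


def validate_meta_row(row: MetaRow) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row looks fine."""
    problems = []
    if not row.table_name:
        problems.append("table_name is empty")
    if row.rows < 0 or row.size_bytes < 0:
        problems.append("rows and size_bytes must be non-negative")
    if row.query_mode not in KNOWN_QUERY_MODES:
        problems.append(f"unknown query_mode: {row.query_mode!r}")
    return problems


# Column order as listed in the README
META_COLUMNS = [f.name for f in fields(MetaRow)]
```
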

## Quick Start with Docker

```bash
# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
  --profile tls up -d
```

Once running, the FastAPI app is available at `http://localhost:8000` (or `https://$DOMAIN` in TLS mode). See [`docs/DEPLOYMENT.md`](docs/DEPLOYMENT.md) for cert provisioning and auto-rotation via `scripts/ops/agnes-tls-rotate.sh`.

To trigger a manual sync:

```bash
curl -X POST http://localhost:8000/api/sync/trigger
```

## Local sync & auto-update

Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `agnes pull` is the distribution path:

```bash
agnes pull           # delta-pull: manifest → MD5 compare → download changed → rebuild views
agnes pull --quiet   # same, no progress output (for hooks/cron)
agnes push           # push session jsonl + CLAUDE.local.md back to the server
```
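
The delta-pull comment above compresses several steps into one line. Below is a minimal sketch of the compare step only, assuming the manifest maps relative file paths to MD5 hex digests. The manifest shape and the names `md5_of_file` and `files_to_download` are illustrative assumptions, not the actual `agnes pull` implementation.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_download(manifest: dict[str, str], local_root: Path) -> list[str]:
    """Return manifest entries that are missing locally or whose MD5 differs."""
    changed = []
    for rel_path, remote_md5 in sorted(manifest.items()):
        local_path = local_root / rel_path
        if not local_path.is_file() or md5_of_file(local_path) != remote_md5:
            changed.append(rel_path)
    return changed
```

Only the returned paths would need to be fetched; unchanged parquets stay on disk, which is what makes the pull a delta rather than a full download.
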

`agnes init` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:

- `SessionStart` → `agnes pull --quiet` — fresh data on every session
- `SessionEnd` → `agnes push --quiet` — uploads notes and session log

Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
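
For orientation, the hooks section might look roughly like the structure built below. The nesting follows the Claude Code hooks settings format as we understand it; the exact JSON that `agnes init` emits is an assumption, and `build_hooks_settings` is a hypothetical helper rather than part of the CLI.

```python
import json


def build_hooks_settings() -> dict:
    """Build a settings dict with SessionStart/SessionEnd command hooks (assumed shape)."""

    def command_hook(command: str) -> list[dict]:
        # One hook group containing a single shell command
        return [{"hooks": [{"type": "command", "command": command}]}]

    return {
        "hooks": {
            "SessionStart": command_hook("agnes pull --quiet"),
            "SessionEnd": command_hook("agnes push --quiet"),
        }
    }


# Print the JSON that would be written to <workspace>/.claude/settings.json
print(json.dumps(build_hooks_settings(), indent=2))
```
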

### Admin: which tables auto-sync to whom

The auto-sync set per analyst is the intersection of:

1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
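
The two conditions above amount to a set intersection. The sketch below uses plain dictionaries as stand-ins for the table registry and the grants table; the data shapes and the `auto_sync_tables` name are assumptions for illustration, not the server's actual query.

```python
SYNCABLE_MODES = {"local", "materialized"}


def auto_sync_tables(
    registry: dict[str, str],
    grants: dict[str, set[str]],
    analyst_groups: set[str],
) -> set[str]:
    """Return table ids that are both on disk (syncable mode) and granted to the analyst.

    registry: table_id -> query_mode
    grants:   group name -> set of granted table_ids
    """
    on_disk = {table_id for table_id, mode in registry.items() if mode in SYNCABLE_MODES}
    granted: set[str] = set()
    for group in analyst_groups:
        granted |= grants.get(group, set())
    return on_disk & granted
```

For example, with `registry = {"orders": "local", "events": "remote"}` and both tables granted to the analyst's group, only `orders` ends up in the auto-sync set, because `events` has no parquet on disk.
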

To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `agnes pull`.

For BigQuery, register a `query_mode='materialized'` table with a SQL body:

```bash
agnes admin register-table orders_90d \
  --source-type bigquery \
  --query-mode materialized \
  --query @docs/queries/orders_90d.sql \
  --schedule "every 6h"
```

The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `agnes pull`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB). A query whose BQ dry-run estimate exceeds this limit is skipped.
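
The guardrail boils down to a single comparison between the dry-run estimate and the configured limit. A minimal sketch, assuming an estimate exactly equal to the limit is still allowed (the boundary behaviour and the `should_materialize` name are assumptions):

```python
DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


def should_materialize(estimated_bytes: int, max_bytes: int = DEFAULT_MAX_BYTES) -> bool:
    """Return True when the dry-run estimate is within the configured budget."""
    return estimated_bytes <= max_bytes
```
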

## Development Setup

```bash
# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v
```

## Project Structure

```
├── src/                      # Core engine
│   ├── db.py                 # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py       # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/         # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py           # Data profiling
│   └── catalog_export.py     # OpenMetadata catalog export
├── app/                      # FastAPI application
│   ├── main.py               # App setup, router registration
│   ├── api/                  # REST API (sync, data, catalog, admin, auth)
│   ├── auth/                 # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                  # HTML dashboard routes
├── connectors/               # Data source connectors (extract.duckdb contract)
│   ├── keboola/              # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/             # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/                 # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                      # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── services/                 # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                  # Utility + migration scripts
├── config/                   # Configuration templates (instance.yaml.example)
├── docs/                     # Documentation + metric YAML definitions
└── tests/                    # Test suite (633 tests)
```

## Configuration

| File | Purpose |
|------|---------|
| `config/instance.yaml` | Instance-specific settings: branding, data source type, auth provider, Google domain |
| `.env` | Secrets and environment variables — never committed |
| `system.duckdb` `table_registry` table | Table definitions managed via `POST /api/admin/register-table` (or `PUT /api/admin/registry/{id}` to update) or the web UI |

Copy the example to get started:

```bash
cp config/instance.yaml.example config/instance.yaml
```

See `config/instance.yaml.example` for all available options.

## Documentation

- [Hackathon TL;DR](docs/HACKATHON.md) — condensed deploy + dev playbooks (for both humans and AI agents)
- [Onboarding Guide](docs/ONBOARDING.md) — end-to-end Terraform deployment into a GCP project (recommended for production)
- [Deployment Guide](docs/DEPLOYMENT.md) — chooses between Terraform and Docker Compose; covers OSS self-host
- [Configuration Reference](docs/CONFIGURATION.md) — `instance.yaml`, env vars, per-instance options
- [Architecture](ARCHITECTURE.md) — orchestrator, extractors, DB layout
- [Quickstart](docs/QUICKSTART.md) — local development

## Contributing

1. Fork the repository and create a feature branch.
2. Run `pytest tests/ -v` to verify all tests pass before opening a pull request.
3. Keep commits focused and messages concise.
4. Open a pull request against `main` with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

## License

This project is licensed under the [MIT License](LICENSE).