# Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing `extract.duckdb` file. The `SyncOrchestrator` attaches all extract databases into a master `analytics.duckdb`, making every table available through a unified view layer without copying data unnecessarily.

## Architecture: extract.duckdb Contract

Every connector produces the same output structure:

```
/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)
```

The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `analytics.duckdb`, and creates master views.

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
 extract.duckdb    extract.duckdb    extract.duckdb
+ data/*.parquet   (views → BQ)     + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
            SyncOrchestrator.rebuild()
     ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI
           (serve)   (da sync)
```

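The scan-and-attach step shown in the diagram can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's `SyncOrchestrator` code: the function names `discover_extracts` and `attach_statements` are hypothetical, attaching read-only is an assumption, and the creation of master views is left out because the README does not describe it in detail. The sketch only builds SQL strings; running them would need a DuckDB connection.

```python
from pathlib import Path


def discover_extracts(root: Path) -> dict[str, Path]:
    """Map each source name to its extract.duckdb file.

    Follows the documented layout: <root>/<source_name>/extract.duckdb.
    Directories without an extract.duckdb file are ignored.
    """
    return {db.parent.name: db for db in sorted(root.glob("*/extract.duckdb"))}


def attach_statements(extracts: dict[str, Path]) -> list[str]:
    """Build one DuckDB ATTACH statement per extract database.

    Assumption: each extract is attached read-only under its source name.
    """
    return [
        f"ATTACH '{path.as_posix()}' AS \"{name}\" (READ_ONLY);"
        for name, path in extracts.items()
    ]
```

Calling `attach_statements(discover_extracts(Path("/data/extracts")))` would yield one statement per source directory, in alphabetical order of source name.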
## Supported Data Sources

| Source | Mode | Description |
|--------|------|-------------|
| **Keboola** | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule |
| **BigQuery** | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download |
| **Jira** | Real-time push | Webhook receiver updates Parquet files incrementally |

Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.

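To make the `_meta` contract more concrete, here is a rough sketch of how an extractor might describe one row. The column names come from the paragraph above; everything else is an assumption made for illustration: the SQL column types, the `MetaRow` and `meta_params` names, and the example `query_mode` values are hypothetical and may differ from what the real connectors use.

```python
from dataclasses import astuple, dataclass
from datetime import datetime, timezone

# Column names follow the contract; the SQL types are assumptions.
META_DDL = """
CREATE TABLE IF NOT EXISTS _meta (
    table_name   VARCHAR,
    description  VARCHAR,
    rows         BIGINT,
    size_bytes   BIGINT,
    extracted_at TIMESTAMP,
    query_mode   VARCHAR
)
"""

# Positional placeholders, one per column, in contract order.
META_INSERT = "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)"


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (hypothetical helper type)."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str  # assumed values such as "local" or "remote"


def meta_params(row: MetaRow) -> tuple:
    """Return the row as a tuple ordered to match META_INSERT."""
    return astuple(row)
```

With the `duckdb` Python package installed, an extractor could run `META_DDL` once and then pass `META_INSERT` together with `meta_params(row)` to the connection's `execute` method for each extracted table.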
## Quick Start with Docker

```bash
# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up
```

Once running, the FastAPI app is available at `http://localhost:8000`. Trigger a manual sync:

```bash
curl -X POST http://localhost:8000/api/sync/trigger
```

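The same manual trigger can be issued from Python using only the standard library. This is a sketch that mirrors the `curl` call above; the helper names `build_sync_request` and `trigger_sync` are hypothetical, and it assumes the endpoint accepts an empty POST without authentication headers, which the README does not confirm. The response format is also not documented here, so the body is returned as plain text.

```python
import urllib.request


def build_sync_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request that the curl command sends (no body, no auth)."""
    url = base_url.rstrip("/") + "/api/sync/trigger"
    return urllib.request.Request(url, method="POST")


def trigger_sync(base_url: str = "http://localhost:8000", timeout: float = 30.0) -> tuple[int, str]:
    """Send the request and return (HTTP status, response body as text)."""
    request = build_sync_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

If your instance protects the API, the request would need whatever credentials the configured auth provider expects.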
## Development Setup

```bash
# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v
```

## Project Structure

```
├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator: ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)
```

## Configuration

| File | Purpose |
|------|---------|
| `config/instance.yaml` | Instance-specific settings: branding, data source type, auth provider, Google domain |
| `.env` | Secrets and environment variables; never committed |
| `table_registry` table in `system.duckdb` | Table definitions managed via `POST /api/admin/tables/{id}` or the web UI |

Copy the example to get started:

```bash
cp config/instance.yaml.example config/instance.yaml
```

See `config/instance.yaml.example` for all available options.

## Documentation

- [Deployment Guide](docs/DEPLOYMENT.md): server provisioning, Docker, environment setup

## Contributing

1. Fork the repository and create a feature branch.
2. Run `pytest tests/ -v` to verify all tests pass before opening a pull request.
3. Keep commits focused and messages concise.
4. Open a pull request against `main` with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

## License

This project is licensed under the [MIT License](LICENSE).