Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.
ZdenekSrotyr 675a29c1c7 fix: DuckDB connection pool — shared connection avoids lock conflicts
Fixes #9 — background sync tasks could not access system.duckdb
because FastAPI held an exclusive lock. Now uses single shared
connection per DATA_DIR with cursor() for thread safety.
2026-03-31 13:01:04 +02:00
.github/workflows chore: exclude CI workflow from push (needs workflow scope) 2026-03-27 17:41:27 +01:00
app feat: access request UI — catalog badges, request modal, admin approval page 2026-03-31 12:45:29 +02:00
auth Add auth.disabled_providers config to skip auth providers 2026-03-11 12:54:23 +01:00
cli feat: CLI admin commands — register-table, discover-and-register, list-tables 2026-03-31 12:55:03 +02:00
config feat: add dataset permissions, script execution, Kamal config, CI/CD 2026-03-27 15:40:11 +01:00
connectors fix: legacy extractor constructs full Keboola table ID from bucket+source_table 2026-03-31 12:06:38 +02:00
dev_docs Update paths in docs and sudoers after services/ extraction 2026-03-09 13:02:13 +01:00
docs feat: implement data access control — table-level permissions 2026-03-31 12:33:31 +02:00
examples/notifications
infra fix: update Terraform for extract.duckdb architecture 2026-03-31 09:49:32 +02:00
scripts refactor: delete old server infra — 4,200 lines removed 2026-03-31 08:06:41 +02:00
services refactor: delete old server infra — 4,200 lines removed 2026-03-31 08:06:41 +02:00
src fix: DuckDB connection pool — shared connection avoids lock conflicts 2026-03-31 13:01:04 +02:00
tests feat: implement data access control — table-level permissions 2026-03-31 12:33:31 +02:00
webapp feat: add centralized RBAC module — replace Linux group auth 2026-03-31 08:04:35 +02:00
.dockerignore feat: add Docker, CLI tool, scheduler, and agent skills 2026-03-27 15:30:03 +01:00
.gitignore chore: exclude CI workflow from push (needs workflow scope) 2026-03-27 17:41:27 +01:00
ARCHITECTURE.md Update docs for modular architecture (auth/, services/, scripts/) 2026-03-09 13:11:40 +01:00
CLAUDE.md docs: rewrite CLAUDE.md for extract.duckdb architecture 2026-03-31 07:52:44 +02:00
docker-compose.test.yml feat: add SEED_ADMIN_EMAIL for Docker test environments 2026-03-31 09:48:12 +02:00
docker-compose.yml docs: rewrite CLAUDE.md for extract.duckdb architecture 2026-03-31 07:52:44 +02:00
Dockerfile feat: complete web UI + auth providers + template compatibility 2026-03-27 17:34:39 +01:00
LICENSE OSS cleanup: remove internal references, harden deployment, add config env interpolation 2026-03-09 07:59:57 +01:00
llms.txt Restructure docs for OSS readability 2026-03-09 10:42:45 +01:00
Makefile Fix sync_schedule validation to accept multi-time daily format 2026-03-17 13:21:14 +01:00
pyproject.toml feat: add Docker, CLI tool, scheduler, and agent skills 2026-03-27 15:30:03 +01:00
pytest.ini feat: add E2E test suite — API, extractor, Docker 2026-03-31 08:18:54 +02:00
README.md Update docs for modular architecture (auth/, services/, scripts/) 2026-03-09 13:11:40 +01:00
requirements.txt feat: add FastAPI server with auth, RBAC, and all API endpoints 2026-03-27 15:19:18 +01:00

AI Data Analyst

A data distribution platform for AI analytical systems. It pulls data from configured sources, converts it to Parquet format, and distributes it to analysts who query it locally using Claude Code and DuckDB.

How It Works

flowchart TB
    subgraph Sources["Data Sources"]
        A[(Keboola)]
        B[(CSV Files)]
        C[(BigQuery / Snowflake)]
        style C stroke-dasharray: 5 5
    end

    subgraph Broker["Data Broker Server"]
        D[Source Adapter]
        E[Parquet Converter]
        D --> E
    end

    subgraph Analyst["Analyst Machine"]
        F[Parquet Files]
        G[(DuckDB)]
        H((Claude Code))
        F --> G
        G --> H
    end

    A --> D
    B --> D
    C -.->|planned| D
    E -->|rsync over SSH| F

  1. The server fetches data from a configured source using the appropriate adapter.
  2. Raw data is converted to typed, columnar Parquet files.
  3. Analysts sync Parquet files to their machines over SSH (rsync).
  4. Claude Code queries the local DuckDB database and returns results with insights.
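
Step 3 above can be sketched as a small helper that builds the rsync-over-SSH command an analyst machine might run. This is only an illustration: the user, host, remote directory, key path, and the choice of rsync flags are hypothetical placeholders, and the project's own sync script may work differently.

```python
import shlex


def build_rsync_command(user, host, remote_dir, local_dir, key_path=None):
    """Return an argv list that pulls Parquet files from the server over SSH.

    All argument values are illustrative assumptions, not project defaults.
    """
    ssh = ["ssh"]
    if key_path:
        ssh += ["-i", key_path]  # use a specific private key
    return [
        "rsync",
        "-az",                   # archive mode, compress in transit
        "--include=*/",          # descend into subdirectories
        "--include=*.parquet",   # copy Parquet files
        "--exclude=*",           # skip everything else
        "-e", shlex.join(ssh),   # run the transfer over SSH
        f"{user}@{host}:{remote_dir.rstrip('/')}/",
        f"{local_dir.rstrip('/')}/",
    ]


cmd = build_rsync_command(
    "analyst",                       # hypothetical user
    "broker.example.com",            # hypothetical host
    "/data/parquet",                 # hypothetical remote directory
    "./data",
    key_path="/home/analyst/.ssh/id_ed25519",
)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument intact if it is later passed to `subprocess.run` without a shell.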

Features

  • Pluggable data sources -- connector interface supporting Keboola out of the box, CSV import, and extensible to BigQuery, Snowflake, and others.
  • Pluggable authentication -- auto-discovered auth providers (Google OAuth, email/password, desktop JWT, or custom).
  • Automatic Parquet conversion -- source data is converted to typed, partitioned Parquet files for efficient local querying.
  • SSH-based distribution -- analysts sync data securely via rsync; no cloud credentials leave the server.
  • Claude Code as analyst interface -- natural language queries against DuckDB, powered by Claude.
  • Claude Code as installer -- the CLAUDE.md file guides Claude Code through automated project setup for new analysts.
  • Self-service webapp -- web UI for user onboarding, SSH key management, sync settings, and data catalog browsing.
  • Corporate Memory -- shared knowledge base that aggregates analyst insights and distributes approved rules back to the team.
  • Configurable per-instance -- a single config/instance.yaml controls branding, authentication, data source, user mapping, and more.
  • Access control -- role-based permissions with standard analyst, privileged analyst, and admin tiers.
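
The three access tiers in the last bullet could be modelled as an ordered enumeration. This is a minimal sketch, not the project's permission model: the role names mirror the tiers listed above, while the `REQUIRED_ROLE` mapping, the dataset names, and the deny-by-default rule are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    """Access tiers, ordered from least to most privileged."""
    ANALYST = 1
    PRIVILEGED_ANALYST = 2
    ADMIN = 3


# Hypothetical mapping of dataset name -> minimum role needed to read it.
REQUIRED_ROLE = {
    "sales_summary": Role.ANALYST,
    "customer_pii": Role.PRIVILEGED_ANALYST,
    "audit_log": Role.ADMIN,
}


def can_read(role: Role, dataset: str) -> bool:
    """Deny by default: datasets without an entry are readable by admins only."""
    return role >= REQUIRED_ROLE.get(dataset, Role.ADMIN)


print(can_read(Role.ANALYST, "sales_summary"))
print(can_read(Role.ANALYST, "customer_pii"))
```

Using an ordered enum means a higher tier automatically inherits everything a lower tier can read, at the cost of not expressing per-user exceptions.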

Quick Start

See docs/QUICKSTART.md for full setup instructions.

The short version:

# 1. Clone the repository
git clone https://github.com/your-org/ai-data-analyst.git
cd ai-data-analyst

# 2. Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/data_description.md.example config/data_description.md
# Edit both files for your environment

# 3. Deploy the server
# See docs/DEPLOYMENT.md for detailed server setup

# 4. Analysts connect via the webapp and sync data
bash scripts/sync_data.sh

Project Structure

ai-data-analyst/
├── config/                        # Instance configuration
│   ├── instance.yaml.example      # Main config template (copy to instance.yaml)
│   └── data_description.md.example  # Data schema template
│
├── src/                           # Core data sync engine (vendor-neutral)
│   ├── data_sync.py              # Orchestrates data pull + DataSource ABC
│   ├── parquet_manager.py        # CSV to Parquet conversion
│   ├── config.py                 # Configuration loader
│   └── profiler.py               # Data profiling for catalog
│
├── connectors/                    # Data source connectors (pluggable)
│   ├── keboola/                   # Keboola Storage connector
│   │   ├── adapter.py            # KeboolaDataSource (implements DataSource)
│   │   └── client.py             # Low-level Keboola API client
│   └── jira/                      # Jira webhook connector
│
├── auth/                          # Authentication providers (pluggable)
│   ├── google/                    # Google OAuth provider
│   ├── password/                  # Email/password provider
│   └── desktop/                   # Desktop JWT provider (API-only)
│
├── services/                      # Standalone services (own systemd units)
│   ├── telegram_bot/              # Telegram notification bot
│   ├── ws_gateway/                # WebSocket notification gateway
│   ├── corporate_memory/          # AI knowledge aggregation
│   └── session_collector/         # Claude Code session collector
│
├── webapp/                        # Flask web application
│   └── ...                        # User onboarding, settings, catalog
│
├── server/                        # Deployment infrastructure only
│   ├── deploy.sh                  # Deployment script (auto-discovers services)
│   └── ...                        # Sudoers, nginx, setup scripts
│
├── scripts/                       # Helper scripts
│   ├── sync_data.sh              # Sync data from server
│   ├── setup_views.sh            # Initialize DuckDB views
│   └── dev_run.py                # Dev server with auth bypass
│
├── docs/                          # User-facing documentation
├── dev_docs/                      # Developer and operator documentation
├── tests/                         # Test suite
├── requirements.txt               # Python dependencies
├── CLAUDE.md                      # Instructions for Claude Code
└── README.md                      # This file

Supported Data Sources

| Adapter | Status | Description |
|---|---|---|
| Keboola Storage | Available | Pulls tables via the Keboola Storage API |
| CSV | Planned | Imports local or mounted CSV files |
| BigQuery | Planned | Google BigQuery adapter |
| Snowflake | Planned | Snowflake adapter |

Adding a new data source means creating a connector module in connectors/ that implements the DataSource interface from src/data_sync.py, and setting data_source.type in config/instance.yaml. See connectors/keboola/ for a reference implementation.
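
The shape of such a connector might look like the sketch below. The README does not show the real `DataSource` interface, so the abstract class here is a stand-in and the method names `list_tables` and `fetch_table` are assumptions; check src/data_sync.py for the actual signatures before implementing a connector.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataSource(ABC):
    """Stand-in for the real ABC in src/data_sync.py (method names assumed)."""

    @abstractmethod
    def list_tables(self) -> List[str]:
        """Return the identifiers of all tables this source can provide."""

    @abstractmethod
    def fetch_table(self, table_id: str) -> Iterator[Dict[str, str]]:
        """Yield the rows of one table as column-name -> value mappings."""


class InMemoryCsvDataSource(DataSource):
    """Hypothetical connector that serves rows from CSV text held in memory."""

    def __init__(self, tables: Dict[str, str]):
        self._tables = tables  # table_id -> raw CSV text

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def fetch_table(self, table_id: str) -> Iterator[Dict[str, str]]:
        reader = csv.DictReader(io.StringIO(self._tables[table_id]))
        yield from reader


source = InMemoryCsvDataSource({"orders": "id,amount\n1,9.50\n2,20.00\n"})
rows = list(source.fetch_table("orders"))
print(source.list_tables(), rows)
```

Because the base class is abstract, a connector that forgets to implement one of the methods fails at instantiation rather than in the middle of a sync run.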

Using with Claude Code

Once data is synced, open Claude Code in the project directory and ask questions in natural language:

What are the top 10 customers by revenue this quarter?
Show me the trend in support ticket volume over the last 6 months.

Claude Code will connect to the local DuckDB database, write and execute SQL, and return results with analysis.
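
For the first question above, the generated SQL might resemble the aggregate query in this sketch. The `orders` table, its columns, and the sample rows are hypothetical. Python's built-in `sqlite3` module is used here purely as a stand-in so the example runs without extra packages; against the real local database the same query text would be run through DuckDB instead.

```python
import sqlite3

# sqlite3 stands in for DuckDB so the sketch needs only the standard library.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, revenue REAL)")  # hypothetical schema
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("globex", 75.5), ("acme", 30.0), ("initech", 200.0)],
)

# Total revenue per customer, highest first, capped at ten rows.
TOP_CUSTOMERS_SQL = """
    SELECT customer, SUM(revenue) AS total_revenue
    FROM orders
    GROUP BY customer
    ORDER BY total_revenue DESC
    LIMIT 10
"""
top_customers = conn.execute(TOP_CUSTOMERS_SQL).fetchall()
for customer, total in top_customers:
    print(customer, total)
```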

Documentation

License

This project is licensed under the MIT License.


Questions or issues? Open a GitHub issue.