AI-Cognitive-Leap/agnes-the-ai-analyst

Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Petr 26c4e0934d	2026-03-09 07:59:57 +01:00
OSS cleanup: remove internal references, harden deployment, add config env interpolation

Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
.github/workflows	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
config	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
dev_docs	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
dev_scripts	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
docs	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
examples/notifications	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
scripts	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
server	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
src	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
tests	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
webapp	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
.gitignore	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
CLAUDE.md	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
LICENSE	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
pytest.ini	Initial commit: OSS data distribution platform	2026-03-08 23:31:28 +01:00
README.md	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00
requirements.txt	OSS cleanup: remove internal references, harden deployment, add config env interpolation	2026-03-09 07:59:57 +01:00

README.md

AI Data Analyst

A data distribution platform for AI analytical systems. It pulls data from configured sources, converts it to Parquet format, and distributes it to analysts who query it locally using Claude Code and DuckDB.

How It Works

flowchart TB
    subgraph Sources["Data Sources"]
        A[(Keboola)]
        B[(CSV Files)]
        C[(BigQuery / Snowflake)]
        style B stroke-dasharray: 5 5
        style C stroke-dasharray: 5 5
    end

    subgraph Broker["Data Broker Server"]
        D[Source Adapter]
        E[Parquet Converter]
        D --> E
    end

    subgraph Analyst["Analyst Machine"]
        F[Parquet Files]
        G[(DuckDB)]
        H((Claude Code))
        F --> G
        G --> H
    end

    A --> D
    B -.->|planned| D
    C -.->|planned| D
    E -->|rsync over SSH| F

The server fetches data from a configured source using the appropriate adapter.
Raw data is converted to typed, columnar Parquet files.
Analysts sync Parquet files to their machines over SSH (rsync).
Claude Code queries the local DuckDB database and returns results with insights.
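
The sync step above can be sketched as a small Python helper that assembles an rsync-over-SSH command. This is an illustration only: the function name, user, host, and directory paths are hypothetical, and the project's own sync_data.sh may use different flags or locations.

```python
import shlex


def build_rsync_command(user: str, host: str, remote_dir: str, local_dir: str) -> list[str]:
    """Assemble an rsync command that mirrors a remote Parquet directory locally."""
    return [
        "rsync",
        "-az",        # archive mode, compress data during transfer
        "--delete",   # remove local files that no longer exist on the server
        "-e", "ssh",  # use SSH as the transport
        # Trailing slash on the source: copy the directory's contents, not the directory itself.
        f"{user}@{host}:{remote_dir.rstrip('/')}/",
        f"{local_dir.rstrip('/')}/",
    ]


if __name__ == "__main__":
    # Hypothetical user, host, and paths, for illustration only.
    cmd = build_rsync_command("analyst", "data-broker.example.com", "/data/parquet", "./data/parquet")
    print(shlex.join(cmd))
```

Returning the command as a list rather than a string keeps it safe to pass to subprocess.run without shell quoting problems.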

Features

Pluggable data sources -- adapter interface with Keboola supported out of the box; CSV, BigQuery, and Snowflake adapters are planned, and others can be added.
Automatic Parquet conversion -- source data is converted to typed, partitioned Parquet files for efficient local querying.
SSH-based distribution -- analysts sync data securely via rsync; no cloud credentials leave the server.
Claude Code as analyst interface -- natural language queries against DuckDB, powered by Claude.
Claude Code as installer -- the CLAUDE.md file guides Claude Code through automated project setup for new analysts.
Self-service webapp -- web UI for user onboarding, SSH key management, sync settings, and data catalog browsing.
Corporate Memory -- shared knowledge base that aggregates analyst insights and distributes approved rules back to the team.
Configurable per-instance -- a single config/instance.yaml controls branding, authentication, data source, user mapping, and more.
Access control -- role-based permissions with standard analyst, privileged analyst, and admin tiers.
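
For the per-instance configuration, secret values can be kept out of config/instance.yaml by referencing environment variables with `${ENV_VAR}` placeholders. The following is a minimal sketch of such interpolation, assuming the YAML has already been parsed into nested dicts and lists. The function name and the choice to raise on a missing variable are assumptions, not the actual loader code in this repository.

```python
import os
import re

# Matches ${NAME}, where NAME is a conventional environment variable identifier.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(value, env=None):
    """Recursively replace ${NAME} placeholders in strings, dicts, and lists."""
    env = os.environ if env is None else env

    if isinstance(value, str):
        def _lookup(match):
            name = match.group(1)
            if name not in env:
                # Assumption: fail loudly instead of silently leaving the placeholder.
                raise KeyError(f"environment variable {name} is not set")
            return env[name]

        return _ENV_PATTERN.sub(_lookup, value)

    if isinstance(value, dict):
        return {key: interpolate_env(item, env) for key, item in value.items()}

    if isinstance(value, list):
        return [interpolate_env(item, env) for item in value]

    # Numbers, booleans, and None pass through unchanged.
    return value


if __name__ == "__main__":
    # Hypothetical config keys, for illustration only.
    config = {"auth": {"client_secret": "${OAUTH_CLIENT_SECRET}"}, "port": 8080}
    print(interpolate_env(config, {"OAUTH_CLIENT_SECRET": "example-secret"}))
```

Passing `env` explicitly makes the function easy to test without touching the real process environment.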

Quick Start

See docs/QUICKSTART.md for full setup instructions.

The short version:

# 1. Clone the repository
git clone https://github.com/your-org/ai-data-analyst.git
cd ai-data-analyst

# 2. Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/data_description.md.example config/data_description.md
# Edit both files for your environment

# 3. Deploy the server
# See docs/DEPLOYMENT.md for detailed server setup

# 4. Analysts connect via the webapp and sync data
bash server/scripts/sync_data.sh

Project Structure

ai-data-analyst/
├── config/                        # Instance configuration
│   ├── instance.yaml.example      # Main config template (copy to instance.yaml)
│   └── data_description.md.example  # Data schema template
│
├── src/                           # Server-side Python code
│   ├── adapters/                  # Data source adapters
│   │   ├── base.py               # Adapter interface (ABC)
│   │   └── keboola_adapter.py    # Keboola Storage adapter
│   ├── data_sync.py              # Orchestrates data pull from sources
│   ├── parquet_manager.py        # CSV to Parquet conversion
│   ├── config.py                 # Configuration loader
│   └── profiler.py               # Data profiling for catalog
│
├── webapp/                        # Flask web application
│   └── ...                        # User onboarding, settings, catalog
│
├── server/                        # Deployment and server management
│   ├── deploy.sh                  # Deployment script
│   └── ...                        # Systemd units, sudoers, cron jobs
│
├── scripts/                       # Analyst-facing helper scripts
│   ├── sync_data.sh              # Sync data from server
│   └── setup_views.sh            # Initialize DuckDB views
│
├── docs/                          # User-facing documentation
│   ├── QUICKSTART.md             # Setup guide
│   └── data_description.md       # Table schemas (single source of truth)
│
├── dev_docs/                      # Developer and operator documentation
│   ├── server.md                 # Server administration
│   └── security.md               # Security model
│
├── tests/                         # Test suite
├── requirements.txt               # Python dependencies
├── CLAUDE.md                      # Instructions for Claude Code
└── README.md                      # This file

Supported Data Sources

Adapter	Status	Description
Keboola Storage	Available	Pulls tables via the Keboola Storage API
CSV	Planned	Imports local or mounted CSV files
BigQuery	Planned	Google BigQuery adapter
Snowflake	Planned	Snowflake adapter

Adding a new adapter means implementing the DataSource interface in src/adapters/ and setting data_source.type in config/instance.yaml. See src/adapters/base.py for the contract.
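
As a rough illustration of what an adapter might look like, the sketch below defines an abstract interface and a toy in-memory implementation, selected by a type string the way data_source.type would select one. The method names (`list_tables`, `fetch_rows`), the `InMemorySource` class, and the `ADAPTERS` registry are all hypothetical; the real contract is whatever src/adapters/base.py defines.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataSource(ABC):
    """Hypothetical adapter contract; the real one lives in src/adapters/base.py."""

    @abstractmethod
    def list_tables(self) -> List[str]:
        """Return the names of the tables this source can provide."""

    @abstractmethod
    def fetch_rows(self, table: str) -> Iterator[Dict[str, object]]:
        """Yield the rows of one table as dictionaries."""


class InMemorySource(DataSource):
    """Toy adapter backed by a dict, standing in for a real source."""

    def __init__(self, tables: Dict[str, List[Dict[str, object]]]):
        self._tables = tables

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def fetch_rows(self, table: str) -> Iterator[Dict[str, object]]:
        yield from self._tables[table]


# Hypothetical registry mapping a config type string to an adapter class.
ADAPTERS = {"in_memory": InMemorySource}


def create_source(source_type: str, **kwargs) -> DataSource:
    """Instantiate the adapter registered under source_type."""
    try:
        adapter_cls = ADAPTERS[source_type]
    except KeyError:
        raise ValueError(f"unknown data source type: {source_type!r}") from None
    return adapter_cls(**kwargs)


if __name__ == "__main__":
    source = create_source("in_memory", tables={"orders": [{"id": 1, "total": 9.5}]})
    for name in source.list_tables():
        print(name, list(source.fetch_rows(name)))
```

With this shape, supporting a new source would amount to adding one subclass and one registry entry, leaving the rest of the pipeline untouched.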

Using with Claude Code

Once data is synced, open Claude Code in the project directory and ask questions in natural language:

What are the top 10 customers by revenue this quarter?

Show me the trend in support ticket volume over the last 6 months.

Claude Code will connect to the local DuckDB database, write and execute SQL, and return results with analysis.
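
Before such questions can be answered, DuckDB needs views that point at the synced Parquet files. The sketch below generates one `CREATE OR REPLACE VIEW` statement per file using DuckDB's `read_parquet` table function. The one-view-per-file convention, the directory layout, and the function name are assumptions for illustration; setup_views.sh may work differently.

```python
from pathlib import Path


def view_statements(parquet_dir: str) -> list[str]:
    """Build one DuckDB view definition per *.parquet file in parquet_dir."""
    statements = []
    for path in sorted(Path(parquet_dir).glob("*.parquet")):
        # Escape quotes so unusual file names cannot break the SQL.
        view_name = path.stem.replace('"', '""')
        file_path = path.as_posix().replace("'", "''")
        statements.append(
            f'CREATE OR REPLACE VIEW "{view_name}" AS '
            f"SELECT * FROM read_parquet('{file_path}');"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical local directory holding the synced Parquet files.
    for statement in view_statements("data/parquet"):
        print(statement)
```

The printed statements can be piped into the DuckDB CLI or executed through a DuckDB connection; generating them as plain strings keeps this sketch free of third-party dependencies.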

Documentation

Quick Start -- End-to-end setup for new deployments
Configuration -- All configuration options explained
Deployment -- Server provisioning and deployment guide
Data Sources -- How to configure and extend data source adapters
Server Administration -- Day-to-day server operations
Security -- Access control and security model

License

This project is licensed under the MIT License.

Questions or issues? Open a GitHub issue.