Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.
AI Data Analyst
Open-source data distribution platform for AI analytical systems. Syncs data from various sources, converts to Parquet, and distributes to analysts who use Claude Code for local analysis.
First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
Step 1: Gather Information
Ask the user for:
- Company domain (e.g., "acme.com") - used for Google OAuth
- Data source type: keboola / csv / bigquery (future)
- Instance name (e.g., "Acme Data Analyst")
Step 2: Generate Configuration
- Copy config/instance.yaml.example to config/instance.yaml
- Fill in values from Step 1
- If Keboola: ask for Storage API token, stack URL, project ID
- Create .env from config/.env.template
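The two copy steps can be scripted so that re-running setup never overwrites an existing config. A minimal sketch, assuming the template paths named above; `copy_if_missing` and `bootstrap_config` are hypothetical helper names, not part of the repository:

```python
import shutil
from pathlib import Path


def copy_if_missing(template: Path, target: Path) -> bool:
    """Copy template to target unless target already exists.

    Returns True when a new file was created.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


def bootstrap_config(root: Path) -> list[Path]:
    """Create instance.yaml and .env from their templates under root.

    Returns the list of files that were newly created.
    """
    pairs = [
        (root / "config" / "instance.yaml.example", root / "config" / "instance.yaml"),
        (root / "config" / ".env.template", root / ".env"),
    ]
    return [dst for src, dst in pairs if copy_if_missing(src, dst)]
```

Calling `bootstrap_config(Path("."))` from the project root would create both files on a fresh checkout and return an empty list on later runs. Values from Step 1 still have to be filled in afterwards.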
Step 3: Generate Data Description
- If Keboola adapter: use the API to fetch table metadata and generate docs/data_description.md
- If CSV: ask user to describe their data files
- The file defines tables, sync strategies, and schema
Step 4: Server Setup (if deploying)
- Guide VM provisioning (or use existing server)
- Run server/setup.sh on the target VM
- Run server/webapp-setup.sh for the web portal
- Set up CI/CD from .github/workflows/deploy.yml.example
Project Structure
├── src/ # Core data sync engine
│ ├── adapters/ # Data source adapters (Keboola, CSV, etc.)
│ ├── config.py # Configuration from data_description.md
│ ├── data_sync.py # Sync orchestration
│ ├── parquet_manager.py # Parquet file management
│ └── profiler.py # Data profiling
├── webapp/ # Flask web portal (login, dashboard, API)
├── server/ # Server deployment (systemd, scripts)
├── scripts/ # Utility scripts (sync, DuckDB setup)
├── config/ # Configuration templates
│ ├── instance.yaml.example
│ └── data_description.md.example
├── docs/ # Documentation
└── tests/ # Test suite
Architecture
Data Source (Keboola / CSV / BigQuery)
│
▼
┌─────────────────────────────────┐
│ Data Broker Server │
│ ├── /data/src_data/parquet/ │ Converted data
│ ├── /data/docs/ │ Documentation
│ └── /data/scripts/ │ Helper scripts
└─────────────────────────────────┘
│ rsync (via ~/server/ symlinks)
▼
┌─────────────────────────────────┐
│ Analyst (local machine) │
│ ├── ./server/ (read-only) │ parquet, docs, scripts
│ └── ./user/ (workspace) │ duckdb, notifications
└─────────────────────────────────┘
Configuration
Instance-specific config is in config/instance.yaml. See config/instance.yaml.example for all options.
Environment variables go in .env (never committed to git).
Data schema is defined in docs/data_description.md (YAML blocks in markdown).
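To illustrate how YAML blocks embedded in a markdown file can be pulled out, here is a rough sketch. It assumes the blocks are fenced code blocks tagged yaml or yml; the actual parsing in src/config.py may differ, and `extract_yaml_blocks` is a hypothetical name. It only extracts the raw block text and does not parse the YAML itself.

```python
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Match a fenced block whose info string is "yaml" or "yml" (assumed convention).
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> list[str]:
    """Return the raw text of every fenced YAML block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown)]
```

Each returned string could then be handed to a YAML parser such as PyYAML's `yaml.safe_load`.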
Development
# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Run webapp locally
flask --app webapp.app run --debug
# Run tests
pytest tests/ -v
# Sync data
python -m src.data_sync
Data Source Adapters
The platform supports pluggable data sources via src/adapters/:
- Keboola (keboola): Syncs from Keboola Storage API
- CSV (csv): Import from local CSV files (planned)
- BigQuery (bigquery): Query from Google BigQuery (planned)
Configure in config/instance.yaml under data_source.type.
Server Management
# Add analyst user
sudo add-analyst username "ssh-rsa AAAA..."
# Add privileged analyst
sudo add-analyst username "ssh-rsa AAAA..." --private
# List analysts
list-analysts
# Server monitoring
uptime && free -h && df -h /data
Returning Users
When reopening the project in Claude Code:
- Sync latest data: bash server/scripts/sync_data.sh
- Verify DuckDB: ls -lh user/duckdb/analytics.duckdb
- Start analyzing with Claude Code
Key Implementation Details
Config Loading Chain
- config/loader.py loads instance.yaml (checks $CONFIG_DIR, then ./config/)
- webapp/config.py calls _load_instance_config() at module level
- _get(config, *keys, default="") traverses nested dicts safely
- inject_config() context processor exposes Config to all Jinja templates
- Templates use {{ config.INSTANCE_NAME }}, {{ config.INSTANCE_SUBTITLE }}, etc.
Adapter Pattern
- Factory: src/adapters/__init__.py -> create_data_source(adapter_type, **kwargs)
- ABC: DataSource class in src/data_sync.py (lines 149-172)
- Keboola: src/adapters/keboola_adapter.py -> thin facade wrapping LocalKeboolaSource
- Core Keboola logic: src/keboola_client.py (788 lines, Keboola Storage API wrapper)
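The general shape of this pattern, an abstract base class plus a factory keyed by adapter type, might look like the sketch below. Only the names `DataSource` and `create_data_source(adapter_type, **kwargs)` come from this document; the `list_tables` method, the `InMemorySource` adapter, and the `_ADAPTERS` registry are hypothetical stand-ins for illustration.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Interface every adapter implements (method set here is hypothetical)."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        """Return the names of the tables this source can sync."""


class InMemorySource(DataSource):
    """Toy adapter used only to make this sketch runnable."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self._tables)


# Hypothetical registry mapping data_source.type values to adapter classes.
_ADAPTERS: dict[str, type[DataSource]] = {"memory": InMemorySource}


def create_data_source(adapter_type: str, **kwargs) -> DataSource:
    """Instantiate the adapter registered under adapter_type."""
    try:
        adapter_cls = _ADAPTERS[adapter_type]
    except KeyError:
        known = ", ".join(sorted(_ADAPTERS))
        raise ValueError(
            f"Unknown adapter {adapter_type!r}; known adapters: {known}"
        ) from None
    return adapter_cls(**kwargs)
```

Keeping the registry in one place means a new source only needs a `DataSource` subclass and one registry entry; callers keep using the value from data_source.type.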
Server Patterns
- Atomic JSON writes: tempfile.mkstemp() + os.fchmod(fd, 0o660) + os.replace()
- User home writes: sudo /usr/bin/install -o {user} -g {user} pattern
- Staging dir: /tmp/data_analyst_staging (deploy.sh creates it with setgid)
- Dev docs: dev_docs/server.md documents all established patterns
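The atomic JSON write pattern combines the three calls named above. A minimal sketch, assuming a Unix host (`os.fchmod` is not available on Windows); `write_json_atomic` is a hypothetical name, and the `fsync` call is an addition for durability, not something this document specifies:

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data, mode: int = 0o660) -> None:
    """Write data as JSON so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            # mkstemp creates the file as 0o600; widen it before publishing.
            os.fchmod(fh.fileno(), mode)
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # Atomically swap the finished file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Creating the temp file in the target's own directory matters: `os.replace` is only atomic within a single filesystem, so a temp file under /tmp could fail or fall back to a non-atomic copy when the destination is on another mount.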
Files NOT to modify (stable infrastructure)
- src/parquet_manager.py - Parquet conversion engine
- src/jira_file_lock.py - Advisory file locking
- src/incremental_jira_transform.py - Jira monthly Parquet transform
- server/ws_gateway/ - WebSocket notification gateway
Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs