AI Data Analyst
Open-source data distribution platform for AI analytical systems. Syncs data from various sources, converts it to Parquet, and distributes it to analysts who use Claude Code for local analysis.
First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
Step 1: Gather Information
Ask the user for:
- Company domain (e.g., "acme.com") - used for Google OAuth
- Data source type: keboola / csv / bigquery (future)
- Instance name (e.g., "Acme Data Analyst")
Step 2: Generate Configuration
- Copy config/instance.yaml.example to config/instance.yaml
- Fill in values from Step 1
- If Keboola: ask for Storage API token, stack URL, project ID
- Create .env from config/.env.template
Step 3: Generate Data Description
- If Keboola adapter: use the API to fetch table metadata and generate docs/data_description.md
- If CSV: ask user to describe their data files
- The file defines tables, sync strategies, and schema
Step 4: Server Setup (if deploying)
- Guide VM provisioning (or use existing server)
- Run server/setup.sh on the target VM
- Run server/webapp-setup.sh for the web portal
- Set up CI/CD from .github/workflows/deploy.yml.example
Project Structure
├── src/ # Core data sync engine
│ ├── config.py # Configuration from data_description.md
│ ├── data_sync.py # Sync orchestration + DataSource ABC
│ ├── parquet_manager.py # Parquet file management
│ └── profiler.py # Data profiling
├── connectors/ # Data source connectors
│ ├── keboola/ # Keboola Storage connector
│ └── jira/ # Jira webhook connector
├── webapp/ # Flask web portal (login, dashboard, API)
├── server/ # Server deployment (systemd, scripts)
├── scripts/ # Utility scripts (sync, DuckDB setup)
├── config/ # Configuration templates
│ ├── instance.yaml.example
│ └── data_description.md.example
├── docs/ # Documentation
└── tests/ # Test suite
Architecture
Data Source (Keboola / CSV / BigQuery)
│
▼
┌─────────────────────────────────┐
│ Data Broker Server │
│ ├── /data/src_data/parquet/ │ Converted data
│ ├── /data/docs/ │ Documentation
│ └── /data/scripts/ │ Helper scripts
└─────────────────────────────────┘
│ rsync (via ~/server/ symlinks)
▼
┌─────────────────────────────────┐
│ Analyst (local machine) │
│ ├── ./server/ (read-only) │ parquet, docs, scripts
│ └── ./user/ (workspace) │ duckdb, notifications
└─────────────────────────────────┘
Configuration
Instance-specific config is in config/instance.yaml. See config/instance.yaml.example for all options.
Environment variables go in .env (never committed to git).
Data schema is defined in docs/data_description.md (YAML blocks in markdown).
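As a rough illustration of "YAML blocks in markdown", the sketch below pulls the raw text of each fenced YAML block out of a markdown string. It assumes the blocks are fenced with a yaml (or yml) info string, which this document does not confirm; the function name extract_yaml_blocks and the sample field names are hypothetical.

```python
import re

# Built indirectly so this example can itself live inside a fenced block.
FENCE = "`" * 3

# Assumption: schema blocks open with a "yaml"/"yml" fence and close with a bare fence.
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown_text: str) -> list[str]:
    """Return the raw text of every fenced YAML block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown_text)]


# Hypothetical sample; the real keys in docs/data_description.md may differ.
SAMPLE = (
    "# Tables\n\n"
    f"{FENCE}yaml\nname: orders\nsync_strategy: full\n{FENCE}\n\n"
    "Some prose between blocks.\n\n"
    f"{FENCE}yaml\nname: users\n{FENCE}\n"
)

if __name__ == "__main__":
    for block in extract_yaml_blocks(SAMPLE):
        print(block)
```

Each extracted string could then be handed to a YAML parser (for example PyYAML's yaml.safe_load, if that dependency is available) to obtain table definitions.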
Development
# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Run webapp locally
flask --app webapp.app run --debug
# Run tests
pytest tests/ -v
# Sync data
python -m src.data_sync
Data Source Adapters
The platform supports pluggable data sources via connectors/:
- Keboola (keboola): Syncs from Keboola Storage API (see connectors/keboola/)
- CSV (csv): Import from local CSV files (planned)
- BigQuery (bigquery): Query from Google BigQuery (planned)
Configure in config/instance.yaml under data_source.type.
Server Management
# Add analyst user
sudo add-analyst username "ssh-rsa AAAA..."
# Add privileged analyst
sudo add-analyst username "ssh-rsa AAAA..." --private
# List analysts
list-analysts
# Server monitoring
uptime && free -h && df -h /data
Returning Users
When reopening the project in Claude Code:
- Sync latest data: bash server/scripts/sync_data.sh
- Verify DuckDB: ls -lh user/duckdb/analytics.duckdb
- Start analyzing with Claude Code
Key Implementation Details
Config Loading Chain
- config/loader.py loads instance.yaml (checks $CONFIG_DIR, then ./config/)
- webapp/config.py calls _load_instance_config() at module level
- _get(config, *keys, default="") traverses nested dicts safely
- inject_config() context processor exposes Config to all Jinja templates
- Templates use {{ config.INSTANCE_NAME }}, {{ config.INSTANCE_SUBTITLE }}, etc.
Connector Pattern
- ABC: DataSource class in src/data_sync.py
- Registry: create_data_source() in src/data_sync.py auto-discovers connectors in connectors/
- Keboola: connectors/keboola/adapter.py -> KeboolaDataSource implementing DataSource
- Core Keboola logic: connectors/keboola/client.py (Keboola Storage API wrapper)
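To make the ABC-plus-registry shape concrete, here is a self-contained sketch. It is not the code in src/data_sync.py: the abstract methods list_tables and fetch_rows are assumed for illustration, InMemoryDataSource is a hypothetical connector, and an explicit dictionary stands in for the auto-discovery of connectors/ that the real create_data_source() performs.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Assumed interface; the real ABC may define different methods."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        """Return the names of the tables this source can provide."""

    @abstractmethod
    def fetch_rows(self, table: str) -> list[dict]:
        """Return all rows of one table as dictionaries."""


class InMemoryDataSource(DataSource):
    """Hypothetical connector backed by a plain dict, for demonstration only."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self._tables)

    def fetch_rows(self, table: str) -> list[dict]:
        return list(self._tables[table])


# Explicit registry; a stand-in for discovering connector packages at runtime.
_REGISTRY: dict[str, type[DataSource]] = {"memory": InMemoryDataSource}


def create_data_source(source_type: str, **kwargs) -> DataSource:
    """Instantiate the connector registered under source_type."""
    try:
        connector_cls = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unknown data source type: {source_type!r}") from None
    return connector_cls(**kwargs)


if __name__ == "__main__":
    source = create_data_source("memory", tables={"orders": [{"id": 1}]})
    print(source.list_tables())
```

With this shape, the sync engine depends only on DataSource, so a new connector needs to implement the abstract methods and be reachable by the registry.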
Server Patterns
- Atomic JSON writes: tempfile.mkstemp() + os.fchmod(fd, 0o660) + os.replace()
- User home writes: sudo /usr/bin/install -o {user} -g {user} pattern
- Staging dir: /tmp/data_analyst_staging (deploy.sh creates it with setgid)
- Dev docs: dev_docs/server.md documents all established patterns
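The atomic JSON write pattern listed above can be sketched like this. The helper name write_json_atomic is hypothetical, and the project's own implementation may differ in details such as error handling. The temp file is created in the target's directory so that os.replace() stays on one filesystem. Note that os.fchmod() is a Unix call (Windows support arrived only in recent Python versions).

```python
import json
import os
import tempfile


def write_json_atomic(path, payload, mode=0o660):
    """Write payload as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0o600; widen to group read/write.
            os.fchmod(handle.fileno(), mode)
            json.dump(payload, handle)
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file on any failure, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    write_json_atomic(os.path.join(demo_dir, "state.json"), {"ok": True})
    print(os.listdir(demo_dir))
```

Setting the mode on the temp file before the rename means the target never exists with the restrictive default permissions, which matters when several service users share the file through a common group.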
Files NOT to modify (stable infrastructure)
- src/parquet_manager.py - Parquet conversion engine
- connectors/jira/file_lock.py - Advisory file locking
- connectors/jira/incremental_transform.py - Jira monthly Parquet transform
- server/ws_gateway/ - WebSocket notification gateway
Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs