# AI Data Analyst Open-source data distribution platform for AI analytical systems. Syncs data from various sources, converts to Parquet, and distributes to analysts who use Claude Code for local analysis. ## First-Time Setup When a user opens this project for the first time, guide them through interactive setup: ### Step 1: Gather Information Ask the user for: 1. Company domain (e.g., "acme.com") - used for Google OAuth 2. Data source type: keboola / csv / bigquery (future) 3. Instance name (e.g., "Acme Data Analyst") ### Step 2: Generate Configuration 1. Copy `config/instance.yaml.example` to `config/instance.yaml` 2. Fill in values from Step 1 3. If Keboola: ask for Storage API token, stack URL, project ID 4. Create `.env` from `config/.env.template` ### Step 3: Generate Data Description 1. If Keboola adapter: use the API to fetch table metadata and generate `docs/data_description.md` 2. If CSV: ask user to describe their data files 3. The file defines tables, sync strategies, and schema ### Step 4: Server Setup (if deploying) 1. Guide VM provisioning (or use existing server) 2. Run `server/setup.sh` on the target VM 3. Run `server/webapp-setup.sh` for the web portal 4. Set up CI/CD from `.github/workflows/deploy.yml.example` ## Project Structure ``` ├── src/ # Core data sync engine (vendor-neutral) │ ├── config.py # Configuration from data_description.md │ ├── data_sync.py # Sync orchestration + DataSource ABC │ ├── parquet_manager.py # Parquet file management │ └── profiler.py # Data profiling ├── connectors/ # Data source connectors (pluggable) │ ├── keboola/ # Keboola Storage connector │ └── jira/ # Jira webhook connector ├── auth/ # Authentication providers (pluggable) │ ├── google/ # Google OAuth provider │ ├── email/ # Email magic link provider │ └── desktop/ # Desktop JWT provider (API-only) ├── services/ # Standalone services (own systemd units) │ ├── telegram_bot/ # Telegram notification bot │ ├── ws_gateway/ # WebSocket notification gateway │ ├── corporate_memory/ # AI knowledge aggregation │ └── session_collector/ # Claude Code session collector ├── webapp/ # Flask web portal (login, dashboard, API) ├── server/ # Deployment infrastructure only ├── scripts/ # Utility scripts (sync, DuckDB setup, dev) ├── config/ # Configuration templates │ ├── instance.yaml.example │ └── data_description.md.example ├── docs/ # Documentation │ └── metrics/ # Business metric YAML definitions │ ├── revenue/ # Revenue metrics (total_revenue, AOV, etc.) │ ├── customers/ # Customer metrics (count, repeat rate) │ ├── marketing/ # Marketing metrics (ROI, CPA, conversion) │ └── support/ # Support metrics (resolution time, CSAT) └── tests/ # Test suite ``` ## Architecture ``` Data Source (Keboola / CSV / BigQuery) │ ▼ ┌─────────────────────────────────┐ │ Data Broker Server │ │ ├── /data/src_data/parquet/ │ Converted data │ ├── /data/docs/ │ Documentation │ └── /data/scripts/ │ Helper scripts └─────────────────────────────────┘ │ rsync (via ~/server/ symlinks) ▼ ┌─────────────────────────────────┐ │ Analyst (local machine) │ │ ├── ./server/ (read-only) │ parquet, docs, scripts │ └── ./user/ (workspace) │ duckdb, notifications └─────────────────────────────────┘ ``` ## Configuration Instance-specific config is in `config/instance.yaml`. See `config/instance.yaml.example` for all options. Environment variables go in `.env` (never committed to git). Data schema is defined in `docs/data_description.md` (YAML blocks in markdown). ### Dual-Repo Deployment Production uses two repos on the server: - **OSS repo** (`/opt/data-analyst/repo/`): application code, no secrets or config - **Instance repo** (`/opt/data-analyst/instance/`): private config, secrets template, data schema Symlinks bridge them: `repo/config/instance.yaml -> instance/config/instance.yaml`. Each repo has its own SSH deploy key (github-oss / github-cfg aliases). See `docs/auto-install.md` for full setup guide. ## Development ```bash # Setup python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt # Run webapp locally flask --app webapp.app run --debug # Run tests pytest tests/ -v # Sync data python -m src.data_sync ``` ## Extensibility ### Data Sources Pluggable data source connectors in `connectors/`: - **Keboola** (`keboola`): Syncs from Keboola Storage API - **CSV** (`csv`): Import from local CSV files (planned) - New connector = `connectors//adapter.py` implementing `DataSource` ### Authentication Pluggable auth providers in `auth/`: - **Google** (`google`): OAuth via Google - **Email** (`email`): Email magic link (itsdangerous token, no password needed) - **Password** (`password`): Username/password authentication - **Desktop** (`desktop`): JWT for desktop app API - New provider = `auth//provider.py` implementing `AuthProvider` Configure data source in `config/instance.yaml` under `data_source.type`. ## Server Management ```bash # Add analyst user sudo add-analyst username "ssh-rsa AAAA..." # Add privileged analyst sudo add-analyst username "ssh-rsa AAAA..." --private # List analysts list-analysts # Server monitoring uptime && free -h && df -h /data ``` ## Returning Users When reopening the project in Claude Code: 1. Sync latest data: `rsync -avz --no-perms --no-group data-analyst:server/parquet/ ./server/parquet/` 2. Verify DuckDB: `ls -lh user/duckdb/analytics.duckdb` 3. Start analyzing with Claude Code ## Key Implementation Details ### Config Loading Chain 1. `config/loader.py` loads `instance.yaml` (checks `$CONFIG_DIR`, then `./config/`) 2. `webapp/config.py` calls `_load_instance_config()` at module level 3. `_get(config, *keys, default="")` traverses nested dicts safely 4. `inject_config()` context processor exposes `Config` to all Jinja templates 5. Templates use `{{ config.INSTANCE_NAME }}`, `{{ config.INSTANCE_SUBTITLE }}`, etc. ### Connector Pattern - ABC: `DataSource` class in `src/data_sync.py` - Registry: `create_data_source()` in `src/data_sync.py` auto-discovers connectors in `connectors/` - Keboola: `connectors/keboola/adapter.py` -> `KeboolaDataSource` implementing `DataSource` - Core Keboola logic: `connectors/keboola/client.py` (Keboola Storage API wrapper) ### Auth Provider Pattern - ABC: `AuthProvider` class in `auth/__init__.py` - Discovery: `discover_providers()` scans `auth/*/provider.py` - Providers: google, email, desktop (each exports `provider` instance) - Email provider: uses `itsdangerous.URLSafeTimedSerializer` for magic link tokens - Multi-domain: `auth.allowed_domain` in instance.yaml supports comma-separated domains - Session contract: all providers set `session["user"] = {"email", "name", "picture"}` ### Service Pattern - Self-contained modules in `services/` with `__main__.py` for `python -m services.` - Systemd files in `services//systemd/`, auto-discovered by `deploy.sh` - Services: telegram_bot, ws_gateway, corporate_memory, session_collector ### Business Metrics Pattern - YAML definitions in `docs/metrics/{category}/{metric}.yml` (list with one dict) - `webapp/utils/metric_parser.py` - parses YAML, structures for modal UI, auto-discovers `sql_*` fields - `webapp/app.py` `_load_metrics_data()` - scans metrics dir, groups by category, returns ordered list - Catalog template renders dynamically via Jinja loop (no hardcoded metrics) - Profiler links metrics to tables via `used_by_metrics` in `profiles.json` - Production: metrics in instance repo deployed to `/data/docs/metrics/` - Sample/dev: OSS repo `docs/metrics/` (10 e-commerce metrics) ### Table Registry Pattern - `src/table_registry.py` - central CRUD for registered tables with atomic JSON persistence - Audit logging for register/unregister operations - Generates `data_description.md` from registry state ### Server Patterns - Atomic JSON writes: `tempfile.mkstemp()` + `os.fchmod(fd, 0o660)` + `os.replace()` - User home writes: `sudo /usr/bin/install -o {user} -g {user}` pattern - Staging dir: `/tmp/data_analyst_staging` (deploy.sh creates it with setgid) - Dev docs: `dev_docs/server.md` documents all established patterns ### Files NOT to modify (stable infrastructure) - `src/parquet_manager.py` - Parquet conversion engine - `connectors/jira/file_lock.py` - Advisory file locking - `connectors/jira/incremental_transform.py` - Jira monthly Parquet transform - `services/ws_gateway/` - WebSocket notification gateway ## Git Commits & Pull Requests - Keep commit messages clean and concise - Do not include AI attribution in commits or PRs