# AI Data Analyst

Open-source data distribution platform for AI analytical systems. Syncs data from various sources, converts to Parquet, and distributes to analysts who use Claude Code for local analysis.

## First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

### Step 1: Gather Information
Ask the user for:
1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / csv / bigquery (future)
3. Instance name (e.g., "Acme Data Analyst")

### Step 2: Generate Configuration
1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`

### Step 3: Generate Data Description
1. If Keboola adapter: use the API to fetch table metadata and generate `docs/data_description.md`
2. If CSV: ask user to describe their data files
3. The file defines tables, sync strategies, and schema

### Step 4: Server Setup (if deploying)
1. Guide VM provisioning (or use existing server)
2. Run `server/setup.sh` on the target VM
3. Run `server/webapp-setup.sh` for the web portal
4. Set up CI/CD from `.github/workflows/deploy.yml.example`

## Project Structure

```
├── src/                    # Core data sync engine (vendor-neutral)
│   ├── config.py           # Configuration from data_description.md
│   ├── data_sync.py        # Sync orchestration + DataSource ABC
│   ├── parquet_manager.py  # Parquet file management
│   └── profiler.py         # Data profiling
├── connectors/             # Data source connectors (pluggable)
│   ├── keboola/            # Keboola Storage connector
│   └── jira/               # Jira webhook connector
├── auth/                   # Authentication providers (pluggable)
│   ├── google/             # Google OAuth provider
│   ├── email/              # Email magic link provider
│   └── desktop/            # Desktop JWT provider (API-only)
├── services/               # Standalone services (own systemd units)
│   ├── telegram_bot/       # Telegram notification bot
│   ├── ws_gateway/         # WebSocket notification gateway
│   ├── corporate_memory/   # AI knowledge aggregation
│   └── session_collector/  # Claude Code session collector
├── webapp/                 # Flask web portal (login, dashboard, API)
├── server/                 # Deployment infrastructure only
├── scripts/                # Utility scripts (sync, DuckDB setup, dev)
├── config/                 # Configuration templates
│   ├── instance.yaml.example
│   └── data_description.md.example
├── docs/                   # Documentation
│   └── metrics/            # Business metric YAML definitions
│       ├── revenue/        # Revenue metrics (total_revenue, AOV, etc.)
│       ├── customers/      # Customer metrics (count, repeat rate)
│       ├── marketing/      # Marketing metrics (ROI, CPA, conversion)
│       └── support/        # Support metrics (resolution time, CSAT)
└── tests/                  # Test suite
```

## Architecture

```
Data Source (Keboola / CSV / BigQuery)
        │
        ▼
┌─────────────────────────────────┐
│  Data Broker Server             │
│  ├── /data/src_data/parquet/    │  Converted data
│  ├── /data/docs/                │  Documentation
│  └── /data/scripts/             │  Helper scripts
└─────────────────────────────────┘
        │  rsync (via ~/server/ symlinks)
        ▼
┌─────────────────────────────────┐
│  Analyst (local machine)        │
│  ├── ./server/ (read-only)      │  parquet, docs, scripts
│  └── ./user/ (workspace)        │  duckdb, notifications
└─────────────────────────────────┘
```

## Configuration

Instance-specific config is in `config/instance.yaml`. See `config/instance.yaml.example` for all options.

Environment variables go in `.env` (never committed to git).

Data schema is defined in `docs/data_description.md` (YAML blocks in markdown).

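As an illustration of the "YAML blocks in markdown" idea, the sketch below pulls the raw text of each YAML fenced block out of a Markdown string using only the standard library. It is not the project's parser: the function name `extract_yaml_blocks`, the assumed fence labels (`yaml` / `yml`), and the sample keys are all hypothetical, and parsing the extracted text into Python objects would need a YAML library on top of this.

```python
import re

FENCE = "`" * 3  # built indirectly so this example nests cleanly inside Markdown

# Assumption: schema blocks are fenced and labelled "yaml" or "yml".
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown_text: str) -> list[str]:
    """Return the raw text of every yaml/yml fenced block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown_text)]


# Hypothetical document; the keys below are illustrative, not the real schema.
sample = "\n".join([
    "# Data Description",
    "",
    "## orders",
    FENCE + "yaml",
    "table: orders",
    "sync_strategy: incremental",
    FENCE,
    "",
    "Free-form notes outside the fences are ignored.",
])

blocks = extract_yaml_blocks(sample)
print(blocks[0])
```

Keeping prose and schema in one file means the same document can be read by analysts and by the sync engine; only the fenced blocks are treated as machine-readable.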
### Dual-Repo Deployment
Production uses two repos on the server:
- **OSS repo** (`/opt/data-analyst/repo/`): application code, no secrets or config
- **Instance repo** (`/opt/data-analyst/instance/`): private config, secrets template, data schema

Symlinks bridge them: `repo/config/instance.yaml -> instance/config/instance.yaml`.
Each repo has its own SSH deploy key (`github-oss` / `github-cfg` aliases).
See `docs/auto-install.md` for full setup guide.

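The symlink bridge can be pictured with a small sketch. This is only an illustration run in a temporary directory, not the deployment script: the helper name `link_instance_config` and the sample YAML content are hypothetical, and it assumes a platform where unprivileged symlinks are available (Linux or macOS).

```python
import tempfile
from pathlib import Path


def link_instance_config(repo_dir: Path, instance_dir: Path) -> Path:
    """Point repo/config/instance.yaml at the instance repo's copy."""
    target = instance_dir / "config" / "instance.yaml"
    link = repo_dir / "config" / "instance.yaml"
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()  # replace a stale link or a placeholder file
    link.symlink_to(target)
    return link


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    instance = Path(tmp) / "instance"
    (instance / "config").mkdir(parents=True)
    # Hypothetical config content, for demonstration only.
    (instance / "config" / "instance.yaml").write_text("instance:\n  name: Demo\n")

    link = link_instance_config(repo, instance)
    link_is_symlink = link.is_symlink()
    content_via_link = link.read_text()  # reads through to the instance repo
```

With this layout the OSS repo can be updated freely, since the private file physically lives only in the instance repo.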
## Development

```bash
# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run webapp locally
flask --app webapp.app run --debug

# Run tests
pytest tests/ -v

# Sync data
python -m src.data_sync
```

## Extensibility

### Data Sources
Pluggable data source connectors in `connectors/`:
- **Keboola** (`keboola`): Syncs from Keboola Storage API
- **CSV** (`csv`): Import from local CSV files (planned)
- New connector = `connectors/<name>/adapter.py` implementing `DataSource`

### Authentication
Pluggable auth providers in `auth/`:
- **Google** (`google`): OAuth via Google
- **Email** (`email`): Email magic link (itsdangerous token, no password needed)
- **Password** (`password`): Username/password authentication
- **Desktop** (`desktop`): JWT for desktop app API
- New provider = `auth/<name>/provider.py` implementing `AuthProvider`

Configure data source in `config/instance.yaml` under `data_source.type`.

## Server Management

```bash
# Add analyst user
sudo add-analyst username "ssh-rsa AAAA..."

# Add privileged analyst
sudo add-analyst username "ssh-rsa AAAA..." --private

# List analysts
list-analysts

# Server monitoring
uptime && free -h && df -h /data
```

## Returning Users

When reopening the project in Claude Code:
1. Sync latest data: `rsync -avz --no-perms --no-group data-analyst:server/parquet/ ./server/parquet/`
2. Verify DuckDB: `ls -lh user/duckdb/analytics.duckdb`
3. Start analyzing with Claude Code

## Key Implementation Details

### Config Loading Chain
1. `config/loader.py` loads `instance.yaml` (checks `$CONFIG_DIR`, then `./config/`)
2. `webapp/config.py` calls `_load_instance_config()` at module level
3. `_get(config, *keys, default="")` traverses nested dicts safely
4. `inject_config()` context processor exposes `Config` to all Jinja templates
5. Templates use `{{ config.INSTANCE_NAME }}`, `{{ config.INSTANCE_SUBTITLE }}`, etc.

### Connector Pattern
- ABC: `DataSource` class in `src/data_sync.py`
- Registry: `create_data_source()` in `src/data_sync.py` auto-discovers connectors in `connectors/`
- Keboola: `connectors/keboola/adapter.py` -> `KeboolaDataSource` implementing `DataSource`
- Core Keboola logic: `connectors/keboola/client.py` (Keboola Storage API wrapper)

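The general shape of this pattern (an abstract base class plus a factory keyed by source type) can be sketched as below. Everything here is illustrative: the abstract methods `list_tables` and `fetch_rows`, the `InMemoryDataSource` class, and the explicit `_REGISTRY` dict are assumptions, and the real `create_data_source()` auto-discovers connectors rather than using a hand-written dict.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical minimal interface; the real ABC lives in src/data_sync.py."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        """Return the names of tables this source can sync."""

    @abstractmethod
    def fetch_rows(self, table: str) -> list[dict]:
        """Return all rows of one table as dicts."""


class InMemoryDataSource(DataSource):
    """Toy connector used only to demonstrate the interface."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self._tables)

    def fetch_rows(self, table: str) -> list[dict]:
        return list(self._tables[table])


# Explicit registry for the sketch; stands in for auto-discovery.
_REGISTRY: dict[str, type[DataSource]] = {"memory": InMemoryDataSource}


def create_data_source(source_type: str, **kwargs) -> DataSource:
    """Instantiate the connector registered under `source_type`."""
    try:
        connector_cls = _REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unknown data source type: {source_type!r}") from None
    return connector_cls(**kwargs)


source = create_data_source("memory", tables={"orders": [{"id": 1}], "users": []})
print(source.list_tables())
```

Because callers only depend on the `DataSource` interface, the sync engine stays vendor-neutral and a new connector does not require changes to the orchestration code.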
### Auth Provider Pattern
- ABC: `AuthProvider` class in `auth/__init__.py`
- Discovery: `discover_providers()` scans `auth/*/provider.py`
- Providers: google, email, password, desktop (each exports `provider` instance)
- Email provider: uses `itsdangerous.URLSafeTimedSerializer` for magic link tokens
- Multi-domain: `auth.allowed_domain` in instance.yaml supports comma-separated domains
- Session contract: all providers set `session["user"] = {"email", "name", "picture"}`

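Two of the points above, the comma-separated `auth.allowed_domain` value and the shared session contract, can be sketched with plain functions. The helper names (`parse_allowed_domains`, `is_email_allowed`, `build_session_user`) are hypothetical and the matching rules (case-insensitive, exact domain match) are assumptions; the real providers may validate differently.

```python
def parse_allowed_domains(raw: str) -> set[str]:
    """Split a comma-separated domain list, ignoring blanks and case."""
    return {part.strip().lower() for part in raw.split(",") if part.strip()}


def is_email_allowed(email: str, allowed_domains: set[str]) -> bool:
    """Accept only addresses whose domain is exactly one of the allowed ones."""
    local, sep, domain = email.strip().lower().rpartition("@")
    return bool(sep and local and domain in allowed_domains)


def build_session_user(email: str, name: str = "", picture: str = "") -> dict:
    """Build the dict every provider is expected to store in session["user"]."""
    return {"email": email, "name": name, "picture": picture}


allowed = parse_allowed_domains("acme.com, Acme.io ,")
session = {}  # stands in for the Flask session in this sketch
if is_email_allowed("jane@acme.io", allowed):
    session["user"] = build_session_user("jane@acme.io", name="Jane")

print(session)
```

A fixed session shape is what lets templates and API handlers stay unaware of which provider performed the login.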
### Service Pattern
- Self-contained modules in `services/` with `__main__.py` for `python -m services.<name>`
- Systemd files in `services/<name>/systemd/`, auto-discovered by `deploy.sh`
- Services: telegram_bot, ws_gateway, corporate_memory, session_collector

### Business Metrics Pattern
- YAML definitions in `docs/metrics/{category}/{metric}.yml` (each file is a list containing one dict)
- `webapp/utils/metric_parser.py` - parses YAML, structures for modal UI, auto-discovers `sql_*` fields
- `webapp/app.py` `_load_metrics_data()` - scans metrics dir, groups by category, returns ordered list
- Catalog template renders dynamically via Jinja loop (no hardcoded metrics)
- Profiler links metrics to tables via `used_by_metrics` in `profiles.json`
- Production: metrics in instance repo deployed to `/data/docs/metrics/`
- Sample/dev: OSS repo `docs/metrics/` (10 e-commerce metrics)

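To make the grouping and `sql_*` discovery concrete, here is a rough sketch that works on already-parsed metric files (plain dicts, so no YAML library is needed). The function names, the metric field names, and the returned structure are assumptions for illustration; they are not taken from `metric_parser.py` or `_load_metrics_data()`.

```python
from pathlib import PurePosixPath


def collect_sql_fields(metric: dict) -> dict[str, str]:
    """Pick up every key starting with "sql_" without naming them in advance."""
    return {key: value for key, value in metric.items() if key.startswith("sql_")}


def group_by_category(parsed_files: dict[str, list[dict]]) -> list[dict]:
    """Group metrics by their parent directory, ordered by category then path.

    `parsed_files` maps a path relative to the metrics dir to the parsed file
    content, which is assumed to be a list holding exactly one dict.
    """
    grouped: dict[str, list[dict]] = {}
    for rel_path in sorted(parsed_files):
        category = PurePosixPath(rel_path).parent.name
        metric = parsed_files[rel_path][0]
        entry = {**metric, "sql": collect_sql_fields(metric)}
        grouped.setdefault(category, []).append(entry)
    return [{"category": name, "metrics": items} for name, items in grouped.items()]


# Hypothetical parsed content; field names are illustrative only.
parsed = {
    "revenue/total_revenue.yml": [
        {"name": "total_revenue", "sql_total": "SELECT SUM(amount) FROM orders"}
    ],
    "customers/repeat_rate.yml": [
        {"name": "repeat_rate", "sql_by_month": "SELECT 1", "sql_overall": "SELECT 2"}
    ],
}

catalog = group_by_category(parsed)
print([group["category"] for group in catalog])
```

Discovering `sql_*` keys by prefix means a metric author can add a new query variant to a YAML file without any change to the parser or the template.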
### Table Registry Pattern
- `src/table_registry.py` - central CRUD for registered tables with atomic JSON persistence
- Audit logging for register/unregister operations
- Generates `data_description.md` from registry state

### Server Patterns
- Atomic JSON writes: `tempfile.mkstemp()` + `os.fchmod(fd, 0o660)` + `os.replace()`
- User home writes: `sudo /usr/bin/install -o {user} -g {user}` pattern
- Staging dir: `/tmp/data_analyst_staging` (deploy.sh creates it with setgid)
- Dev docs: `dev_docs/server.md` documents all established patterns

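The atomic JSON write listed first can be sketched as follows. This is one possible arrangement of the three calls named above, not a copy of the project's code: the function name `write_json_atomic` and the demo file name are hypothetical, the `os.fsync` call is an extra step added here, and `os.fchmod` makes the sketch Unix-only.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict, mode: int = 0o660) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            os.fchmod(handle.fileno(), mode)  # mkstemp creates files as 0o600
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # extra durability step (assumption)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave partial temp files behind
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "tables.json")  # hypothetical file name
    write_json_atomic(target, {"tables": ["orders"]})
    write_json_atomic(target, {"tables": ["orders", "users"]})  # overwrite

    with open(target, encoding="utf-8") as handle:
        loaded = json.load(handle)
    file_mode = os.stat(target).st_mode & 0o777
    remaining_files = sorted(os.listdir(tmp))

print(loaded, oct(file_mode))
```

Setting the mode on the temp file before the rename means the final file never appears with the restrictive `mkstemp` default, which matters when several service users share a group.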
### Files NOT to modify (stable infrastructure)
- `src/parquet_manager.py` - Parquet conversion engine
- `connectors/jira/file_lock.py` - Advisory file locking
- `connectors/jira/incremental_transform.py` - Jira monthly Parquet transform
- `services/ws_gateway/` - WebSocket notification gateway

## Git Commits & Pull Requests

- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs