diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index c555d0a..0a163d9 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -64,20 +64,44 @@ Flask app for user onboarding, settings, and data catalog.
 | `config/.env.template` | Secret variable placeholders |
 | `docs/data_description.md` | Table schemas + sync strategies (not committed) |

-### 4. Server Infrastructure (`server/`)
+### 4. Auth Providers (`auth/`)

-Deployment, systemd services, security.
+Pluggable authentication via auto-discovered providers.
+
+| File | Role |
+|------|------|
+| `auth/__init__.py` | `AuthProvider` ABC + `discover_providers()` scanner |
+| `auth/google/provider.py` | Google OAuth (extracted from webapp/auth.py) |
+| `auth/password/provider.py` | Email/password (delegates to webapp/password_auth) |
+| `auth/desktop/provider.py` | Desktop JWT auth (API-only, hidden from login page) |
+
+To add a new provider: create `auth//provider.py` implementing `AuthProvider`, export a `provider` instance. No core changes needed.
+
+### 5. Standalone Services (`services/`)
+
+Self-contained services with own systemd units, auto-discovered by `deploy.sh`.
+
+| Directory | Role |
+|-----------|------|
+| `services/telegram_bot/` | Telegram notification bot + dispatch |
+| `services/ws_gateway/` | WebSocket gateway for desktop app |
+| `services/corporate_memory/` | AI knowledge aggregation from analyst sessions |
+| `services/session_collector/` | Claude Code session metadata collector |
+
+### 6. Server Infrastructure (`server/`)
+
+Deployment only -- no application code.

 | File | Role |
 |------|------|
 | `server/setup.sh` | Initial server provisioning (groups, users, dirs) |
 | `server/webapp-setup.sh` | Nginx, SSL, systemd for webapp |
-| `server/deploy.sh` | CI/CD deployment script |
+| `server/deploy.sh` | CI/CD deployment (auto-discovers `services/*/systemd/*`) |
 | `server/sudoers-deploy` | Least-privilege sudo rules for deploy user |
 | `server/sudoers-webapp` | Sudo rules for www-data (webapp) |
 | `server/bin/` | Management scripts (add-analyst, list-analysts, etc.) |

-### 5. Analyst Scripts (`scripts/`)
+### 7. Analyst Scripts (`scripts/`)

 Helper scripts synced to analyst machines.

@@ -129,6 +153,8 @@ inject_config() context processor
 ## Key Patterns

 - **Connector pattern**: Dynamic connector registry in `src/data_sync.py`, `connectors/keboola/` for reference
+- **Auth provider pattern**: Auto-discovered from `auth/*/provider.py`, each implements `AuthProvider` ABC
+- **Service pattern**: Self-contained modules in `services/` with own `__main__.py` and `systemd/` directory
 - **Atomic writes**: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` for JSON state files
 - **User home writes**: `sudo install -o {user} -g {user}` for writing to analyst home dirs
 - **Config interpolation**: `${ENV_VAR}` in YAML resolved at load time, missing vars logged as warnings
diff --git a/CLAUDE.md b/CLAUDE.md
index 958eb55..4e611b9 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -32,17 +32,26 @@ Ask the user for:
 ## Project Structure

 ```
-├── src/  # Core data sync engine
+├── src/  # Core data sync engine (vendor-neutral)
 │   ├── config.py  # Configuration from data_description.md
 │   ├── data_sync.py  # Sync orchestration + DataSource ABC
 │   ├── parquet_manager.py  # Parquet file management
 │   └── profiler.py  # Data profiling
-├── connectors/  # Data source connectors
+├── connectors/  # Data source connectors (pluggable)
 │   ├── keboola/  # Keboola Storage connector
 │   └── jira/  # Jira webhook connector
+├── auth/  # Authentication providers (pluggable)
+│   ├── google/  # Google OAuth provider
+│   ├── password/  # Email/password provider
+│   └── desktop/  # Desktop JWT provider (API-only)
+├── services/  # Standalone services (own systemd units)
+│   ├── telegram_bot/  # Telegram notification bot
+│   ├── ws_gateway/  # WebSocket notification gateway
+│   ├── corporate_memory/  # AI knowledge aggregation
+│   └── session_collector/  # Claude Code session collector
 ├── webapp/  # Flask web portal (login, dashboard, API)
-├── server/  # Server deployment (systemd, scripts)
-├── scripts/  # Utility scripts (sync, DuckDB setup)
+├── server/  # Deployment infrastructure only
+├── scripts/  # Utility scripts (sync, DuckDB setup, dev)
 ├── config/  # Configuration templates
 │   ├── instance.yaml.example
 │   └── data_description.md.example
@@ -97,14 +106,22 @@ pytest tests/ -v
 python -m src.data_sync
 ```

-## Data Source Adapters
+## Extensibility

-The platform supports pluggable data sources via `connectors/`:
-- **Keboola** (`keboola`): Syncs from Keboola Storage API (see `connectors/keboola/`)
+### Data Sources
+Pluggable data source connectors in `connectors/`:
+- **Keboola** (`keboola`): Syncs from Keboola Storage API
 - **CSV** (`csv`): Import from local CSV files (planned)
-- **BigQuery** (`bigquery`): Query from Google BigQuery (planned)
+- New connector = `connectors//adapter.py` implementing `DataSource`

-Configure in `config/instance.yaml` under `data_source.type`.
+### Authentication
+Pluggable auth providers in `auth/`:
+- **Google** (`google`): OAuth via Google
+- **Password** (`password`): Email/password with magic links
+- **Desktop** (`desktop`): JWT for desktop app API
+- New provider = `auth//provider.py` implementing `AuthProvider`
+
+Configure data source in `config/instance.yaml` under `data_source.type`.

 ## Server Management

@@ -144,6 +161,17 @@ When reopening the project in Claude Code:
 - Keboola: `connectors/keboola/adapter.py` -> `KeboolaDataSource` implementing `DataSource`
 - Core Keboola logic: `connectors/keboola/client.py` (Keboola Storage API wrapper)

+### Auth Provider Pattern
+- ABC: `AuthProvider` class in `auth/__init__.py`
+- Discovery: `discover_providers()` scans `auth/*/provider.py`
+- Providers: google, password, desktop (each exports `provider` instance)
+- Session contract: all providers set `session["user"] = {"email", "name", "picture"}`
+
+### Service Pattern
+- Self-contained modules in `services/` with `__main__.py` for `python -m services.`
+- Systemd files in `services//systemd/`, auto-discovered by `deploy.sh`
+- Services: telegram_bot, ws_gateway, corporate_memory, session_collector
+
 ### Server Patterns
 - Atomic JSON writes: `tempfile.mkstemp()` + `os.fchmod(fd, 0o660)` + `os.replace()`
 - User home writes: `sudo /usr/bin/install -o {user} -g {user}` pattern
diff --git a/README.md b/README.md
index bbe8b7b..e01fecb 100644
--- a/README.md
+++ b/README.md
@@ -40,7 +40,8 @@ flowchart TB

 ## Features

-- **Pluggable data sources** -- adapter interface supporting Keboola out of the box, CSV import, and extensible to BigQuery, Snowflake, and others.
+- **Pluggable data sources** -- connector interface supporting Keboola out of the box, CSV import, and extensible to BigQuery, Snowflake, and others.
+- **Pluggable authentication** -- auto-discovered auth providers (Google OAuth, email/password, desktop JWT, or custom).
 - **Automatic Parquet conversion** -- source data is converted to typed, partitioned Parquet files for efficient local querying.
 - **SSH-based distribution** -- analysts sync data securely via rsync; no cloud credentials leave the server.
 - **Claude Code as analyst interface** -- natural language queries against DuckDB, powered by Claude.
@@ -81,37 +82,43 @@ ai-data-analyst/
 │   ├── instance.yaml.example  # Main config template (copy to instance.yaml)
 │   └── data_description.md.example  # Data schema template
 │
-├── src/  # Server-side Python code
+├── src/  # Core data sync engine (vendor-neutral)
 │   ├── data_sync.py  # Orchestrates data pull + DataSource ABC
 │   ├── parquet_manager.py  # CSV to Parquet conversion
 │   ├── config.py  # Configuration loader
 │   └── profiler.py  # Data profiling for catalog
 │
-├── connectors/  # Data source connectors
+├── connectors/  # Data source connectors (pluggable)
 │   ├── keboola/  # Keboola Storage connector
 │   │   ├── adapter.py  # KeboolaDataSource (implements DataSource)
 │   │   └── client.py  # Low-level Keboola API client
 │   └── jira/  # Jira webhook connector
 │
+├── auth/  # Authentication providers (pluggable)
+│   ├── google/  # Google OAuth provider
+│   ├── password/  # Email/password provider
+│   └── desktop/  # Desktop JWT provider (API-only)
+│
+├── services/  # Standalone services (own systemd units)
+│   ├── telegram_bot/  # Telegram notification bot
+│   ├── ws_gateway/  # WebSocket notification gateway
+│   ├── corporate_memory/  # AI knowledge aggregation
+│   └── session_collector/  # Claude Code session collector
+│
 ├── webapp/  # Flask web application
 │   └── ...  # User onboarding, settings, catalog
 │
-├── server/  # Deployment and server management
-│   ├── deploy.sh  # Deployment script
-│   └── ...  # Systemd units, sudoers, cron jobs
+├── server/  # Deployment infrastructure only
+│   ├── deploy.sh  # Deployment script (auto-discovers services)
+│   └── ...  # Sudoers, nginx, setup scripts
 │
-├── scripts/  # Analyst-facing helper scripts
+├── scripts/  # Helper scripts
 │   ├── sync_data.sh  # Sync data from server
-│   └── setup_views.sh  # Initialize DuckDB views
+│   ├── setup_views.sh  # Initialize DuckDB views
+│   └── dev_run.py  # Dev server with auth bypass
 │
 ├── docs/  # User-facing documentation
-│   ├── QUICKSTART.md  # Setup guide
-│   └── data_description.md  # Table schemas (single source of truth)
-│
 ├── dev_docs/  # Developer and operator documentation
-│   ├── server.md  # Server administration
-│   └── security.md  # Security model
-│
 ├── tests/  # Test suite
 ├── requirements.txt  # Python dependencies
 ├── CLAUDE.md  # Instructions for Claude Code
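
Note on the auth provider pattern documented above: the diff describes an `AuthProvider` ABC plus a `discover_providers()` scanner over `auth/*/provider.py`, but does not show either. The following is only a minimal sketch of how such discovery could work, not the code in `auth/__init__.py`. The `authenticate` method, the `package_name` parameter, the returned dict shape, and the requirement that each provider directory contains an `__init__.py` are all assumptions made for illustration.

```python
import importlib
import pkgutil
from abc import ABC, abstractmethod


class AuthProvider(ABC):
    """Hypothetical interface; the real ABC in auth/__init__.py may differ."""

    name: str = ""

    @abstractmethod
    def authenticate(self, credentials: dict) -> dict:
        """Return the session user dict with "email", "name", "picture" keys."""


def discover_providers(package_name: str = "auth") -> dict:
    """Collect the `provider` instance exported by each <package>/<sub>/provider.py.

    Assumes every provider directory is a regular package (has __init__.py).
    """
    package = importlib.import_module(package_name)
    found = {}
    for info in pkgutil.iter_modules(package.__path__):
        if not info.ispkg:
            continue  # plain modules next to the provider packages are ignored
        module_name = f"{package_name}.{info.name}.provider"
        try:
            module = importlib.import_module(module_name)
        except ModuleNotFoundError as exc:
            if exc.name != module_name:
                raise  # a genuine missing dependency inside the provider
            continue  # subpackage without a provider.py
        candidate = getattr(module, "provider", None)
        if isinstance(candidate, AuthProvider):
            found[info.name] = candidate
    return found
```

With this shape, adding a provider means adding one directory that exports `provider`; the caller would iterate over `discover_providers().items()` to register login routes. Re-raising import errors that originate inside a provider keeps a broken dependency from silently hiding a login method.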