docs
This commit is contained in:
parent
495940d6b8
commit
879bc6c44f
2 changed files with 939 additions and 0 deletions
511
docs/architecture.md
Normal file
511
docs/architecture.md
Normal file
|
|
@ -0,0 +1,511 @@
|
|||
# Architecture — Detailed Reference
|
||||
|
||||
Comprehensive architectural overview of the OSS AI Data Analyst platform.
|
||||
For a concise summary, see [../ARCHITECTURE.md](../ARCHITECTURE.md).
|
||||
|
||||
## Top-Level Module Map
|
||||
|
||||
```
|
||||
oss-ai-data-analyst/
|
||||
├── src/ Core engine (config, sync, parquet, profiling)
|
||||
├── connectors/ Pluggable data connectors (keboola, jira)
|
||||
├── auth/ Pluggable auth providers (google, password, desktop)
|
||||
├── services/ Standalone background services
|
||||
├── webapp/ Flask web portal (dashboard, catalog, API)
|
||||
├── server/ Server deployment (setup, deploy, nginx, systemd)
|
||||
├── scripts/ Analyst-side utility scripts (sync, DuckDB, dev server)
|
||||
├── config/ Instance configuration (loader, templates)
|
||||
├── examples/ Example notification scripts
|
||||
├── tests/ Test suite
|
||||
├── dev_docs/ Internal development documentation
|
||||
└── docs/ User-facing documentation
|
||||
```
|
||||
|
||||
## Block Diagram
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ EXTERNAL DATA SOURCES │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
|
||||
│ │ Keboola │ │ Jira │ │ CSV │ │ BigQuery │ │
|
||||
│ │ Storage │ │ Cloud │ │ (plan) │ │ (plan) │ │
|
||||
│ └────┬─────┘ └────┬─────┘ └──────────┘ └──────────┘ │
|
||||
└────────┼──────────────┼────────────────────────────────────────────────────┘
|
||||
│ │
|
||||
▼ ▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ CONNECTORS (connectors/) auto-discovery via importlib │
|
||||
│ │
|
||||
│ ┌──────────────────────────┐ ┌─────────────────────────────────────┐ │
|
||||
│ │ connectors/keboola/ │ │ connectors/jira/ │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ adapter.py │ │ webhook.py Flask blueprint │ │
|
||||
│ │ KeboolaDataSource (ABC) │ │ service.py Jira REST API client │ │
|
||||
│ │ full/incr/partitioned │ │ transform.py JSON -> 6 Parquet tbl│ │
|
||||
│ │ │ │ incremental_transform.py realtime │ │
|
||||
│ │ client.py │ │ file_lock.py POSIX advisory locks │ │
|
||||
│ │ Keboola Storage API │ │ │ │
|
||||
│ │ type mapping + cache │ │ scripts/ backfill, SLA poll, │ │
|
||||
│ │ │ │ consistency check │ │
|
||||
│ │ tests/ │ │ systemd/ jira-sla-poll, │ │
|
||||
│ └──────────────────────────┘ │ jira-consistency │ │
|
||||
│ │ tests/ │ │
|
||||
│ Registry: src/data_sync.py └─────────────────────────────────────┘ │
|
||||
│ create_data_source(type) -> │
|
||||
│ importlib("connectors.{type}.adapter") │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
▼ Parquet files
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ CORE ENGINE (src/) │
|
||||
│ │
|
||||
│ ┌─────────────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │
|
||||
│ │ data_sync.py │ │ config.py │ │ profiler.py │ │
|
||||
│ │ DataSource ABC │ │ data_description │ │ Parquet -> stats │ │
|
||||
│ │ SyncState (JSON) │ │ .md parser │ │ alerts, sampling │ │
|
||||
│ │ DataSyncManager │ │ TableConfig │ │ -> profiles.json │ │
|
||||
│ │ create_data_source()│ │ WhereFilter │ └──────────────────────┘ │
|
||||
│ └─────────────────────┘ │ ForeignKey │ │
|
||||
│ │ get_config() │ ┌──────────────────────┐ │
|
||||
│ └──────────────────┘ │ parquet_manager.py │ │
|
||||
│ │ CSV->Parquet, merge │ │
|
||||
│ │ upsert, schema │ │
|
||||
│ └──────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│ /data/src_data/parquet/
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ AUTH PROVIDERS (auth/) auto-discovery via scan │
|
||||
│ │
|
||||
│ ┌────────────────┐ ┌────────────────┐ ┌──────────────────────┐ │
|
||||
│ │ auth/google/ │ │ auth/password/ │ │ auth/desktop/ │ │
|
||||
│ │ │ │ │ │ │ │
|
||||
│ │ Google OAuth │ │ Email+password │ │ JWT for desktop app │ │
|
||||
│ │ SSO (Authlib) │ │ Argon2 hash │ │ visible=False │ │
|
||||
│ │ domain restrict │ │ SendGrid email │ │ (API-only, not login) │ │
|
||||
│ │ order=10 │ │ order=20 │ │ order=100 │ │
|
||||
│ └────────────────┘ └────────────────┘ └──────────────────────┘ │
|
||||
│ │
|
||||
│ ABC: AuthProvider (get_name, get_blueprint, get_login_button, is_avail.) │
|
||||
│ Discovery: discover_providers() -> scans auth/*/provider.py │
|
||||
│ Contract: all providers set session["user"] = {email, name, picture} │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│ Blueprints registered in Flask app
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ WEB PORTAL (webapp/) │
|
||||
│ │
|
||||
│ ┌───────────────────┐ ┌──────────────────────────────────────────────┐ │
|
||||
│ │ app.py (Flask) │ │ Pages │ │
|
||||
│ │ - discover auth │ │ /dashboard - account, stats, setup │ │
|
||||
│ │ providers │ │ /catalog - data catalog + profiles │ │
|
||||
│ │ - register │ │ /corporate-memory - knowledge + voting │ │
|
||||
│ │ blueprints │ │ /activity-center - intelligence overview │ │
|
||||
│ │ - inject_config() │ └──────────────────────────────────────────────┘ │
|
||||
│ │ - routes │ │
|
||||
│ └───────────────────┘ ┌──────────────────────────────────────────────┐ │
|
||||
│ │ API Endpoints │ │
|
||||
│ ┌───────────────────┐ │ /webhooks/jira (HMAC, -> jira connector)│ │
|
||||
│ │ webapp services │ │ /api/telegram/* (link/unlink/status) │ │
|
||||
│ │ user_service │ │ /api/desktop/* (JWT, scripts, run) │ │
|
||||
│ │ account_service │ │ /api/sync-settings (GET/POST) │ │
|
||||
│ │ sync_settings_svc │ │ /api/corporate-memory/* (CRUD, votes) │ │
|
||||
│ │ telegram_service │ │ /api/catalog/profile/<table> │ │
|
||||
│ │ email_service │ │ /health (service health) │ │
|
||||
│ │ health_service │ └──────────────────────────────────────────────┘ │
|
||||
│ │ corporate_memory │ │
|
||||
│ └───────────────────┘ Config chain: instance.yaml -> loader -> Config -> │
|
||||
│ inject_config() -> {{ config.X }} in Jinja │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ BACKGROUND SERVICES (services/) each = __main__.py + systemd │
|
||||
│ │
|
||||
│ ┌────────────────────────┐ ┌─────────────────────────────────────────┐ │
|
||||
│ │ services/telegram_bot/ │ │ services/ws_gateway/ │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ bot.py polling + │ │ gateway.py WebSocket TCP:8765 │ │
|
||||
│ │ HTTP socket │ │ + HTTP dispatch socket │ │
|
||||
│ │ runner.py script exec │ │ auth.py JWT validation │ │
|
||||
│ │ sender.py msg dispatch │ │ config.py gateway config │ │
|
||||
│ │ dispatch.py -> WS gw │ │ │ │
|
||||
│ │ storage.py JSON state │ │ Heartbeat: ping/pong, 3 miss = drop │ │
|
||||
│ │ status.py /status cmd │ │ Per-user connection limit (5) │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ Always running (systemd)│ │ Always running (systemd) │ │
|
||||
│ └────────────────────────┘ └─────────────────────────────────────────┘ │
|
||||
│ │
|
||||
│ ┌────────────────────────┐ ┌─────────────────────────────────────────┐ │
|
||||
│ │ services/ │ │ services/ │ │
|
||||
│ │ corporate_memory/ │ │ session_collector/ │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ collector.py │ │ collector.py │ │
|
||||
│ │ Scans CLAUDE.local.md │ │ Copies .jsonl from user homes │ │
|
||||
│ │ -> Claude Haiku -> JSON│ │ to /data/user_sessions/ │ │
|
||||
│ │ MD5 change detection │ │ Idempotent, atomic writes │ │
|
||||
│ │ prompts.py │ │ │ │
|
||||
│ │ LLM prompts for │ │ Timer: every 6 hours │ │
|
||||
│ │ knowledge extraction │ │ │ │
|
||||
│ │ │ │ │ │
|
||||
│ │ Timer: every 30 min │ │ │ │
|
||||
│ └────────────────────────┘ └─────────────────────────────────────────┘ │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│ Unix sockets + /data/ filesystem
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ SERVER INFRASTRUCTURE (server/) │
|
||||
│ │
|
||||
│ ┌──────────────────┐ ┌────────────────────┐ ┌───────────────────────┐ │
|
||||
│ │ Deployment │ │ User Management │ │ Web Server │ │
|
||||
│ │ setup.sh │ │ bin/add-analyst │ │ webapp-nginx.conf │ │
|
||||
│ │ deploy.sh (CI/CD) │ │ bin/list-analysts │ │ webapp.service │ │
|
||||
│ │ webapp-setup.sh │ │ bin/notify-runner │ │ SSL (Let's Encrypt) │ │
|
||||
│ │ sudoers rules │ │ bin/notify-scripts │ │ Gunicorn + Unix sock │ │
|
||||
│ └──────────────────┘ └────────────────────┘ └───────────────────────┘ │
|
||||
│ │
|
||||
│ Groups: dataread (analysts) | data-private (privileged) | data-ops (admin) │
|
||||
│ │
|
||||
│ /data/ │
|
||||
│ ├── src_data/parquet/ shared data (readonly for analysts) │
|
||||
│ ├── src_data/metadata/ sync_state.json, profiles.json │
|
||||
│ ├── src_data/raw/jira/ webhook JSON, attachments │
|
||||
│ ├── docs/ , scripts/ documentation, helper scripts │
|
||||
│ ├── notifications/ telegram_users, desktop_users, codes │
|
||||
│ ├── corporate-memory/ knowledge.json, votes.json │
|
||||
│ └── user_sessions/ centralized Claude Code transcripts │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
│ rsync (SSH) - scripts/sync_data.sh (bi-directional)
|
||||
▼
|
||||
┌─────────────────────────────────────────────────────────────────────────────┐
|
||||
│ ANALYST WORKSTATION (local) │
|
||||
│ │
|
||||
│ server/ (read-only, rsynced from broker) │
|
||||
│ ├── parquet/, docs/, scripts/, metadata/ │
|
||||
│ │
|
||||
│ user/ (writable workspace, backed up to server) │
|
||||
│ ├── duckdb/analytics.duckdb SQL views over parquet │
|
||||
│ ├── notifications/*.py custom notification scripts │
|
||||
│ ├── sessions/ Claude Code transcripts │
|
||||
│ └── artifacts/ analysis outputs │
|
||||
│ │
|
||||
│ .claude/rules/ corporate memory knowledge rules │
|
||||
│ │
|
||||
│ Claude Code <- local analysis over DuckDB + Parquet │
|
||||
└─────────────────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
## Auto-Discovery Patterns
|
||||
|
||||
The platform uses three symmetrical auto-discovery mechanisms. Adding a new
|
||||
connector, auth method, or service requires no changes to existing code.
|
||||
|
||||
### 1. Connector Discovery (`src/data_sync.py`)
|
||||
|
||||
```
|
||||
config/instance.yaml -> data_source.type: "keboola"
|
||||
-> importlib.import_module("connectors.keboola.adapter")
|
||||
-> KeboolaDataSource (implements DataSource ABC)
|
||||
```
|
||||
|
||||
- Factory: `create_data_source(type)` in `src/data_sync.py`
|
||||
- Connectors live in `connectors/{name}/adapter.py`
|
||||
- Must export a `DataSource` subclass or a `create_data_source()` factory function
|
||||
- Keboola is hard-coded for ImportError handling; all others use dynamic import
|
||||
|
||||
### 2. Auth Provider Discovery (`auth/__init__.py`)
|
||||
|
||||
```
|
||||
startup -> scan auth/*/provider.py
|
||||
-> import `provider` instance
|
||||
-> filter by is_available() (checks env vars)
|
||||
-> register blueprint + login button in Flask
|
||||
```
|
||||
|
||||
- ABC: `AuthProvider` with methods `get_name()`, `get_blueprint()`, `get_login_button()`, `is_available()`, `init_app()`
|
||||
- Session contract: all providers set `session["user"] = {email, name, picture}`
|
||||
- Login page renders buttons dynamically, sorted by `order` field
|
||||
|
||||
### 3. Service Pattern (`services/*/__main__.py`)
|
||||
|
||||
```
|
||||
python -m services.<name> # entry point
|
||||
services/<name>/systemd/ # unit files
|
||||
deploy.sh auto-discovers # systemd/* in each service dir
|
||||
```
|
||||
|
||||
- Each service is self-contained: code, systemd units, and config in one directory
|
||||
- `deploy.sh` scans `services/*/systemd/*.service` and `connectors/*/systemd/*.service`
|
||||
- Long-running services (telegram_bot, ws_gateway) use async dual-server model
|
||||
- Periodic services (corporate_memory, session_collector) are systemd timer oneshots
|
||||
|
||||
## Data Flows
|
||||
|
||||
### Pull Sync (Keboola)
|
||||
|
||||
```
|
||||
Keboola Storage API
|
||||
-> connectors/keboola/client.py (export CSV with filters)
|
||||
-> src/parquet_manager.py (convert to typed Parquet)
|
||||
-> /data/src_data/parquet/ (stored on broker)
|
||||
-> rsync to analyst (scripts/sync_data.sh)
|
||||
-> DuckDB views (scripts/setup_views.sh)
|
||||
```
|
||||
|
||||
Sync strategies: `full_refresh`, `incremental`, `partitioned`, `chunked_initial_load`.
|
||||
|
||||
### Push Sync (Jira)
|
||||
|
||||
```
|
||||
Jira Cloud webhook (issue created/updated/deleted)
|
||||
-> connectors/jira/webhook.py (HMAC-SHA256 verification)
|
||||
-> connectors/jira/service.py (fetch full issue + attachments)
|
||||
-> /data/src_data/raw/jira/issues/ (atomic JSON write)
|
||||
-> connectors/jira/incremental_transform.py (update monthly Parquet)
|
||||
-> /data/src_data/parquet/jira/ (6 tables: issues, comments,
|
||||
attachments, changelog,
|
||||
issuelinks, remote_links)
|
||||
```
|
||||
|
||||
Background jobs supplement the webhook pipeline:
|
||||
- `jira-sla-poll` (every 5 min): refreshes SLA fields for open tickets
|
||||
- `jira-consistency` (every 6h): detects and backfills missing issues
|
||||
|
||||
### Notification Pipeline
|
||||
|
||||
```
|
||||
~/user/notifications/*.py analyst's custom scripts
|
||||
-> server/bin/notify-runner (cron, executes with timeout)
|
||||
-> cooldown check (~/.notifications/state/)
|
||||
├-> services/telegram_bot/ (Unix socket /run/notify-bot/bot.sock)
|
||||
│ -> Telegram chat message (text or photo)
|
||||
└-> services/ws_gateway/ (Unix socket /run/ws-gateway/ws.sock)
|
||||
-> WebSocket push to desktop app
|
||||
```
|
||||
|
||||
Script output format:
|
||||
```json
|
||||
{
|
||||
"notify": true,
|
||||
"title": "Revenue dropped 25%",
|
||||
"message": "Details...",
|
||||
"cooldown": "6h",
|
||||
"image_path": "/tmp/chart.png"
|
||||
}
|
||||
```
|
||||
|
||||
### Knowledge Loop (Corporate Memory)
|
||||
|
||||
```
|
||||
Analyst writes CLAUDE.local.md (insights, patterns, tips)
|
||||
-> scripts/sync_data.sh (uploads to server)
|
||||
-> services/corporate_memory/ (timer, every 30 min)
|
||||
-> MD5 change detection
|
||||
-> Claude Haiku extracts knowledge items
|
||||
-> /data/corporate-memory/knowledge.json
|
||||
-> webapp /corporate-memory (voting UI: upvote/downvote)
|
||||
-> scripts/sync_data.sh (downloads to analyst)
|
||||
-> .claude/rules/ (rules for Claude Code)
|
||||
-> Claude Code uses rules in next session
|
||||
```
|
||||
|
||||
## Module Reference
|
||||
|
||||
### Core Engine (`src/`)
|
||||
|
||||
| File | Lines | Responsibility |
|
||||
|------|-------|----------------|
|
||||
| `data_sync.py` | ~1400 | `DataSource` ABC, `SyncState`, `DataSyncManager`, connector factory |
|
||||
| `config.py` | ~600 | Parse `data_description.md` YAML blocks, `TableConfig`, `WhereFilter`, `ForeignKey` |
|
||||
| `parquet_manager.py` | ~750 | CSV-to-Parquet conversion, merge, upsert, schema enforcement |
|
||||
| `profiler.py` | ~1200 | Data profiling: stats, alerts, type classification -> `profiles.json` |
|
||||
|
||||
### Connectors (`connectors/`)
|
||||
|
||||
| Module | Files | Sync Model | Description |
|
||||
|--------|-------|------------|-------------|
|
||||
| `keboola/` | adapter.py, client.py, tests/ | Pull (DataSource ABC) | Keboola Storage API, type mapping, metadata caching (24h TTL) |
|
||||
| `jira/` | webhook.py, service.py, transform.py, incremental_transform.py, file_lock.py, scripts/, systemd/, tests/ | Push (webhook) | Real-time webhook pipeline, SLA polling, consistency monitoring, 6 output Parquet tables |
|
||||
|
||||
### Auth Providers (`auth/`)
|
||||
|
||||
| Provider | Available when | Login UI | Order | Description |
|
||||
|----------|---------------|----------|-------|-------------|
|
||||
| `google/` | `GOOGLE_CLIENT_ID` set | Yes | 10 | Google OAuth SSO with domain restriction |
|
||||
| `password/` | `SENDGRID_API_KEY` set | Yes | 20 | Email + password for external users (Argon2, rate limiting) |
|
||||
| `desktop/` | `DESKTOP_JWT_SECRET` set | No (API-only) | 100 | JWT tokens for native desktop app |
|
||||
|
||||
### Background Services (`services/`)
|
||||
|
||||
| Service | Type | Schedule | Description |
|
||||
|---------|------|----------|-------------|
|
||||
| `telegram_bot/` | Long-running | Always on | Telegram polling + HTTP dispatch socket, script execution, /status /test commands |
|
||||
| `ws_gateway/` | Long-running | Always on | WebSocket TCP:8765 + HTTP dispatch socket, JWT auth, heartbeat |
|
||||
| `corporate_memory/` | Timer oneshot | Every 30 min | AI knowledge extraction from CLAUDE.local.md via Claude Haiku |
|
||||
| `session_collector/` | Timer oneshot | Every 6 hours | Copy session .jsonl from user homes to central storage |
|
||||
|
||||
### Web Portal (`webapp/`)
|
||||
|
||||
| File | Responsibility |
|
||||
|------|----------------|
|
||||
| `app.py` | Flask factory, blueprint registration, route definitions, context processors |
|
||||
| `config.py` | Load `instance.yaml`, expose `Config` to templates |
|
||||
| `auth.py` | Core auth infrastructure: `login_required`, `validate_email_domain`, `/login`, `/logout` |
|
||||
| `user_service.py` | Username derivation, SSH key validation, system account creation |
|
||||
| `account_service.py` | Dashboard account widget data, cron info, sync status |
|
||||
| `sync_settings_service.py` | Per-user dataset sync preferences |
|
||||
| `telegram_service.py` | Telegram account linking/unlinking |
|
||||
| `desktop_auth.py` | JWT generation/validation, desktop app link state |
|
||||
| `password_auth.py` | Password auth implementation (Argon2, rate limiting, token workflow) |
|
||||
| `email_service.py` | SendGrid integration for setup/reset emails |
|
||||
| `corporate_memory_service.py` | Knowledge CRUD, voting, user rules regeneration |
|
||||
| `health_service.py` | System health checks (services, timers, disk, load, webhooks) |
|
||||
| `notification_images.py` | Serve chart PNGs generated by notification runner |
|
||||
| `utils/metric_parser.py` | Parse business metric YAML definitions for catalog UI |
|
||||
|
||||
### Server Infrastructure (`server/`)
|
||||
|
||||
| File | Responsibility |
|
||||
|------|----------------|
|
||||
| `setup.sh` | Initial server bootstrap (groups, users, directories, venv) |
|
||||
| `deploy.sh` | CI/CD deployment (git pull, deps, scripts, services, ACLs) |
|
||||
| `webapp-setup.sh` | Nginx + SSL + Gunicorn setup |
|
||||
| `webapp-nginx.conf` | Nginx reverse proxy config (HTTPS, WebSocket upgrade) |
|
||||
| `webapp.service` | Systemd unit for Gunicorn |
|
||||
| `sudoers-deploy` | Sudo rules for deploy user (least-privilege) |
|
||||
| `sudoers-webapp` | Sudo rules for www-data |
|
||||
| `bin/add-analyst` | Create analyst user with workspace structure |
|
||||
| `bin/list-analysts` | List registered analysts |
|
||||
| `bin/notify-runner` | Execute user notification scripts, dispatch to bot + gateway |
|
||||
| `bin/notify-scripts` | List/run notification scripts for a user |
|
||||
|
||||
### Analyst Scripts (`scripts/`)
|
||||
|
||||
| File | Responsibility |
|
||||
|------|----------------|
|
||||
| `sync_data.sh` | Bi-directional rsync: download data, upload workspace, refresh DuckDB |
|
||||
| `setup_views.sh` | Create/replace DuckDB views over all Parquet files |
|
||||
| `duckdb_manager.py` | DuckDB setup utility |
|
||||
| `dev_run.py` | Development server with auth bypass |
|
||||
| `collect_session.py` | Session transcript collector (used by service) |
|
||||
| `generate_user_sync_configs.py` | Generate per-user sync config files |
|
||||
|
||||
## Analyst Workspace Layout
|
||||
|
||||
Created by `server/bin/add-analyst` for each registered user:
|
||||
|
||||
```
|
||||
/home/{username}/
|
||||
├── server/ read-only symlinks to shared data
|
||||
│ ├── parquet/ -> /data/src_data/parquet
|
||||
│ ├── docs/ -> /data/docs
|
||||
│ ├── scripts/ -> /data/scripts
|
||||
│ ├── metadata/ -> /data/src_data/metadata
|
||||
│ └── jira_attachments/ -> /data/src_data/raw/jira/attachments
|
||||
├── user/ writable workspace (backed up to server)
|
||||
│ ├── duckdb/ local DuckDB database
|
||||
│ ├── notifications/ custom notification scripts (*.py)
|
||||
│ ├── artifacts/ analysis outputs
|
||||
│ ├── scripts/ user helper scripts
|
||||
│ ├── parquet/ user Parquet files
|
||||
│ └── sessions/ Claude Code session transcripts
|
||||
├── .notifications/ notification runner state
|
||||
│ ├── state/ cooldown tracking (JSON per script)
|
||||
│ └── logs/ runner logs
|
||||
└── .claude/
|
||||
└── rules/ corporate memory knowledge rules (auto-synced)
|
||||
```
|
||||
|
||||
## Security Model
|
||||
|
||||
### System Groups
|
||||
|
||||
| Group | Access |
|
||||
|-------|--------|
|
||||
| `data-ops` | Full admin access to all server resources |
|
||||
| `dataread` | Read access to public Parquet data |
|
||||
| `data-private` | Read access to sensitive/restricted data |
|
||||
|
||||
### Authentication Layers
|
||||
|
||||
| Layer | Mechanism | Scope |
|
||||
|-------|-----------|-------|
|
||||
| Web portal | Google OAuth / email+password | Browser sessions |
|
||||
| Desktop app | JWT Bearer tokens | API endpoints (`/api/desktop/*`) |
|
||||
| Jira webhook | HMAC-SHA256 signature | Webhook endpoint |
|
||||
| SSH access | Key-based auth only | Data sync (rsync) |
|
||||
| Inter-service | Unix socket permissions | Bot, gateway, webapp |
|
||||
|
||||
### Permission Boundaries
|
||||
|
||||
- Analysts cannot access other users' home directories
|
||||
- Webapp (www-data) uses sudoers-whitelisted commands for user operations
|
||||
- Deploy user has explicit sudo rules for service management
|
||||
- Staging directory (`/tmp/data_analyst_staging`) uses setgid for group ownership
|
||||
- All JSON state files written atomically: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()`
|
||||
|
||||
## Configuration Chain
|
||||
|
||||
```
|
||||
config/instance.yaml (instance-specific, not committed)
|
||||
| loaded by config/loader.py
|
||||
| ${ENV_VAR} references resolved from .env / environment
|
||||
v
|
||||
webapp/config.py (Flask Config class)
|
||||
| _load_instance_config() at module level
|
||||
| _get(config, *keys) for safe nested access
|
||||
v
|
||||
inject_config() context processor (exposes Config to templates)
|
||||
v
|
||||
{{ config.INSTANCE_NAME }} in Jinja2 (all templates have access)
|
||||
```
|
||||
|
||||
Validation: `config/loader.py` checks required fields at startup (`instance.name`,
|
||||
`auth.allowed_domain`, `server.host`, `server.hostname`, `auth.webapp_secret_key`).
|
||||
Missing required fields cause immediate startup failure with a clear error message.
|
||||
|
||||
## Server Filesystem Layout
|
||||
|
||||
```
|
||||
/opt/data-analyst/
|
||||
├── repo/ git repository (deployed via CI/CD)
|
||||
├── .venv/ Python virtual environment
|
||||
├── logs/ application logs
|
||||
└── .env secrets (mode 0640)
|
||||
|
||||
/data/
|
||||
├── src_data/
|
||||
│ ├── parquet/ shared Parquet files (readonly for analysts)
|
||||
│ ├── metadata/ sync_state.json, profiles.json, table_metadata.json
|
||||
│ └── raw/jira/ webhook JSON files, attachments
|
||||
├── docs/ documentation and schema
|
||||
├── scripts/ helper scripts synced to analysts
|
||||
├── notifications/ telegram_users.json, desktop_users.json, pending_codes.json
|
||||
├── corporate-memory/ knowledge.json, votes.json, user_hashes.json
|
||||
└── user_sessions/ centralized Claude Code session transcripts
|
||||
|
||||
/run/
|
||||
├── notify-bot/bot.sock Telegram bot HTTP socket
|
||||
├── ws-gateway/ws.sock WebSocket gateway HTTP socket
|
||||
└── webapp/webapp.sock Gunicorn WSGI socket
|
||||
```
|
||||
|
||||
## CI/CD
|
||||
|
||||
### Deploy Guard (`.github/workflows/deploy-guard.yml`)
|
||||
|
||||
Runs on every pull request:
|
||||
1. `pytest tests/test_deploy_guard.py` - validates deploy.sh/sudoers/systemd consistency
|
||||
2. `pytest tests/test_sync_data.py -m "not live"` - validates sync script reliability
|
||||
3. `visudo -cf server/sudoers-*` - validates sudoers syntax in Docker
|
||||
|
||||
### Deployment (`.github/workflows/deploy.yml.example`)
|
||||
|
||||
Runs on push to main (or manual trigger):
|
||||
1. SSH into server
|
||||
2. Execute `server/deploy.sh` (git pull, deps, scripts, services, ACLs)
|
||||
428
docs/testing/vm_test_plan.md
Normal file
428
docs/testing/vm_test_plan.md
Normal file
|
|
@ -0,0 +1,428 @@
|
|||
# VM Test Plan - Self-Service Data Onboarding
|
||||
|
||||
End-to-end test of the full platform on a clean VM with a new GitHub repository.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Clean Ubuntu 22.04+ VM (or Debian 12) with root access
|
||||
- GitHub account with ability to create repositories
|
||||
- Domain name pointing to the VM (or use IP + skip SSL)
|
||||
- Keboola project with Storage API token (for discovery/sync testing)
|
||||
- Google OAuth credentials (for login testing)
|
||||
|
||||
---
|
||||
|
||||
## Step 0: Create GitHub Repository & Push
|
||||
|
||||
**On your local machine:**
|
||||
|
||||
```bash
|
||||
cd /Users/padak/github/oss-ai-data-analyst
|
||||
|
||||
# Create repo on GitHub (pick org/name)
|
||||
gh repo create YOUR_ORG/ai-data-analyst --private --source=. --push
|
||||
|
||||
# Verify
|
||||
gh repo view YOUR_ORG/ai-data-analyst
|
||||
```
|
||||
|
||||
**Expected:** Repo created, code pushed, visible on GitHub.
|
||||
|
||||
---
|
||||
|
||||
## Step 1: VM Initial Setup
|
||||
|
||||
**On the VM as root:**
|
||||
|
||||
```bash
|
||||
# Clone the repo
|
||||
REPO_URL="git@github.com:YOUR_ORG/ai-data-analyst.git"
|
||||
APP_DIR="/opt/data-analyst"
|
||||
mkdir -p $APP_DIR
|
||||
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N ""
|
||||
# Add deploy key to GitHub repo (Settings -> Deploy keys)
|
||||
|
||||
sudo -u deploy git clone $REPO_URL $APP_DIR/repo
|
||||
|
||||
# Run setup
|
||||
cd $APP_DIR/repo
|
||||
REPO_URL=$REPO_URL bash server/setup.sh
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Command |
|
||||
|---|-------|---------|
|
||||
| 1.1 | Groups created | `getent group data-ops dataread data-private` |
|
||||
| 1.2 | Deploy user exists | `id deploy` |
|
||||
| 1.3 | Directory structure | `ls -la /opt/data-analyst/` |
|
||||
| 1.4 | Python venv works | `/opt/data-analyst/.venv/bin/python -c "import flask; print('OK')"` |
|
||||
| 1.5 | Management scripts | `which add-analyst list-analysts` |
|
||||
|
||||
---
|
||||
|
||||
## Step 2: Webapp Setup
|
||||
|
||||
```bash
|
||||
export SERVER_HOSTNAME="data.yourdomain.com" # or skip SSL with IP
|
||||
bash server/webapp-setup.sh
|
||||
```
|
||||
|
||||
Then edit `/opt/data-analyst/.env`:
|
||||
|
||||
```bash
|
||||
# Required
|
||||
WEBAPP_SECRET_KEY="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
|
||||
GOOGLE_CLIENT_ID="your-google-client-id"
|
||||
GOOGLE_CLIENT_SECRET="your-google-client-secret"
|
||||
SERVER_HOST="YOUR_VM_IP"
|
||||
SERVER_HOSTNAME="data.yourdomain.com"
|
||||
|
||||
# For Keboola discovery/sync
|
||||
KEBOOLA_STORAGE_TOKEN="your-token"
|
||||
KEBOOLA_STACK_URL="https://connection.keboola.com"
|
||||
KEBOOLA_PROJECT_ID="your-project-id"
|
||||
DATA_SOURCE="keboola"
|
||||
DATA_DIR="/data/src_data"
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Command |
|
||||
|---|-------|---------|
|
||||
| 2.1 | Nginx running | `systemctl status nginx` |
|
||||
| 2.2 | Webapp running | `systemctl status webapp` |
|
||||
| 2.3 | SSL cert (if domain) | `curl -I https://data.yourdomain.com/health` |
|
||||
| 2.4 | Health endpoint | `curl http://localhost:5000/health` (or via nginx) |
|
||||
| 2.5 | Login page loads | Browser: `https://data.yourdomain.com/login` |
|
||||
|
||||
---
|
||||
|
||||
## Step 3: Instance Configuration
|
||||
|
||||
```bash
|
||||
cd /opt/data-analyst/repo
|
||||
cp config/instance.yaml.example config/instance.yaml
|
||||
```
|
||||
|
||||
Edit `config/instance.yaml` with:
|
||||
- `instance.name` / `instance.subtitle`
|
||||
- `server.hostname` / `server.host`
|
||||
- `auth.allowed_domain` (your Google domain)
|
||||
- `data_source.type: "keboola"` + keboola settings
|
||||
- `catalog.categories` (at least one, e.g., `crm: {label: "CRM", icon: "crm"}`)
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Command |
|
||||
|---|-------|---------|
|
||||
| 3.1 | Config loads | `cd /opt/data-analyst/repo && .venv/bin/python -c "from config.loader import load_instance_config; print(load_instance_config())"` |
|
||||
| 3.2 | Webapp picks it up | Restart webapp, check login page shows instance name |
|
||||
|
||||
---
|
||||
|
||||
## Step 4: Create Admin Account & Login
|
||||
|
||||
1. Login via Google OAuth in browser
|
||||
2. Register account with SSH key
|
||||
3. Verify the user is admin:
|
||||
|
||||
```bash
|
||||
id YOUR_USERNAME # should be in data-ops or sudo group
|
||||
# If not admin, manually add:
|
||||
usermod -aG data-ops YOUR_USERNAME
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Command |
|
||||
|---|-------|---------|
|
||||
| 4.1 | Google OAuth works | Login via browser |
|
||||
| 4.2 | Account created | `list-analysts` shows your username |
|
||||
| 4.3 | Dashboard loads | Browser: /dashboard shows data stats |
|
||||
| 4.4 | Admin access | Browser: /admin/tables loads (no 403) |
|
||||
|
||||
---
|
||||
|
||||
## Step 5: Test Discovery API (Phase 1)
|
||||
|
||||
In browser, go to `/admin/tables` and click "Discover tables from source".
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 5.1 | Discovery button works | Loading spinner, then tables appear |
|
||||
| 5.2 | Tables grouped by bucket | Buckets shown as collapsible sections |
|
||||
| 5.3 | Table details shown | Name, columns, row count, size for each table |
|
||||
| 5.4 | "Available" badge | All tables show "Available" (none registered yet) |
|
||||
| 5.5 | API direct test | `curl -b cookies.txt https://HOST/api/admin/discover-tables \| jq .total` |
|
||||
|
||||
---
|
||||
|
||||
## Step 6: Test Table Registry (Phase 2)
|
||||
|
||||
### 6a: Register tables via Admin UI
|
||||
|
||||
1. Click "Register" on a table in discovery results
|
||||
2. Fill in: sync_strategy=full_refresh, confirm primary key
|
||||
3. Click "Register Table"
|
||||
4. Repeat for 2-3 more tables (try incremental too)
|
||||
|
||||
### 6b: Verify registry
|
||||
|
||||
```bash
|
||||
# On server
|
||||
cat /data/src_data/metadata/table_registry.json | python3 -m json.tool | head -30
|
||||
|
||||
# Check generated data_description.md
|
||||
head -10 /opt/data-analyst/repo/docs/data_description.md
|
||||
# Should show: <!-- AUTO-GENERATED from table_registry.json -->
|
||||
|
||||
# Check audit log
|
||||
cat /data/src_data/metadata/registry_audit.log
|
||||
```
|
||||
|
||||
### 6c: Test via API
|
||||
|
||||
```bash
|
||||
# List registry
|
||||
curl -b cookies.txt https://HOST/api/admin/registry | jq '.tables | length'
|
||||
|
||||
# Update a table
|
||||
curl -b cookies.txt -X PUT https://HOST/api/admin/registry/in.c-crm.company \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"description": "Updated via API", "version": CURRENT_VERSION}'
|
||||
|
||||
# Delete a table
|
||||
curl -b cookies.txt -X DELETE https://HOST/api/admin/registry/in.c-crm.company \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"version": CURRENT_VERSION}'
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 6.1 | Register table | Success, table appears in registry panel |
|
||||
| 6.2 | Badge changes | Registered tables show green "Registered" badge |
|
||||
| 6.3 | data_description.md | Generated with AUTO-GENERATED header + checksum |
|
||||
| 6.4 | Audit log written | Actions logged with timestamps and emails |
|
||||
| 6.5 | Optimistic locking | Stale version POST returns 409 |
|
||||
| 6.6 | Edit table | PUT changes description/strategy |
|
||||
| 6.7 | Delete table | Table removed, badge reverts to "Available" |
|
||||
|
||||
---
|
||||
|
||||
## Step 7: Test Data Sync + Auto-Profiling (Phase 3)
|
||||
|
||||
```bash
|
||||
cd /opt/data-analyst/repo
|
||||
source .venv/bin/activate
|
||||
|
||||
# Run sync for registered tables
|
||||
python -m src.data_sync
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 7.1 | Sync completes | Tables downloaded, Parquet created |
|
||||
| 7.2 | Schema.yml generated | `cat docs/schema.yml \| head` |
|
||||
| 7.3 | Auto-profiling ran | Log shows "Auto-profiling: N profiled" |
|
||||
| 7.4 | profiles.json exists | `ls -la /data/src_data/metadata/profiles.json` |
|
||||
| 7.5 | Catalog shows profiles | Browser: /catalog -> click table -> profile data loads |
|
||||
|
||||
---
|
||||
|
||||
## Step 8: Test Per-Table Subscriptions (Phase 4)
|
||||
|
||||
### 8a: Via API
|
||||
|
||||
```bash
|
||||
# Get current subscriptions
|
||||
curl -b cookies.txt https://HOST/api/table-subscriptions | jq .
|
||||
|
||||
# Switch to explicit mode, subscribe to specific tables
|
||||
curl -b cookies.txt -X POST https://HOST/api/table-subscriptions \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"table_mode": "explicit",
|
||||
"tables": {"company": true, "contact": true, "events": false}
|
||||
}'
|
||||
```
|
||||
|
||||
### 8b: Via Catalog UI
|
||||
|
||||
1. Go to /catalog
|
||||
2. Tables should show subscription status (all subscribed in "all" mode)
|
||||
3. After switching to "explicit" mode via API, unsubscribed tables should be visually different
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 8.1 | Default is "all" mode | GET returns `table_mode: "all"` |
|
||||
| 8.2 | Switch to explicit | POST succeeds, settings saved |
|
||||
| 8.3 | Config YAML updated | `cat /home/USERNAME/.sync_settings.yaml` shows `table_mode: explicit` |
|
||||
| 8.4 | Catalog reflects subs | Subscribed vs unsubscribed tables visually distinct |
|
||||
|
||||
---
|
||||
|
||||
## Step 9: Test Smart Sync (Phase 5)
|
||||
|
||||
### 9a: Check rsync filter generation
|
||||
|
||||
```bash
|
||||
# After setting explicit subscriptions:
|
||||
cat /home/USERNAME/.sync_rsync_filter
|
||||
# Should show include/exclude rules
|
||||
```
|
||||
|
||||
### 9b: Test from analyst machine
|
||||
|
||||
```bash
|
||||
# On analyst machine (or simulate):
|
||||
bash server/scripts/sync_data.sh --dry-run
|
||||
# Should show filter-based sync when explicit mode is active
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 9.1 | Filter file exists | `.sync_rsync_filter` created in user home |
|
||||
| 9.2 | Correct include/exclude | Subscribed tables included, others excluded |
|
||||
| 9.3 | Dry-run uses filter | `--filter="merge ..."` in rsync output |
|
||||
| 9.4 | Fallback works | Without filter file, syncs everything (backwards compat) |
|
||||
|
||||
---
|
||||
|
||||
## Step 10: Migration Test (One-Time Bootstrap)
|
||||
|
||||
If you already have a `docs/data_description.md` with tables defined:
|
||||
|
||||
```bash
|
||||
python3 -c "
|
||||
from src.table_registry import TableRegistry
|
||||
from pathlib import Path
|
||||
|
||||
registry = TableRegistry.import_from_data_description(
|
||||
Path('docs/data_description.md'),
|
||||
Path('/data/src_data/metadata/table_registry.json'),
|
||||
registered_by='migration@test.com'
|
||||
)
|
||||
print(f'Migrated {len(registry.list_tables())} tables')
|
||||
print(f'Version: {registry.version}')
|
||||
"
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 10.1 | Migration succeeds | All tables imported |
|
||||
| 10.2 | Registry JSON valid | `cat table_registry.json \| python3 -m json.tool` |
|
||||
| 10.3 | migrated_from marker | `"migrated_from": "docs/data_description.md"` in metadata |
|
||||
| 10.4 | Admin UI shows tables | /admin/tables lists all migrated tables |
|
||||
|
||||
---
|
||||
|
||||
## Step 11: Regression Tests
|
||||
|
||||
```bash
|
||||
cd /opt/data-analyst/repo
|
||||
source .venv/bin/activate
|
||||
python -m pytest tests/ -v
|
||||
```
|
||||
|
||||
### Checklist
|
||||
|
||||
| # | Check | Expected |
|
||||
|---|-------|----------|
|
||||
| 11.1 | All tests pass | 132+ tests, 0 failures |
|
||||
| 11.2 | No import errors | All modules load cleanly |
|
||||
|
||||
---
|
||||
|
||||
## Quick Smoke Test Script
|
||||
|
||||
Run this after full setup to verify the critical path:
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# smoke_test.sh - Quick verification of self-service onboarding
|
||||
set -e
|
||||
|
||||
APP_DIR="/opt/data-analyst/repo"
|
||||
cd "$APP_DIR"
|
||||
source .venv/bin/activate
|
||||
|
||||
echo "=== Smoke Test ==="
|
||||
|
||||
# 1. Tests
|
||||
echo "[1/5] Running tests..."
|
||||
python -m pytest tests/ -q --tb=short
|
||||
echo " PASS"
|
||||
|
||||
# 2. Registry module
|
||||
echo "[2/5] Testing Table Registry..."
|
||||
python -c "
|
||||
from src.table_registry import TableRegistry
|
||||
from pathlib import Path
|
||||
import tempfile
|
||||
r = TableRegistry(Path(tempfile.mktemp(suffix='.json')))
|
||||
r.register_table({'id': 'test.t', 'name': 't', 'primary_key': 'id', 'sync_strategy': 'full_refresh'}, 'test')
|
||||
assert r.is_registered('test.t')
|
||||
r.unregister_table('test.t')
|
||||
assert not r.is_registered('test.t')
|
||||
print(' PASS')
|
||||
"
|
||||
|
||||
# 3. Discovery (needs Keboola credentials)
|
||||
echo "[3/5] Testing Discovery API..."
|
||||
python -c "
|
||||
try:
|
||||
from src.data_sync import create_data_source
|
||||
ds = create_data_source()
|
||||
tables = ds.discover_tables()
|
||||
print(f' PASS - Discovered {len(tables)} tables')
|
||||
except Exception as e:
|
||||
print(f' SKIP - {e}')
|
||||
"
|
||||
|
||||
# 4. Profiler API
|
||||
echo "[4/5] Testing Profiler API..."
|
||||
python -c "
|
||||
from src.profiler import profile_changed_tables
|
||||
result = profile_changed_tables([])
|
||||
assert result == {'success': 0, 'errors': 0, 'skipped': 0}
|
||||
print(' PASS')
|
||||
"
|
||||
|
||||
# 5. Webapp imports
|
||||
echo "[5/5] Testing Webapp imports..."
|
||||
python -c "
|
||||
from webapp.auth import admin_required, login_required
|
||||
from webapp.sync_settings_service import get_table_subscriptions, generate_rsync_filter
|
||||
from src.table_registry import TableRegistry, ConflictError
|
||||
print(' PASS')
|
||||
"
|
||||
|
||||
echo ""
|
||||
echo "=== All smoke tests passed ==="
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
| Problem | Fix |
|
||||
|---------|-----|
|
||||
| `/admin/tables` returns 403 | User not in `data-ops` group. Run `usermod -aG data-ops USERNAME` |
|
||||
| Discovery returns empty | Check `KEBOOLA_STORAGE_TOKEN` in `.env`, verify `DATA_SOURCE=keboola` |
|
||||
| Profiles not generated | Check `/data/src_data/parquet/` has parquet files, check DuckDB installed |
|
||||
| Rsync filter not created | Check `sudo` permissions for `www-data` in sudoers-webapp |
|
||||
| `data_description.md` not updating | Check write permissions on `docs/` directory |
|
||||
| Webapp won't start | Check `journalctl -u webapp -n 50` for errors |
|
||||
Loading…
Reference in a new issue