docs

2026-03-10 11:43:11 +01:00 · 2026-03-10 11:43:11 +01:00 · 879bc6c44f
commit 879bc6c44f
parent 495940d6b8
2 changed files with 939 additions and 0 deletions
--- a/docs/architecture.md
+++ b/docs/architecture.md
@ -0,0 +1,511 @@
+# Architecture — Detailed Reference
+
+Comprehensive architectural overview of the OSS AI Data Analyst platform.
+For a concise summary, see [../ARCHITECTURE.md](../ARCHITECTURE.md).
+
+## Top-Level Module Map
+
+```
+oss-ai-data-analyst/
+├── src/                  Core engine (config, sync, parquet, profiling)
+├── connectors/           Pluggable data connectors (keboola, jira)
+├── auth/                 Pluggable auth providers (google, password, desktop)
+├── services/             Standalone background services
+├── webapp/               Flask web portal (dashboard, catalog, API)
+├── server/               Server deployment (setup, deploy, nginx, systemd)
+├── scripts/              Analyst-side utility scripts (sync, DuckDB, dev server)
+├── config/               Instance configuration (loader, templates)
+├── examples/             Example notification scripts
+├── tests/                Test suite
+├── dev_docs/             Internal development documentation
+└── docs/                 User-facing documentation
+```
+
+## Block Diagram
+
+```
+┌─────────────────────────────────────────────────────────────────────────────┐
+│                         EXTERNAL DATA SOURCES                               │
+│   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐              │
+│   │ Keboola  │   │   Jira   │   │   CSV    │   │ BigQuery │              │
+│   │ Storage  │   │  Cloud   │   │  (plan)  │   │  (plan)  │              │
+│   └────┬─────┘   └────┬─────┘   └──────────┘   └──────────┘              │
+└────────┼──────────────┼────────────────────────────────────────────────────┘
+         │              │
+         ▼              ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  CONNECTORS  (connectors/)                  auto-discovery via importlib    │
+│                                                                             │
+│  ┌──────────────────────────┐    ┌─────────────────────────────────────┐   │
+│  │ connectors/keboola/       │    │ connectors/jira/                    │   │
+│  │                           │    │                                     │   │
+│  │ adapter.py                │    │ webhook.py    Flask blueprint       │   │
+│  │  KeboolaDataSource (ABC)  │    │ service.py    Jira REST API client │   │
+│  │  full/incr/partitioned    │    │ transform.py  JSON -> 6 Parquet tbl│   │
+│  │                           │    │ incremental_transform.py  realtime │   │
+│  │ client.py                 │    │ file_lock.py  POSIX advisory locks │   │
+│  │  Keboola Storage API      │    │                                     │   │
+│  │  type mapping + cache     │    │ scripts/  backfill, SLA poll,      │   │
+│  │                           │    │           consistency check         │   │
+│  │ tests/                    │    │ systemd/  jira-sla-poll,           │   │
+│  └──────────────────────────┘    │           jira-consistency          │   │
+│                                   │ tests/                              │   │
+│  Registry: src/data_sync.py       └─────────────────────────────────────┘   │
+│  create_data_source(type) ->                                                │
+│    importlib("connectors.{type}.adapter")                                   │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         ▼  Parquet files
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  CORE ENGINE  (src/)                                                        │
+│                                                                             │
+│  ┌─────────────────────┐  ┌──────────────────┐  ┌──────────────────────┐  │
+│  │ data_sync.py         │  │ config.py         │  │ profiler.py          │  │
+│  │  DataSource ABC      │  │  data_description │  │  Parquet -> stats    │  │
+│  │  SyncState (JSON)    │  │  .md parser       │  │  alerts, sampling    │  │
+│  │  DataSyncManager     │  │  TableConfig      │  │  -> profiles.json    │  │
+│  │  create_data_source()│  │  WhereFilter      │  └──────────────────────┘  │
+│  └─────────────────────┘  │  ForeignKey        │                            │
+│                            │  get_config()      │  ┌──────────────────────┐  │
+│                            └──────────────────┘  │ parquet_manager.py    │  │
+│                                                   │  CSV->Parquet, merge  │  │
+│                                                   │  upsert, schema       │  │
+│                                                   └──────────────────────┘  │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         │  /data/src_data/parquet/
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  AUTH PROVIDERS  (auth/)                    auto-discovery via scan         │
+│                                                                             │
+│  ┌────────────────┐  ┌────────────────┐  ┌──────────────────────┐         │
+│  │ auth/google/    │  │ auth/password/ │  │ auth/desktop/         │        │
+│  │                 │  │                │  │                       │        │
+│  │ Google OAuth    │  │ Email+password │  │ JWT for desktop app   │        │
+│  │ SSO (Authlib)   │  │ Argon2 hash   │  │ visible=False         │        │
+│  │ domain restrict │  │ SendGrid email │  │ (API-only, not login) │        │
+│  │ order=10        │  │ order=20       │  │ order=100             │        │
+│  └────────────────┘  └────────────────┘  └──────────────────────┘         │
+│                                                                             │
+│  ABC: AuthProvider (get_name, get_blueprint, get_login_button, is_avail.)  │
+│  Discovery: discover_providers() -> scans auth/*/provider.py               │
+│  Contract: all providers set session["user"] = {email, name, picture}      │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         │  Blueprints registered in Flask app
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  WEB PORTAL  (webapp/)                                                      │
+│                                                                             │
+│  ┌───────────────────┐  ┌──────────────────────────────────────────────┐   │
+│  │ app.py (Flask)     │  │  Pages                                      │   │
+│  │ - discover auth    │  │  /dashboard     - account, stats, setup     │   │
+│  │   providers        │  │  /catalog       - data catalog + profiles   │   │
+│  │ - register         │  │  /corporate-memory - knowledge + voting     │   │
+│  │   blueprints       │  │  /activity-center - intelligence overview   │   │
+│  │ - inject_config()  │  └──────────────────────────────────────────────┘   │
+│  │ - routes           │                                                     │
+│  └───────────────────┘  ┌──────────────────────────────────────────────┐   │
+│                          │  API Endpoints                               │   │
+│  ┌───────────────────┐  │  /webhooks/jira     (HMAC, -> jira connector)│   │
+│  │ webapp services    │  │  /api/telegram/*    (link/unlink/status)     │   │
+│  │ user_service       │  │  /api/desktop/*     (JWT, scripts, run)     │   │
+│  │ account_service    │  │  /api/sync-settings (GET/POST)              │   │
+│  │ sync_settings_svc  │  │  /api/corporate-memory/* (CRUD, votes)      │   │
+│  │ telegram_service   │  │  /api/catalog/profile/<table>               │   │
+│  │ email_service      │  │  /health            (service health)        │   │
+│  │ health_service     │  └──────────────────────────────────────────────┘   │
+│  │ corporate_memory   │                                                     │
+│  └───────────────────┘  Config chain: instance.yaml -> loader -> Config -> │
+│                          inject_config() -> {{ config.X }} in Jinja        │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         │
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  BACKGROUND SERVICES  (services/)        each = __main__.py + systemd      │
+│                                                                             │
+│  ┌────────────────────────┐  ┌─────────────────────────────────────────┐   │
+│  │ services/telegram_bot/  │  │ services/ws_gateway/                    │   │
+│  │                         │  │                                         │   │
+│  │ bot.py     polling +    │  │ gateway.py   WebSocket TCP:8765        │   │
+│  │            HTTP socket  │  │              + HTTP dispatch socket     │   │
+│  │ runner.py  script exec  │  │ auth.py      JWT validation            │   │
+│  │ sender.py  msg dispatch │  │ config.py    gateway config            │   │
+│  │ dispatch.py -> WS gw   │  │                                         │   │
+│  │ storage.py  JSON state  │  │ Heartbeat: ping/pong, 3 miss = drop   │   │
+│  │ status.py  /status cmd  │  │ Per-user connection limit (5)          │   │
+│  │                         │  │                                         │   │
+│  │ Always running (systemd)│  │ Always running (systemd)               │   │
+│  └────────────────────────┘  └─────────────────────────────────────────┘   │
+│                                                                             │
+│  ┌────────────────────────┐  ┌─────────────────────────────────────────┐   │
+│  │ services/               │  │ services/                               │   │
+│  │   corporate_memory/     │  │   session_collector/                    │   │
+│  │                         │  │                                         │   │
+│  │ collector.py            │  │ collector.py                            │   │
+│  │  Scans CLAUDE.local.md  │  │  Copies .jsonl from user homes         │   │
+│  │  -> Claude Haiku -> JSON│  │  to /data/user_sessions/               │   │
+│  │  MD5 change detection   │  │  Idempotent, atomic writes             │   │
+│  │ prompts.py              │  │                                         │   │
+│  │  LLM prompts for        │  │ Timer: every 6 hours                   │   │
+│  │  knowledge extraction   │  │                                         │   │
+│  │                         │  │                                         │   │
+│  │ Timer: every 30 min     │  │                                         │   │
+│  └────────────────────────┘  └─────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         │  Unix sockets + /data/ filesystem
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  SERVER INFRASTRUCTURE  (server/)                                           │
+│                                                                             │
+│  ┌──────────────────┐  ┌────────────────────┐  ┌───────────────────────┐  │
+│  │ Deployment        │  │ User Management     │  │ Web Server            │  │
+│  │ setup.sh          │  │ bin/add-analyst     │  │ webapp-nginx.conf     │  │
+│  │ deploy.sh (CI/CD) │  │ bin/list-analysts   │  │ webapp.service        │  │
+│  │ webapp-setup.sh   │  │ bin/notify-runner   │  │ SSL (Let's Encrypt)   │  │
+│  │ sudoers rules     │  │ bin/notify-scripts  │  │ Gunicorn + Unix sock  │  │
+│  └──────────────────┘  └────────────────────┘  └───────────────────────┘  │
+│                                                                             │
+│  Groups: dataread (analysts) | data-private (privileged) | data-ops (admin) │
+│                                                                             │
+│  /data/                                                                     │
+│  ├── src_data/parquet/          shared data (readonly for analysts)         │
+│  ├── src_data/metadata/         sync_state.json, profiles.json             │
+│  ├── src_data/raw/jira/         webhook JSON, attachments                  │
+│  ├── docs/ , scripts/           documentation, helper scripts              │
+│  ├── notifications/             telegram_users, desktop_users, codes       │
+│  ├── corporate-memory/          knowledge.json, votes.json                 │
+│  └── user_sessions/             centralized Claude Code transcripts        │
+└─────────────────────────────────────────────────────────────────────────────┘
+         │
+         │  rsync (SSH) - scripts/sync_data.sh (bi-directional)
+         ▼
+┌─────────────────────────────────────────────────────────────────────────────┐
+│  ANALYST WORKSTATION  (local)                                               │
+│                                                                             │
+│  server/   (read-only, rsynced from broker)                                │
+│  ├── parquet/, docs/, scripts/, metadata/                                  │
+│                                                                             │
+│  user/     (writable workspace, backed up to server)                       │
+│  ├── duckdb/analytics.duckdb    SQL views over parquet                     │
+│  ├── notifications/*.py         custom notification scripts                │
+│  ├── sessions/                  Claude Code transcripts                    │
+│  └── artifacts/                 analysis outputs                           │
+│                                                                             │
+│  .claude/rules/                 corporate memory knowledge rules           │
+│                                                                             │
+│  Claude Code <- local analysis over DuckDB + Parquet                       │
+└─────────────────────────────────────────────────────────────────────────────┘
+```
+
+## Auto-Discovery Patterns
+
+The platform uses three symmetrical auto-discovery mechanisms. Adding a new
+connector, auth method, or service requires no changes to existing code.
+
+### 1. Connector Discovery (`src/data_sync.py`)
+
+```
+config/instance.yaml -> data_source.type: "keboola"
+  -> importlib.import_module("connectors.keboola.adapter")
+  -> KeboolaDataSource (implements DataSource ABC)
+```
+
+- Factory: `create_data_source(type)` in `src/data_sync.py`
+- Connectors live in `connectors/{name}/adapter.py`
+- Must export a `DataSource` subclass or a `create_data_source()` factory function
+- Keboola is hard-coded for ImportError handling; all others use dynamic import
+
+### 2. Auth Provider Discovery (`auth/__init__.py`)
+
+```
+startup -> scan auth/*/provider.py
+  -> import `provider` instance
+  -> filter by is_available() (checks env vars)
+  -> register blueprint + login button in Flask
+```
+
+- ABC: `AuthProvider` with methods `get_name()`, `get_blueprint()`, `get_login_button()`, `is_available()`, `init_app()`
+- Session contract: all providers set `session["user"] = {email, name, picture}`
+- Login page renders buttons dynamically, sorted by `order` field
+
+### 3. Service Pattern (`services/*/__main__.py`)
+
+```
+python -m services.<name>       # entry point
+services/<name>/systemd/        # unit files
+deploy.sh auto-discovers        # systemd/* in each service dir
+```
+
+- Each service is self-contained: code, systemd units, and config in one directory
+- `deploy.sh` scans `services/*/systemd/*.service` and `connectors/*/systemd/*.service`
+- Long-running services (telegram_bot, ws_gateway) use async dual-server model
+- Periodic services (corporate_memory, session_collector) are systemd timer oneshots
+
+## Data Flows
+
+### Pull Sync (Keboola)
+
+```
+Keboola Storage API
+  -> connectors/keboola/client.py  (export CSV with filters)
+  -> src/parquet_manager.py        (convert to typed Parquet)
+  -> /data/src_data/parquet/       (stored on broker)
+  -> rsync to analyst              (scripts/sync_data.sh)
+  -> DuckDB views                  (scripts/setup_views.sh)
+```
+
+Sync strategies: `full_refresh`, `incremental`, `partitioned`, `chunked_initial_load`.
+
+### Push Sync (Jira)
+
+```
+Jira Cloud webhook (issue created/updated/deleted)
+  -> connectors/jira/webhook.py        (HMAC-SHA256 verification)
+  -> connectors/jira/service.py        (fetch full issue + attachments)
+  -> /data/src_data/raw/jira/issues/   (atomic JSON write)
+  -> connectors/jira/incremental_transform.py (update monthly Parquet)
+  -> /data/src_data/parquet/jira/      (6 tables: issues, comments,
+                                         attachments, changelog,
+                                         issuelinks, remote_links)
+```
+
+Background jobs supplement the webhook pipeline:
+- `jira-sla-poll` (every 5 min): refreshes SLA fields for open tickets
+- `jira-consistency` (every 6h): detects and backfills missing issues
+
+### Notification Pipeline
+
+```
+~/user/notifications/*.py             analyst's custom scripts
+  -> server/bin/notify-runner         (cron, executes with timeout)
+  -> cooldown check                   (~/.notifications/state/)
+  ├-> services/telegram_bot/          (Unix socket /run/notify-bot/bot.sock)
+  │     -> Telegram chat message (text or photo)
+  └-> services/ws_gateway/            (Unix socket /run/ws-gateway/ws.sock)
+        -> WebSocket push to desktop app
+```
+
+Script output format:
+```json
+{
+  "notify": true,
+  "title": "Revenue dropped 25%",
+  "message": "Details...",
+  "cooldown": "6h",
+  "image_path": "/tmp/chart.png"
+}
+```
+
+### Knowledge Loop (Corporate Memory)
+
+```
+Analyst writes CLAUDE.local.md        (insights, patterns, tips)
+  -> scripts/sync_data.sh             (uploads to server)
+  -> services/corporate_memory/       (timer, every 30 min)
+     -> MD5 change detection
+     -> Claude Haiku extracts knowledge items
+     -> /data/corporate-memory/knowledge.json
+  -> webapp /corporate-memory          (voting UI: upvote/downvote)
+  -> scripts/sync_data.sh             (downloads to analyst)
+  -> .claude/rules/                   (rules for Claude Code)
+  -> Claude Code uses rules in next session
+```
+
+## Module Reference
+
+### Core Engine (`src/`)
+
+| File | Lines | Responsibility |
+|------|-------|----------------|
+| `data_sync.py` | ~1400 | `DataSource` ABC, `SyncState`, `DataSyncManager`, connector factory |
+| `config.py` | ~600 | Parse `data_description.md` YAML blocks, `TableConfig`, `WhereFilter`, `ForeignKey` |
+| `parquet_manager.py` | ~750 | CSV-to-Parquet conversion, merge, upsert, schema enforcement |
+| `profiler.py` | ~1200 | Data profiling: stats, alerts, type classification -> `profiles.json` |
+
+### Connectors (`connectors/`)
+
+| Module | Files | Sync Model | Description |
+|--------|-------|------------|-------------|
+| `keboola/` | adapter.py, client.py, tests/ | Pull (DataSource ABC) | Keboola Storage API, type mapping, metadata caching (24h TTL) |
+| `jira/` | webhook.py, service.py, transform.py, incremental_transform.py, file_lock.py, scripts/, systemd/, tests/ | Push (webhook) | Real-time webhook pipeline, SLA polling, consistency monitoring, 6 output Parquet tables |
+
+### Auth Providers (`auth/`)
+
+| Provider | Available when | Login UI | Order | Description |
+|----------|---------------|----------|-------|-------------|
+| `google/` | `GOOGLE_CLIENT_ID` set | Yes | 10 | Google OAuth SSO with domain restriction |
+| `password/` | `SENDGRID_API_KEY` set | Yes | 20 | Email + password for external users (Argon2, rate limiting) |
+| `desktop/` | `DESKTOP_JWT_SECRET` set | No (API-only) | 100 | JWT tokens for native desktop app |
+
+### Background Services (`services/`)
+
+| Service | Type | Schedule | Description |
+|---------|------|----------|-------------|
+| `telegram_bot/` | Long-running | Always on | Telegram polling + HTTP dispatch socket, script execution, /status /test commands |
+| `ws_gateway/` | Long-running | Always on | WebSocket TCP:8765 + HTTP dispatch socket, JWT auth, heartbeat |
+| `corporate_memory/` | Timer oneshot | Every 30 min | AI knowledge extraction from CLAUDE.local.md via Claude Haiku |
+| `session_collector/` | Timer oneshot | Every 6 hours | Copy session .jsonl from user homes to central storage |
+
+### Web Portal (`webapp/`)
+
+| File | Responsibility |
+|------|----------------|
+| `app.py` | Flask factory, blueprint registration, route definitions, context processors |
+| `config.py` | Load `instance.yaml`, expose `Config` to templates |
+| `auth.py` | Core auth infrastructure: `login_required`, `validate_email_domain`, `/login`, `/logout` |
+| `user_service.py` | Username derivation, SSH key validation, system account creation |
+| `account_service.py` | Dashboard account widget data, cron info, sync status |
+| `sync_settings_service.py` | Per-user dataset sync preferences |
+| `telegram_service.py` | Telegram account linking/unlinking |
+| `desktop_auth.py` | JWT generation/validation, desktop app link state |
+| `password_auth.py` | Password auth implementation (Argon2, rate limiting, token workflow) |
+| `email_service.py` | SendGrid integration for setup/reset emails |
+| `corporate_memory_service.py` | Knowledge CRUD, voting, user rules regeneration |
+| `health_service.py` | System health checks (services, timers, disk, load, webhooks) |
+| `notification_images.py` | Serve chart PNGs generated by notification runner |
+| `utils/metric_parser.py` | Parse business metric YAML definitions for catalog UI |
+
+### Server Infrastructure (`server/`)
+
+| File | Responsibility |
+|------|----------------|
+| `setup.sh` | Initial server bootstrap (groups, users, directories, venv) |
+| `deploy.sh` | CI/CD deployment (git pull, deps, scripts, services, ACLs) |
+| `webapp-setup.sh` | Nginx + SSL + Gunicorn setup |
+| `webapp-nginx.conf` | Nginx reverse proxy config (HTTPS, WebSocket upgrade) |
+| `webapp.service` | Systemd unit for Gunicorn |
+| `sudoers-deploy` | Sudo rules for deploy user (least-privilege) |
+| `sudoers-webapp` | Sudo rules for www-data |
+| `bin/add-analyst` | Create analyst user with workspace structure |
+| `bin/list-analysts` | List registered analysts |
+| `bin/notify-runner` | Execute user notification scripts, dispatch to bot + gateway |
+| `bin/notify-scripts` | List/run notification scripts for a user |
+
+### Analyst Scripts (`scripts/`)
+
+| File | Responsibility |
+|------|----------------|
+| `sync_data.sh` | Bi-directional rsync: download data, upload workspace, refresh DuckDB |
+| `setup_views.sh` | Create/replace DuckDB views over all Parquet files |
+| `duckdb_manager.py` | DuckDB setup utility |
+| `dev_run.py` | Development server with auth bypass |
+| `collect_session.py` | Session transcript collector (used by service) |
+| `generate_user_sync_configs.py` | Generate per-user sync config files |
+
+## Analyst Workspace Layout
+
+Created by `server/bin/add-analyst` for each registered user:
+
+```
+/home/{username}/
+├── server/                    read-only symlinks to shared data
+│   ├── parquet/               -> /data/src_data/parquet
+│   ├── docs/                  -> /data/docs
+│   ├── scripts/               -> /data/scripts
+│   ├── metadata/              -> /data/src_data/metadata
+│   └── jira_attachments/      -> /data/src_data/raw/jira/attachments
+├── user/                      writable workspace (backed up to server)
+│   ├── duckdb/                local DuckDB database
+│   ├── notifications/         custom notification scripts (*.py)
+│   ├── artifacts/             analysis outputs
+│   ├── scripts/               user helper scripts
+│   ├── parquet/               user Parquet files
+│   └── sessions/              Claude Code session transcripts
+├── .notifications/            notification runner state
+│   ├── state/                 cooldown tracking (JSON per script)
+│   └── logs/                  runner logs
+└── .claude/
+    └── rules/                 corporate memory knowledge rules (auto-synced)
+```
+
+## Security Model
+
+### System Groups
+
+| Group | Access |
+|-------|--------|
+| `data-ops` | Full admin access to all server resources |
+| `dataread` | Read access to public Parquet data |
+| `data-private` | Read access to sensitive/restricted data |
+
+### Authentication Layers
+
+| Layer | Mechanism | Scope |
+|-------|-----------|-------|
+| Web portal | Google OAuth / email+password | Browser sessions |
+| Desktop app | JWT Bearer tokens | API endpoints (`/api/desktop/*`) |
+| Jira webhook | HMAC-SHA256 signature | Webhook endpoint |
+| SSH access | Key-based auth only | Data sync (rsync) |
+| Inter-service | Unix socket permissions | Bot, gateway, webapp |
+
+### Permission Boundaries
+
+- Analysts cannot access other users' home directories
+- Webapp (www-data) uses sudoers-whitelisted commands for user operations
+- Deploy user has explicit sudo rules for service management
+- Staging directory (`/tmp/data_analyst_staging`) uses setgid for group ownership
+- All JSON state files written atomically: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()`
+
+## Configuration Chain
+
+```
+config/instance.yaml                     (instance-specific, not committed)
+    | loaded by config/loader.py
+    | ${ENV_VAR} references resolved from .env / environment
+    v
+webapp/config.py                         (Flask Config class)
+    | _load_instance_config() at module level
+    | _get(config, *keys) for safe nested access
+    v
+inject_config() context processor       (exposes Config to templates)
+    v
+{{ config.INSTANCE_NAME }} in Jinja2    (all templates have access)
+```
+
+Validation: `config/loader.py` checks required fields at startup (`instance.name`,
+`auth.allowed_domain`, `server.host`, `server.hostname`, `auth.webapp_secret_key`).
+Missing required fields cause immediate startup failure with a clear error message.
+
+## Server Filesystem Layout
+
+```
+/opt/data-analyst/
+├── repo/                    git repository (deployed via CI/CD)
+├── .venv/                   Python virtual environment
+├── logs/                    application logs
+└── .env                     secrets (mode 0640)
+
+/data/
+├── src_data/
+│   ├── parquet/             shared Parquet files (readonly for analysts)
+│   ├── metadata/            sync_state.json, profiles.json, table_metadata.json
+│   └── raw/jira/            webhook JSON files, attachments
+├── docs/                    documentation and schema
+├── scripts/                 helper scripts synced to analysts
+├── notifications/           telegram_users.json, desktop_users.json, pending_codes.json
+├── corporate-memory/        knowledge.json, votes.json, user_hashes.json
+└── user_sessions/           centralized Claude Code session transcripts
+
+/run/
+├── notify-bot/bot.sock      Telegram bot HTTP socket
+├── ws-gateway/ws.sock       WebSocket gateway HTTP socket
+└── webapp/webapp.sock       Gunicorn WSGI socket
+```
+
+## CI/CD
+
+### Deploy Guard (`.github/workflows/deploy-guard.yml`)
+
+Runs on every pull request:
+1. `pytest tests/test_deploy_guard.py` - validates deploy.sh/sudoers/systemd consistency
+2. `pytest tests/test_sync_data.py -m "not live"` - validates sync script reliability
+3. `visudo -cf server/sudoers-*` - validates sudoers syntax in Docker
+
+### Deployment (`.github/workflows/deploy.yml.example`)
+
+Runs on push to main (or manual trigger):
+1. SSH into server
+2. Execute `server/deploy.sh` (git pull, deps, scripts, services, ACLs)
--- a/docs/testing/vm_test_plan.md
+++ b/docs/testing/vm_test_plan.md
@ -0,0 +1,428 @@
+# VM Test Plan - Self-Service Data Onboarding
+
+End-to-end test of the full platform on a clean VM with a new GitHub repository.
+
+## Prerequisites
+
+- Clean Ubuntu 22.04+ VM (or Debian 12) with root access
+- GitHub account with ability to create repositories
+- Domain name pointing to the VM (or use IP + skip SSL)
+- Keboola project with Storage API token (for discovery/sync testing)
+- Google OAuth credentials (for login testing)
+
+---
+
+## Step 0: Create GitHub Repository & Push
+
+**On your local machine:**
+
+```bash
+cd /Users/padak/github/oss-ai-data-analyst
+
+# Create repo on GitHub (pick org/name)
+gh repo create YOUR_ORG/ai-data-analyst --private --source=. --push
+
+# Verify
+gh repo view YOUR_ORG/ai-data-analyst
+```
+
+**Expected:** Repo created, code pushed, visible on GitHub.
+
+---
+
+## Step 1: VM Initial Setup
+
+**On the VM as root:**
+
+```bash
+# Clone the repo
+REPO_URL="git@github.com:YOUR_ORG/ai-data-analyst.git"
+APP_DIR="/opt/data-analyst"
+mkdir -p $APP_DIR
+ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N ""
+# Add deploy key to GitHub repo (Settings -> Deploy keys)
+
+sudo -u deploy git clone $REPO_URL $APP_DIR/repo
+
+# Run setup
+cd $APP_DIR/repo
+REPO_URL=$REPO_URL bash server/setup.sh
+```
+
+### Checklist
+
+| # | Check | Command |
+|---|-------|---------|
+| 1.1 | Groups created | `getent group data-ops dataread data-private` |
+| 1.2 | Deploy user exists | `id deploy` |
+| 1.3 | Directory structure | `ls -la /opt/data-analyst/` |
+| 1.4 | Python venv works | `/opt/data-analyst/.venv/bin/python -c "import flask; print('OK')"` |
+| 1.5 | Management scripts | `which add-analyst list-analysts` |
+
+---
+
+## Step 2: Webapp Setup
+
+```bash
+export SERVER_HOSTNAME="data.yourdomain.com"  # or skip SSL with IP
+bash server/webapp-setup.sh
+```
+
+Then edit `/opt/data-analyst/.env`:
+
+```bash
+# Required
+WEBAPP_SECRET_KEY="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
+GOOGLE_CLIENT_ID="your-google-client-id"
+GOOGLE_CLIENT_SECRET="your-google-client-secret"
+SERVER_HOST="YOUR_VM_IP"
+SERVER_HOSTNAME="data.yourdomain.com"
+
+# For Keboola discovery/sync
+KEBOOLA_STORAGE_TOKEN="your-token"
+KEBOOLA_STACK_URL="https://connection.keboola.com"
+KEBOOLA_PROJECT_ID="your-project-id"
+DATA_SOURCE="keboola"
+DATA_DIR="/data/src_data"
+```
+
+### Checklist
+
+| # | Check | Command |
+|---|-------|---------|
+| 2.1 | Nginx running | `systemctl status nginx` |
+| 2.2 | Webapp running | `systemctl status webapp` |
+| 2.3 | SSL cert (if domain) | `curl -I https://data.yourdomain.com/health` |
+| 2.4 | Health endpoint | `curl http://localhost:5000/health` (or via nginx) |
+| 2.5 | Login page loads | Browser: `https://data.yourdomain.com/login` |
+
+---
+
+## Step 3: Instance Configuration
+
+```bash
+cd /opt/data-analyst/repo
+cp config/instance.yaml.example config/instance.yaml
+```
+
+Edit `config/instance.yaml` with:
+- `instance.name` / `instance.subtitle`
+- `server.hostname` / `server.host`
+- `auth.allowed_domain` (your Google domain)
+- `data_source.type: "keboola"` + keboola settings
+- `catalog.categories` (at least one, e.g., `crm: {label: "CRM", icon: "crm"}`)
+
+### Checklist
+
+| # | Check | Command |
+|---|-------|---------|
+| 3.1 | Config loads | `cd /opt/data-analyst/repo && .venv/bin/python -c "from config.loader import load_instance_config; print(load_instance_config())"` |
+| 3.2 | Webapp picks it up | Restart webapp, check login page shows instance name |
+
+---
+
+## Step 4: Create Admin Account & Login
+
+1. Login via Google OAuth in browser
+2. Register account with SSH key
+3. Verify the user is admin:
+
+```bash
+id YOUR_USERNAME           # should be in data-ops or sudo group
+# If not admin, manually add:
+usermod -aG data-ops YOUR_USERNAME
+```
+
+### Checklist
+
+| # | Check | Command |
+|---|-------|---------|
+| 4.1 | Google OAuth works | Login via browser |
+| 4.2 | Account created | `list-analysts` shows your username |
+| 4.3 | Dashboard loads | Browser: /dashboard shows data stats |
+| 4.4 | Admin access | Browser: /admin/tables loads (no 403) |
+
+---
+
+## Step 5: Test Discovery API (Phase 1)
+
+In browser, go to `/admin/tables` and click "Discover tables from source".
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 5.1 | Discovery button works | Loading spinner, then tables appear |
+| 5.2 | Tables grouped by bucket | Buckets shown as collapsible sections |
+| 5.3 | Table details shown | Name, columns, row count, size for each table |
+| 5.4 | "Available" badge | All tables show "Available" (none registered yet) |
+| 5.5 | API direct test | `curl -b cookies.txt https://HOST/api/admin/discover-tables \| jq .total` |
+
+---
+
+## Step 6: Test Table Registry (Phase 2)
+
+### 6a: Register tables via Admin UI
+
+1. Click "Register" on a table in discovery results
+2. Fill in: sync_strategy=full_refresh, confirm primary key
+3. Click "Register Table"
+4. Repeat for 2-3 more tables (try incremental too)
+
+### 6b: Verify registry
+
+```bash
+# On server
+cat /data/src_data/metadata/table_registry.json | python3 -m json.tool | head -30
+
+# Check generated data_description.md
+head -10 /opt/data-analyst/repo/docs/data_description.md
+# Should show: <!-- AUTO-GENERATED from table_registry.json -->
+
+# Check audit log
+cat /data/src_data/metadata/registry_audit.log
+```
+
+### 6c: Test via API
+
+```bash
+# List registry
+curl -b cookies.txt https://HOST/api/admin/registry | jq '.tables | length'
+
+# Update a table
+curl -b cookies.txt -X PUT https://HOST/api/admin/registry/in.c-crm.company \
+  -H "Content-Type: application/json" \
+  -d '{"description": "Updated via API", "version": CURRENT_VERSION}'
+
+# Delete a table
+curl -b cookies.txt -X DELETE https://HOST/api/admin/registry/in.c-crm.company \
+  -H "Content-Type: application/json" \
+  -d '{"version": CURRENT_VERSION}'
+```
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 6.1 | Register table | Success, table appears in registry panel |
+| 6.2 | Badge changes | Registered tables show green "Registered" badge |
+| 6.3 | data_description.md | Generated with AUTO-GENERATED header + checksum |
+| 6.4 | Audit log written | Actions logged with timestamps and emails |
+| 6.5 | Optimistic locking | Stale version POST returns 409 |
+| 6.6 | Edit table | PUT changes description/strategy |
+| 6.7 | Delete table | Table removed, badge reverts to "Available" |
+
+---
+
+## Step 7: Test Data Sync + Auto-Profiling (Phase 3)
+
+```bash
+cd /opt/data-analyst/repo
+source .venv/bin/activate
+
+# Run sync for registered tables
+python -m src.data_sync
+```
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 7.1 | Sync completes | Tables downloaded, Parquet created |
+| 7.2 | Schema.yml generated | `cat docs/schema.yml \| head` |
+| 7.3 | Auto-profiling ran | Log shows "Auto-profiling: N profiled" |
+| 7.4 | profiles.json exists | `ls -la /data/src_data/metadata/profiles.json` |
+| 7.5 | Catalog shows profiles | Browser: /catalog -> click table -> profile data loads |
+
+---
+
+## Step 8: Test Per-Table Subscriptions (Phase 4)
+
+### 8a: Via API
+
+```bash
+# Get current subscriptions
+curl -b cookies.txt https://HOST/api/table-subscriptions | jq .
+
+# Switch to explicit mode, subscribe to specific tables
+curl -b cookies.txt -X POST https://HOST/api/table-subscriptions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "table_mode": "explicit",
+    "tables": {"company": true, "contact": true, "events": false}
+  }'
+```
+
+### 8b: Via Catalog UI
+
+1. Go to /catalog
+2. Tables should show subscription status (all subscribed in "all" mode)
+3. After switching to "explicit" mode via API, unsubscribed tables should be visually different
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 8.1 | Default is "all" mode | GET returns `table_mode: "all"` |
+| 8.2 | Switch to explicit | POST succeeds, settings saved |
+| 8.3 | Config YAML updated | `cat /home/USERNAME/.sync_settings.yaml` shows `table_mode: explicit` |
+| 8.4 | Catalog reflects subs | Subscribed vs unsubscribed tables visually distinct |
+
+---
+
+## Step 9: Test Smart Sync (Phase 5)
+
+### 9a: Check rsync filter generation
+
+```bash
+# After setting explicit subscriptions:
+cat /home/USERNAME/.sync_rsync_filter
+# Should show include/exclude rules
+```
+
+### 9b: Test from analyst machine
+
+```bash
+# On analyst machine (or simulate):
+bash server/scripts/sync_data.sh --dry-run
+# Should show filter-based sync when explicit mode is active
+```
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 9.1 | Filter file exists | `.sync_rsync_filter` created in user home |
+| 9.2 | Correct include/exclude | Subscribed tables included, others excluded |
+| 9.3 | Dry-run uses filter | `--filter="merge ..."` in rsync output |
+| 9.4 | Fallback works | Without filter file, syncs everything (backwards compat) |
+
+---
+
+## Step 10: Migration Test (One-Time Bootstrap)
+
+If you already have a `docs/data_description.md` with tables defined:
+
+```bash
+python3 -c "
+from src.table_registry import TableRegistry
+from pathlib import Path
+
+registry = TableRegistry.import_from_data_description(
+    Path('docs/data_description.md'),
+    Path('/data/src_data/metadata/table_registry.json'),
+    registered_by='migration@test.com'
+)
+print(f'Migrated {len(registry.list_tables())} tables')
+print(f'Version: {registry.version}')
+"
+```
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 10.1 | Migration succeeds | All tables imported |
+| 10.2 | Registry JSON valid | `cat table_registry.json \| python3 -m json.tool` |
+| 10.3 | migrated_from marker | `"migrated_from": "docs/data_description.md"` in metadata |
+| 10.4 | Admin UI shows tables | /admin/tables lists all migrated tables |
+
+---
+
+## Step 11: Regression Tests
+
+```bash
+cd /opt/data-analyst/repo
+source .venv/bin/activate
+python -m pytest tests/ -v
+```
+
+### Checklist
+
+| # | Check | Expected |
+|---|-------|----------|
+| 11.1 | All tests pass | 132+ tests, 0 failures |
+| 11.2 | No import errors | All modules load cleanly |
+
+---
+
+## Quick Smoke Test Script
+
+Run this after full setup to verify the critical path:
+
+```bash
+#!/bin/bash
+# smoke_test.sh - Quick verification of self-service onboarding
+set -e
+
+APP_DIR="/opt/data-analyst/repo"
+cd "$APP_DIR"
+source .venv/bin/activate
+
+echo "=== Smoke Test ==="
+
+# 1. Tests
+echo "[1/5] Running tests..."
+python -m pytest tests/ -q --tb=short
+echo "  PASS"
+
+# 2. Registry module
+echo "[2/5] Testing Table Registry..."
+python -c "
+from src.table_registry import TableRegistry
+from pathlib import Path
+import tempfile
+r = TableRegistry(Path(tempfile.mktemp(suffix='.json')))
+r.register_table({'id': 'test.t', 'name': 't', 'primary_key': 'id', 'sync_strategy': 'full_refresh'}, 'test')
+assert r.is_registered('test.t')
+r.unregister_table('test.t')
+assert not r.is_registered('test.t')
+print('  PASS')
+"
+
+# 3. Discovery (needs Keboola credentials)
+echo "[3/5] Testing Discovery API..."
+python -c "
+try:
+    from src.data_sync import create_data_source
+    ds = create_data_source()
+    tables = ds.discover_tables()
+    print(f'  PASS - Discovered {len(tables)} tables')
+except Exception as e:
+    print(f'  SKIP - {e}')
+"
+
+# 4. Profiler API
+echo "[4/5] Testing Profiler API..."
+python -c "
+from src.profiler import profile_changed_tables
+result = profile_changed_tables([])
+assert result == {'success': 0, 'errors': 0, 'skipped': 0}
+print('  PASS')
+"
+
+# 5. Webapp imports
+echo "[5/5] Testing Webapp imports..."
+python -c "
+from webapp.auth import admin_required, login_required
+from webapp.sync_settings_service import get_table_subscriptions, generate_rsync_filter
+from src.table_registry import TableRegistry, ConflictError
+print('  PASS')
+"
+
+echo ""
+echo "=== All smoke tests passed ==="
+```
+
+---
+
+## Troubleshooting
+
+| Problem | Fix |
+|---------|-----|
+| `/admin/tables` returns 403 | User not in `data-ops` group. Run `usermod -aG data-ops USERNAME` |
+| Discovery returns empty | Check `KEBOOLA_STORAGE_TOKEN` in `.env`, verify `DATA_SOURCE=keboola` |
+| Profiles not generated | Check `/data/src_data/parquet/` has parquet files, check DuckDB installed |
+| Rsync filter not created | Check `sudo` permissions for `www-data` in sudoers-webapp |
+| `data_description.md` not updating | Check write permissions on `docs/` directory |
+| Webapp won't start | Check `journalctl -u webapp -n 50` for errors |