Petr 7c9007a8f9 Update docs for modular architecture (auth/, services/, scripts/)

Add auth providers, standalone services, and service patterns
to project structure in README, ARCHITECTURE, and CLAUDE.md.
Reflects the completed extraction of auth, telegram bot,
ws gateway, corporate memory, and session collector.

2026-03-09 13:11:40 +01:00

7.4 KiB


AI Data Analyst

Open-source data distribution platform for AI analytical systems. It syncs data from various sources, converts it to Parquet, and distributes it to analysts who use Claude Code for local analysis.

First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

Step 1: Gather Information

Ask the user for:

Company domain (e.g., "acme.com") - used for Google OAuth
Data source type: keboola / csv / bigquery (future)
Instance name (e.g., "Acme Data Analyst")

Step 2: Generate Configuration

Copy config/instance.yaml.example to config/instance.yaml
Fill in values from Step 1
If Keboola: ask for Storage API token, stack URL, project ID
Create .env from config/.env.template
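
The copy steps above could be scripted roughly as follows. This is a minimal sketch, not part of the project: `copy_template` and `bootstrap_config` are hypothetical helper names, and the sketch assumes an existing `config/instance.yaml` or `.env` should never be overwritten.

```python
import shutil
from pathlib import Path


def copy_template(template: Path, target: Path) -> bool:
    """Copy template to target unless target already exists.

    Returns True when a new file was written, False when an existing
    file was left untouched.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


def bootstrap_config(project_root: Path) -> dict:
    """Create config/instance.yaml and .env from their templates."""
    # Maps each target file to the template it is created from.
    pairs = {
        "config/instance.yaml": "config/instance.yaml.example",
        ".env": "config/.env.template",
    }
    return {
        target: copy_template(project_root / template, project_root / target)
        for target, template in pairs.items()
    }
```

Calling `bootstrap_config(Path("."))` from the project root returns a dict showing which files were newly created. The values gathered in Step 1 still have to be filled in afterwards.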

Step 3: Generate Data Description

If Keboola adapter: use the API to fetch table metadata and generate docs/data_description.md
If CSV: ask user to describe their data files
The file defines tables, sync strategies, and schema

Step 4: Server Setup (if deploying)

Guide VM provisioning (or use existing server)
Run server/setup.sh on the target VM
Run server/webapp-setup.sh for the web portal
Set up CI/CD from .github/workflows/deploy.yml.example

Project Structure

├── src/                    # Core data sync engine (vendor-neutral)
│   ├── config.py           # Configuration from data_description.md
│   ├── data_sync.py        # Sync orchestration + DataSource ABC
│   ├── parquet_manager.py  # Parquet file management
│   └── profiler.py         # Data profiling
├── connectors/             # Data source connectors (pluggable)
│   ├── keboola/            # Keboola Storage connector
│   └── jira/               # Jira webhook connector
├── auth/                   # Authentication providers (pluggable)
│   ├── google/             # Google OAuth provider
│   ├── password/           # Email/password provider
│   └── desktop/            # Desktop JWT provider (API-only)
├── services/               # Standalone services (own systemd units)
│   ├── telegram_bot/       # Telegram notification bot
│   ├── ws_gateway/         # WebSocket notification gateway
│   ├── corporate_memory/   # AI knowledge aggregation
│   └── session_collector/  # Claude Code session collector
├── webapp/                 # Flask web portal (login, dashboard, API)
├── server/                 # Deployment infrastructure only
├── scripts/                # Utility scripts (sync, DuckDB setup, dev)
├── config/                 # Configuration templates
│   ├── instance.yaml.example
│   └── data_description.md.example
├── docs/                   # Documentation
└── tests/                  # Test suite

Architecture

Data Source (Keboola / CSV / BigQuery)
      │
      ▼
┌─────────────────────────────────┐
│  Data Broker Server             │
│  ├── /data/src_data/parquet/    │  Converted data
│  ├── /data/docs/                │  Documentation
│  └── /data/scripts/             │  Helper scripts
└─────────────────────────────────┘
      │ rsync (via ~/server/ symlinks)
      ▼
┌─────────────────────────────────┐
│  Analyst (local machine)        │
│  ├── ./server/  (read-only)     │  parquet, docs, scripts
│  └── ./user/    (workspace)     │  duckdb, notifications
└─────────────────────────────────┘

Configuration

Instance-specific config is in config/instance.yaml. See config/instance.yaml.example for all options.

Environment variables go in .env (never committed to git).

Data schema is defined in docs/data_description.md (YAML blocks in markdown).
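
To illustrate "YAML blocks in markdown", the sketch below pulls the raw text of each fenced YAML block out of a markdown string. It is an assumption about how such a file could be read, not the project's actual parser in src/config.py, and `extract_yaml_blocks` is a hypothetical name.

```python
import re

# Built programmatically so this example nests cleanly inside markdown.
FENCE = "`" * 3

# Matches a fence opened with "yaml" or "yml" and captures everything up to
# the next line that starts with a closing fence.
YAML_BLOCK = re.compile(
    rf"^{FENCE}ya?ml[ \t]*\n(.*?)^{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_yaml_blocks(markdown: str) -> list[str]:
    """Return the raw text of every fenced YAML block, in document order."""
    return [match.group(1) for match in YAML_BLOCK.finditer(markdown)]
```

Each returned string can then be passed to a YAML parser such as PyYAML's `yaml.safe_load`. The keys used inside the blocks are defined by the project and are not shown here.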

Development

# Setup
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run webapp locally
flask --app webapp.app run --debug

# Run tests
pytest tests/ -v

# Sync data
python -m src.data_sync

Extensibility

Data Sources

Pluggable data source connectors in connectors/:

Keboola (keboola): Syncs from Keboola Storage API
CSV (csv): Import from local CSV files (planned)
New connector = connectors/<name>/adapter.py implementing DataSource

Configure the data source in config/instance.yaml under data_source.type.

Authentication

Pluggable auth providers in auth/:

Google (google): OAuth via Google
Password (password): Email/password with magic links
Desktop (desktop): JWT for desktop app API
New provider = auth/<name>/provider.py implementing AuthProvider

Server Management

# Add analyst user
sudo add-analyst username "ssh-rsa AAAA..."

# Add privileged analyst
sudo add-analyst username "ssh-rsa AAAA..." --private

# List analysts
list-analysts

# Server monitoring
uptime && free -h && df -h /data

Returning Users

When reopening the project in Claude Code:

Sync latest data: bash server/scripts/sync_data.sh
Verify DuckDB: ls -lh user/duckdb/analytics.duckdb
Start analyzing with Claude Code

Key Implementation Details

Config Loading Chain

config/loader.py loads instance.yaml (checks $CONFIG_DIR, then ./config/)
webapp/config.py calls _load_instance_config() at module level
_get(config, *keys, default="") traverses nested dicts safely
inject_config() context processor exposes Config to all Jinja templates
Templates use {{ config.INSTANCE_NAME }}, {{ config.INSTANCE_SUBTITLE }}, etc.

Connector Pattern

ABC: DataSource class in src/data_sync.py
Registry: create_data_source() in src/data_sync.py auto-discovers connectors in connectors/
Keboola: connectors/keboola/adapter.py -> KeboolaDataSource implementing DataSource
Core Keboola logic: connectors/keboola/client.py (Keboola Storage API wrapper)
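
A simplified sketch of the connector pattern follows. The real `DataSource` interface is not shown in this document, so the method names `list_tables` and `fetch_rows` are assumptions, `InMemoryDataSource` is a hypothetical connector, and the explicit `REGISTRY` dict stands in for the auto-discovery that the real `create_data_source()` performs.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Stand-in for the ABC in src/data_sync.py; method names are assumed."""

    @abstractmethod
    def list_tables(self) -> list[str]:
        """Return the names of the tables this source can provide."""

    @abstractmethod
    def fetch_rows(self, table: str) -> list[dict]:
        """Return the rows of one table as dictionaries."""


class InMemoryDataSource(DataSource):
    """Hypothetical connector used only to illustrate the pattern."""

    def __init__(self, tables: dict[str, list[dict]]):
        self._tables = tables

    def list_tables(self) -> list[str]:
        return sorted(self._tables)

    def fetch_rows(self, table: str) -> list[dict]:
        return list(self._tables[table])


# Explicit registry for illustration; the project discovers connectors itself.
REGISTRY = {"memory": InMemoryDataSource}


def create_data_source(source_type: str, **kwargs) -> DataSource:
    """Instantiate the connector registered under source_type."""
    try:
        connector_cls = REGISTRY[source_type]
    except KeyError:
        raise ValueError(f"Unknown data source type: {source_type!r}") from None
    return connector_cls(**kwargs)
```

Because the sync engine only depends on the abstract class, a new connector can be added without touching the core code in src/.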

Auth Provider Pattern

ABC: AuthProvider class in auth/__init__.py
Discovery: discover_providers() scans auth/*/provider.py
Providers: google, password, desktop (each exports provider instance)
Session contract: all providers set session["user"] = {"email", "name", "picture"}
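
The session contract can be illustrated with the sketch below. Only the `session["user"]` keys and the module-level `provider` export come from the description above. The `authenticate` method, the `set_session_user` helper, and `StaticProvider` are hypothetical, and a plain dict stands in for the Flask session.

```python
from abc import ABC, abstractmethod
from typing import Optional


class AuthProvider(ABC):
    """Stand-in for the ABC in auth/__init__.py; the interface is assumed."""

    name: str

    @abstractmethod
    def authenticate(self, credentials: dict) -> Optional[dict]:
        """Return a user dict on success, or None when login fails."""


def set_session_user(session: dict, email: str, name: str, picture: str = "") -> None:
    """Write the user record in the shape every provider is expected to use."""
    session["user"] = {"email": email, "name": name, "picture": picture}


class StaticProvider(AuthProvider):
    """Hypothetical provider that accepts one hard-coded account."""

    name = "static"

    def authenticate(self, credentials: dict) -> Optional[dict]:
        if credentials.get("email") == "analyst@acme.com":
            return {"email": "analyst@acme.com", "name": "Analyst", "picture": ""}
        return None


# Each auth/<name>/provider.py exports a module-level instance like this one.
provider = StaticProvider()
```

Keeping the three keys identical across providers means templates and API handlers can read `session["user"]` without knowing which provider performed the login.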

Service Pattern

Self-contained modules in services/ with __main__.py for python -m services.<name>
Systemd files in services/<name>/systemd/, auto-discovered by deploy.sh
Services: telegram_bot, ws_gateway, corporate_memory, session_collector
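
A service entry point in this style might look like the sketch below. Everything in it is an assumption made for illustration: the `services.example` name, the `--poll-seconds` and `--max-ticks` flags, and the signal-based shutdown are not taken from the existing services.

```python
import argparse
import logging
import signal
import threading

log = logging.getLogger("services.example")
stop_event = threading.Event()


def handle_signal(signum, frame):
    """Ask the main loop to exit so systemd can stop the unit cleanly."""
    stop_event.set()


def run(poll_seconds: float, max_ticks=None) -> int:
    """Do one unit of work per tick until asked to stop; return the tick count."""
    ticks = 0
    while not stop_event.is_set():
        ticks += 1
        log.info("tick %d", ticks)
        if max_ticks is not None and ticks >= max_ticks:
            break
        # Sleeps for poll_seconds but wakes immediately when stop_event is set.
        stop_event.wait(poll_seconds)
    return ticks


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="python -m services.example")
    parser.add_argument("--poll-seconds", type=float, default=5.0)
    parser.add_argument("--max-ticks", type=int, default=None)
    args = parser.parse_args(argv)
    logging.basicConfig(level=logging.INFO)
    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)
    run(args.poll_seconds, args.max_ticks)
    return 0


# In services/<name>/__main__.py the final line would be:
#     raise SystemExit(main())
```

With that final line in place, `python -m services.example --max-ticks 3 --poll-seconds 0` would run three ticks and exit, which is handy for a smoke test before enabling the systemd unit.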

Server Patterns

Atomic JSON writes: tempfile.mkstemp() + os.fchmod(fd, 0o660) + os.replace()
User home writes: sudo /usr/bin/install -o {user} -g {user} pattern
Staging dir: /tmp/data_analyst_staging (deploy.sh creates it with setgid)
Dev docs: dev_docs/server.md documents all established patterns
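
The atomic JSON write pattern can be sketched as follows. The three calls named above are used as described. The function name `write_json_atomic`, the `fsync` step, and the cleanup on failure are assumptions added for a complete example, and `os.fchmod` limits the sketch to Unix-like systems.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data, mode: int = 0o660) -> None:
    """Write data as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise os.replace() cannot rename it atomically.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # mkstemp creates the file as 0o600; widen it for group access.
            os.fchmod(handle.fileno(), mode)
            json.dump(data, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Swap the finished file into place in a single rename.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temp file on any failure, then re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Setting the mode on the temporary file before the rename means the target never appears with the restrictive default permissions, even briefly.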

Files NOT to modify (stable infrastructure)

src/parquet_manager.py - Parquet conversion engine
connectors/jira/file_lock.py - Advisory file locking
connectors/jira/incremental_transform.py - Jira monthly Parquet transform
services/ws_gateway/ - WebSocket notification gateway

Git Commits & Pull Requests

Keep commit messages clean and concise
Do not include AI attribution in commits or PRs
