From 53f39bb38da2200972af4fa83fbdb4fffa320e40 Mon Sep 17 00:00:00 2001 From: ZdenekSrotyr Date: Thu, 9 Apr 2026 09:06:13 +0200 Subject: [PATCH] =?UTF-8?q?chore:=20clean=20stale=20docs=20=E2=80=94=20rew?= =?UTF-8?q?rite=20architecture.md,=20remove=20old=20plans?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - architecture.md rewritten for v2 (FastAPI, DuckDB, Docker) — removed all Flask/rsync/SSH/systemd references - Deleted PLAN.md and REFACTORING_PLAN.md (completed, superseded) - auto-install.md replaced with redirect to DEPLOYMENT.md - Fixed absolute paths in superpowers plan doc --- README.md | 238 +++--- docs/PLAN.md | 252 ------ docs/REFACTORING_PLAN.md | 576 ------------- docs/architecture.md | 769 ++++++++---------- docs/auto-install.md | 663 +-------------- .../plans/2026-03-27-01-duckdb-state-layer.md | 4 +- 6 files changed, 456 insertions(+), 2046 deletions(-) delete mode 100644 docs/PLAN.md delete mode 100644 docs/REFACTORING_PLAN.md diff --git a/README.md b/README.md index bd2dec0..19eb864 100644 --- a/README.md +++ b/README.md @@ -1,169 +1,149 @@ -# AI Data Analyst +# Agnes — AI Data Analyst -A data distribution platform for AI analytical systems. It pulls data from configured sources, converts it to Parquet format, and distributes it to analysts who query it locally using Claude Code and DuckDB. +Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB. -## How It Works +Each data source produces a self-describing `extract.duckdb` file. The `SyncOrchestrator` attaches all extract databases into a master `analytics.duckdb`, making every table available through a unified view layer without copying data unnecessarily. -```mermaid -flowchart TB - subgraph Sources["Data Sources"] - A[(Keboola)] - B[(CSV Files)] - C[(BigQuery / Snowflake)] - style C stroke-dasharray: 5 5 - end +## Architecture: extract.duckdb Contract - subgraph Broker["Data Broker Server"] - D[Source Adapter] - E[Parquet Converter] - D --> E - end +Every connector produces the same output structure: - subgraph Analyst["Analyst Machine"] - F[Parquet Files] - G[(DuckDB)] - H((Claude Code)) - F --> G - G --> H - end - - A --> D - B --> D - C -.->|planned| D - E -->|rsync over SSH| F +``` +/data/extracts/{source_name}/ +├── extract.duckdb ← _meta table + views +└── data/ ← parquet files (local sources only) ``` -1. The server fetches data from a configured source using the appropriate adapter. -2. Raw data is converted to typed, columnar Parquet files. -3. Analysts sync Parquet files to their machines over SSH (rsync). -4. Claude Code queries the local DuckDB database and returns results with insights. +The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `analytics.duckdb`, and creates master views. -## Features +``` +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ Keboola │ │ BigQuery │ │ Jira │ +│ extractor │ │ extractor │ │ webhooks │ +│ (DuckDB ext) │ │ (remote BQ) │ │ (incremental)│ +└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ + ▼ ▼ ▼ + extract.duckdb extract.duckdb extract.duckdb + + data/*.parquet (views → BQ) + data/*.parquet + │ │ │ + └─────────────────┼─────────────────┘ + ▼ + SyncOrchestrator.rebuild() + ATTACH → master views in analytics.duckdb + │ + ┌──────────┼──────────┐ + ▼ ▼ ▼ + FastAPI CLI + (serve) (da sync) +``` -- **Pluggable data sources** -- connector interface supporting Keboola out of the box, CSV import, and extensible to BigQuery, Snowflake, and others. -- **Pluggable authentication** -- auto-discovered auth providers (Google OAuth, email/password, desktop JWT, or custom). -- **Automatic Parquet conversion** -- source data is converted to typed, partitioned Parquet files for efficient local querying. -- **SSH-based distribution** -- analysts sync data securely via rsync; no cloud credentials leave the server. -- **Claude Code as analyst interface** -- natural language queries against DuckDB, powered by Claude. -- **Claude Code as installer** -- the CLAUDE.md file guides Claude Code through automated project setup for new analysts. -- **Self-service webapp** -- web UI for user onboarding, SSH key management, sync settings, and data catalog browsing. -- **Corporate Memory** -- shared knowledge base that aggregates analyst insights and distributes approved rules back to the team. -- **Configurable per-instance** -- a single `config/instance.yaml` controls branding, authentication, data source, user mapping, and more. -- **Access control** -- role-based permissions with standard analyst, privileged analyst, and admin tiers. +## Supported Data Sources -## Quick Start +| Source | Mode | Description | +|--------|------|-------------| +| **Keboola** | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule | +| **BigQuery** | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download | +| **Jira** | Real-time push | Webhook receiver updates Parquet files incrementally | -See **[docs/QUICKSTART.md](docs/QUICKSTART.md)** for full setup instructions. +Adding a new source means creating `connectors//extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically. -The short version: +## Quick Start with Docker ```bash -# 1. Clone the repository +# Clone the repository git clone https://github.com/keboola/agnes-the-ai-analyst.git cd agnes-the-ai-analyst -# 2. Copy and edit configuration +# Copy and edit configuration cp config/instance.yaml.example config/instance.yaml -cp config/data_description.md.example config/data_description.md +cp config/.env.template .env # Edit both files for your environment -# 3. Deploy the server -# See docs/DEPLOYMENT.md for detailed server setup +# Start the app and scheduler +docker compose up -# 4. Analysts connect via the webapp and sync data -bash server/scripts/sync_data.sh +# Start with all optional services (Telegram bot, etc.) +docker compose --profile full up +``` + +Once running, the FastAPI app is available at `http://localhost:8000`. Trigger a manual sync: + +```bash +curl -X POST http://localhost:8000/api/sync/trigger +``` + +## Development Setup + +```bash +# Create and activate virtual environment +python3 -m venv .venv && source .venv/bin/activate + +# Install dependencies +uv pip install -r requirements.txt + +# Run FastAPI locally with hot reload +uvicorn app.main:app --reload + +# Run the test suite +pytest tests/ -v ``` ## Project Structure ``` -ai-data-analyst/ -├── config/ # Instance configuration -│ ├── instance.yaml.example # Main config template (copy to instance.yaml) -│ └── data_description.md.example # Data schema template -│ -├── src/ # Core data sync engine (vendor-neutral) -│ ├── data_sync.py # Orchestrates data pull + DataSource ABC -│ ├── parquet_manager.py # CSV to Parquet conversion -│ ├── config.py # Configuration loader -│ └── profiler.py # Data profiling for catalog -│ -├── connectors/ # Data source connectors (pluggable) -│ ├── keboola/ # Keboola Storage connector -│ │ ├── adapter.py # KeboolaDataSource (implements DataSource) -│ │ └── client.py # Low-level Keboola API client -│ └── jira/ # Jira webhook connector -│ -├── auth/ # Authentication providers (pluggable) -│ ├── google/ # Google OAuth provider -│ ├── password/ # Email/password provider -│ └── desktop/ # Desktop JWT provider (API-only) -│ -├── services/ # Standalone services (own systemd units) -│ ├── telegram_bot/ # Telegram notification bot -│ ├── ws_gateway/ # WebSocket notification gateway -│ ├── corporate_memory/ # AI knowledge aggregation -│ └── session_collector/ # Claude Code session collector -│ -├── webapp/ # Flask web application -│ └── ... # User onboarding, settings, catalog -│ -├── server/ # Deployment infrastructure only -│ ├── deploy.sh # Deployment script (auto-discovers services) -│ └── ... # Sudoers, nginx, setup scripts -│ -├── scripts/ # Helper scripts -│ ├── sync_data.sh # Sync data from server -│ ├── setup_views.sh # Initialize DuckDB views -│ └── dev_run.py # Dev server with auth bypass -│ -├── docs/ # User-facing documentation -├── dev_docs/ # Developer and operator documentation -├── tests/ # Test suite -├── requirements.txt # Python dependencies -├── CLAUDE.md # Instructions for Claude Code -└── README.md # This file +├── src/ # Core engine +│ ├── db.py # DuckDB schema (system.duckdb, analytics.duckdb) +│ ├── orchestrator.py # SyncOrchestrator — ATTACHes extract.duckdb files +│ ├── repositories/ # DuckDB-backed CRUD (sync_state, table_registry, users, etc.) +│ ├── profiler.py # Data profiling +│ └── catalog_export.py # OpenMetadata catalog export +├── app/ # FastAPI application +│ ├── main.py # App setup, router registration +│ ├── api/ # REST API (sync, data, catalog, admin, auth) +│ ├── auth/ # Auth providers (Google OAuth, email magic link, desktop JWT) +│ └── web/ # HTML dashboard routes +├── connectors/ # Data source connectors (extract.duckdb contract) +│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback) +│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension) +│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb +├── cli/ # CLI tool (`da sync`, `da query`, `da admin`) +├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.) +├── scripts/ # Utility + migration scripts +├── config/ # Configuration templates (instance.yaml.example) +├── docs/ # Documentation + metric YAML definitions +└── tests/ # Test suite (633 tests) ``` -## Supported Data Sources +## Configuration -| Adapter | Status | Description | -|---------|--------|-------------| -| Keboola Storage | Available | Pulls tables via the Keboola Storage API | -| CSV | Planned | Imports local or mounted CSV files | -| BigQuery | Planned | Google BigQuery adapter | -| Snowflake | Planned | Snowflake adapter | +| File | Purpose | +|------|---------| +| `config/instance.yaml` | Instance-specific settings: branding, data source type, auth provider, Google domain | +| `.env` | Secrets and environment variables — never committed | +| `system.duckdb` `table_registry` table | Table definitions managed via `POST /api/admin/tables/{id}` or the web UI | -Adding a new data source means creating a connector module in `connectors/` that implements the `DataSource` interface from `src/data_sync.py`, and setting `data_source.type` in `config/instance.yaml`. See `connectors/keboola/` for a reference implementation. +Copy the example to get started: -## Using with Claude Code - -Once data is synced, open Claude Code in the project directory and ask questions in natural language: - -``` -What are the top 10 customers by revenue this quarter? +```bash +cp config/instance.yaml.example config/instance.yaml ``` -``` -Show me the trend in support ticket volume over the last 6 months. -``` - -Claude Code will connect to the local DuckDB database, write and execute SQL, and return results with analysis. +See `config/instance.yaml.example` for all available options. ## Documentation -- **[Architecture](ARCHITECTURE.md)** -- System components, data flow, and key patterns -- **[Quick Start](docs/QUICKSTART.md)** -- End-to-end setup for new deployments -- **[Configuration](docs/CONFIGURATION.md)** -- All configuration options explained -- **[Deployment](docs/DEPLOYMENT.md)** -- Server provisioning and deployment guide -- **[Data Sources](docs/DATA_SOURCES.md)** -- How to configure and extend data source adapters -- **[Server Administration](dev_docs/server.md)** -- Day-to-day server operations -- **[Security](dev_docs/security.md)** -- Access control and security model +- [Deployment Guide](docs/DEPLOYMENT.md) — server provisioning, Docker, environment setup + +## Contributing + +1. Fork the repository and create a feature branch. +2. Run `pytest tests/ -v` to verify all tests pass before opening a pull request. +3. Keep commits focused and messages concise. +4. Open a pull request against `main` with a clear description of the change. + +For bugs and feature requests, open a GitHub issue. ## License This project is licensed under the [MIT License](LICENSE). - ---- - -Questions or issues? Open a GitHub issue. diff --git a/docs/PLAN.md b/docs/PLAN.md deleted file mode 100644 index f7b07af..0000000 --- a/docs/PLAN.md +++ /dev/null @@ -1,252 +0,0 @@ -# Modular Architecture Refactor Plan - -## Goal - -Transform the project from a monolithic structure into a modular, extensible platform where: -- **Auth providers** are pluggable (Google, password, Okta, SAML, custom) -- **Services** are standalone, self-contained modules (telegram bot, WS gateway, etc.) -- **server/** contains only deployment infrastructure -- New features = new directory, zero changes to core - -## Target Structure - -``` -ai-data-analyst/ -├── src/ # Core sync engine (done) -├── connectors/ # Data source connectors (done) -│ ├── keboola/ -│ └── jira/ -│ -├── auth/ # Pluggable auth providers -│ ├── __init__.py # AuthProvider ABC + discover_providers() -│ ├── google/ # Google OAuth -│ │ ├── __init__.py -│ │ └── provider.py # Blueprint + GoogleAuthProvider -│ ├── password/ # Email/password (requires SendGrid) -│ │ ├── __init__.py -│ │ └── provider.py # Blueprint + PasswordAuthProvider -│ └── desktop/ # JWT for desktop/API clients -│ ├── __init__.py -│ └── provider.py # Blueprint + DesktopAuthProvider -│ -├── services/ # Standalone optional services -│ ├── __init__.py # discover_services() for deploy -│ ├── telegram_bot/ # Telegram notification bot -│ │ ├── __init__.py -│ │ ├── __main__.py # python -m services.telegram_bot -│ │ ├── bot.py, sender.py, dispatch.py, runner.py -│ │ ├── config.py, storage.py, status.py, test_report.py -│ │ ├── systemd/ -│ │ │ └── notify-bot.service -│ │ └── README.md -│ ├── ws_gateway/ # WebSocket notification gateway -│ │ ├── __init__.py -│ │ ├── __main__.py -│ │ ├── gateway.py, auth.py, config.py -│ │ ├── systemd/ -│ │ │ └── ws-gateway.service -│ │ └── README.md -│ ├── corporate_memory/ # AI knowledge extraction -│ │ ├── __init__.py -│ │ ├── __main__.py -│ │ ├── collector.py, prompts.py -│ │ ├── systemd/ -│ │ │ ├── corporate-memory.service -│ │ │ └── corporate-memory.timer -│ │ └── README.md -│ └── session_collector/ # User session log collection -│ ├── __init__.py -│ ├── __main__.py -│ ├── collector.py -│ ├── systemd/ -│ │ ├── session-collector.service -│ │ └── session-collector.timer -│ └── README.md -│ -├── webapp/ # Flask web portal (slim core) -│ ├── app.py # Core routing + auto-discovery -│ ├── auth.py # login_required + provider loading -│ ├── config.py # Config from instance.yaml -│ ├── user_service.py, account_service.py -│ ├── health_service.py, sync_settings_service.py -│ ├── email_service.py -│ ├── telegram_service.py # Webapp-side Telegram integration -│ ├── corporate_memory_service.py # Webapp-side knowledge browser -│ ├── notification_images.py -│ ├── templates/, static/, utils/ -│ └── __init__.py -│ -├── server/ # Deployment infrastructure ONLY -│ ├── deploy.sh # Auto-discovers services/*/systemd/* -│ ├── setup.sh, webapp-setup.sh -│ ├── bin/ # add-analyst, list-analysts, etc. -│ ├── sudoers-*, limits-*.conf -│ ├── webapp.service, webapp-nginx.conf -│ └── migrate-*.sh -│ -├── scripts/ # Analyst-facing helpers (merged dev_scripts/) -├── config/ # Instance configuration -├── docs/ # User documentation -├── dev_docs/ # Developer docs (sanitized) -├── examples/ # Example scripts -└── tests/ # Test suite -``` - -## Auth Provider Interface - -```python -# auth/__init__.py - -class AuthProvider(ABC): - """Base class for authentication providers.""" - - @abstractmethod - def get_name(self) -> str: - """Internal name (e.g., 'google', 'password').""" - - @abstractmethod - def get_blueprint(self) -> Blueprint: - """Flask blueprint with auth routes.""" - - @abstractmethod - def get_login_button(self) -> dict: - """Login button definition for the login page. - Returns: { - "text": "Sign in with Google", - "url": "/login/google", - "icon": "google", # CSS class or SVG name - "subtitle": "For @acme.com email addresses.", - "order": 10, # Sort order on login page - } - """ - - def is_available(self) -> bool: - """Check if provider is configured and ready. - Override to check env vars, API keys, etc.""" - return True - - def get_display_name(self) -> str: - """Human-readable name for UI.""" - return self.get_name().title() -``` - -### Discovery - -```python -def discover_providers() -> list[AuthProvider]: - """Auto-discover auth providers from auth/*/provider.py. - Each provider module must export `provider` instance.""" - providers = [] - auth_dir = Path(__file__).parent - for subdir in sorted(auth_dir.iterdir()): - if subdir.is_dir() and (subdir / "provider.py").exists(): - mod = importlib.import_module(f"auth.{subdir.name}.provider") - provider = getattr(mod, "provider", None) - if provider and isinstance(provider, AuthProvider) and provider.is_available(): - providers.append(provider) - return providers -``` - -### Login Template - -```html -{# webapp/templates/login.html - dynamic login buttons #} -{% for provider in auth_providers %} - - {{ provider.login_button.text }} - -{% if provider.login_button.subtitle %} -

{{ provider.login_button.subtitle }}

-{% endif %} -{% endfor %} -``` - -### Session Contract - -All auth providers MUST set the same session structure: -```python -session["user"] = { - "email": "user@acme.com", # Required - unique identifier - "name": "John Doe", # Optional - display name - "picture": "https://...", # Optional - avatar URL -} -``` - -## Implementation Phases - -### Phase 1: Move services to services/ (git mv + fix imports) - -**Files moved:** -- `server/telegram_bot/` -> `services/telegram_bot/` -- `server/ws_gateway/` -> `services/ws_gateway/` -- `server/corporate_memory/` -> `services/corporate_memory/` -- `server/session_collector.py` -> `services/session_collector/collector.py` -- Service files from `server/*.service` -> `services/*/systemd/` -- Timer files from `server/*.timer` -> `services/*/systemd/` - -**Import fixes:** -- `from server.telegram_bot.X` -> `from services.telegram_bot.X` (in webapp/app.py) -- `python -m server.X` -> `python -m services.X` (in systemd files, bin/ scripts) -- Internal imports within services stay as relative imports - -**Config updates:** -- `server/deploy.sh` - discover services from `services/*/systemd/` -- `server/bin/collect-knowledge` - update module path -- `server/bin/collect-sessions` - update module path - -### Phase 2: Extract auth providers to auth/ - -**Files moved:** -- `webapp/auth.py` -> `auth/google/provider.py` (OAuth logic) -- `webapp/password_auth.py` -> `auth/password/provider.py` -- `webapp/desktop_auth.py` -> `auth/desktop/provider.py` - -**What stays in webapp/auth.py:** -- `login_required` decorator (used everywhere) -- `/logout` route -- Session management utils - -**New files:** -- `auth/__init__.py` - AuthProvider ABC + discover_providers() -- `auth/google/__init__.py` -- `auth/password/__init__.py` -- `auth/desktop/__init__.py` - -**webapp/app.py changes:** -- Replace hardcoded blueprint imports with `discover_providers()` -- Pass `auth_providers` to login template context -- Remove try/except blocks for individual auth modules - -### Phase 3: Update deploy.sh service discovery - -**deploy.sh changes:** -- Auto-discover and install `services/*/systemd/*.service` and `*.timer` -- Remove hardcoded service file paths -- Add enable/disable per instance.yaml config - -### Phase 4: Cleanup - -- Merge `dev_scripts/` into `scripts/` -- Sanitize `dev_docs/` (replace real IPs, hostnames, usernames with placeholders) -- Update CLAUDE.md, README.md, ARCHITECTURE.md -- Update MEMORY.md - -## Verification - -```bash -# 1. All tests pass -pytest tests/ connectors/ -v - -# 2. No server.telegram_bot imports remain -grep -rn "from server\.\(telegram_bot\|ws_gateway\|corporate_memory\)" . - -# 3. No hardcoded auth imports in app.py -grep -n "from.*auth import\|from.*password_auth" webapp/app.py - -# 4. Import smoke tests -python -c "from auth import discover_providers; print(f'{len(discover_providers())} providers')" -python -c "from services.telegram_bot.bot import TelegramBot; print('OK')" - -# 5. Service files discoverable -ls services/*/systemd/*.service services/*/systemd/*.timer -``` diff --git a/docs/REFACTORING_PLAN.md b/docs/REFACTORING_PLAN.md deleted file mode 100644 index e31dd7c..0000000 --- a/docs/REFACTORING_PLAN.md +++ /dev/null @@ -1,576 +0,0 @@ -# Refaktoring AI Data Analyst — Finální plán - -## Kontext - -Platforma vznikla iterativně pro interní Keboola a nyní se má stát produktem pro zákazníky (Groupon aj.). Klíčové problémy z transcriptu ZS+Padák: křehký filesystem stav (JSON soubory, permission konflikty), žádné API (vše SSH+skripty), bezpečnost přes Linux skupiny, složitá instalace (10+ kroků). Systém je navržen pro AI agenty — člověk diskutuje s AI, AI řeší vše (user, admin, dev operace). - -**UX zůstává stejné.** Tooling: `uv` všude místo pip. Docker + Kamal pro server. CLI (`da`) jako primární rozhraní pro AI agenty. - ---- - -## Architektura — cílový stav - -``` -SERVER (Docker + Kamal): -├── webapp Flask UI (katalog, login, corporate memory) -├── api FastAPI (CLI backend, sync manifest, data download) -├── scheduler APScheduler (nahrazuje 7 systemd timerů) -├── telegram-bot Telegram notifikace -├── ws-gateway WebSocket pro desktop app -└── script-runner Sandboxovaný runner pro user skripty - -LOKÁLNĚ (analytik): -├── da CLI Python balíček (uv tool install) -├── DuckDB Embedded (analytics.duckdb → views na parquety) -└── Parquety Stažené ze serveru přes da sync - -DVA DuckDB NA SERVERU: -├── /data/state/system.duckdb Systémový stav (users, sync_state, knowledge...) -└── /data/analytics/server.duckdb Views → /data/parquet/** (profiler, remote query, skripty) - -JEDEN DuckDB LOKÁLNĚ: -└── user/duckdb/analytics.duckdb Views → server/parquet/** + user tabulky -``` - ---- - -## Fáze 0: Základ — DuckDB state + repository vrstva - -**Cíl:** Nahradit 10+ JSON souborů DuckDB databází. Eliminovat #1 zdroj outages (file permission konflikty). - -**Proč DuckDB:** Už v stacku, agent může joinovat stav s analytickými daty, lepší než SQLite pro analytické dotazy nad stavem. - -### Task 0A: DuckDB schema + repository vrstva [INDEPENDENT] - -Nové soubory: -- `src/db.py` — DuckDB connection management, schema creation, migration system -- `src/repositories/__init__.py` -- `src/repositories/sync_state.py` — CRUD pro sync stav -- `src/repositories/users.py` — CRUD pro uživatele + role -- `src/repositories/knowledge.py` — CRUD pro corporate memory -- `src/repositories/table_registry.py` — CRUD pro registr tabulek -- `src/repositories/audit.py` — audit log -- `src/repositories/notifications.py` — telegram links, pending codes, script registry - -Schema tabulky (mapování z JSON): - -| Současný JSON | DuckDB tabulka | Zdroj soubor | -|---|---|---| -| `sync_state.json` | `sync_state` | `src/data_sync.py:37-138` | -| `sync_settings.json` | `user_sync_settings` | `webapp/sync_settings_service.py:20` | -| `knowledge.json` | `knowledge_items` | `webapp/corporate_memory_service.py` | -| `votes.json` | `knowledge_votes` | `webapp/corporate_memory_service.py` | -| `audit.jsonl` | `audit_log` | `webapp/corporate_memory_service.py` | -| `telegram_users.json` | `telegram_links` | `services/telegram_bot/storage.py` | -| `pending_codes.json` | `pending_codes` | `services/telegram_bot/storage.py` | -| `password_users.json` | `users` | `webapp/password_auth.py` | -| `table_registry.json` | `table_registry` | `src/table_registry.py` | -| `profiles.json` | `table_profiles` | `src/profiler.py` | - -Přidat navíc: `sync_history` (posledních 10 syncí per tabulka, ne jen last), `script_registry` (deployed skripty). - -### Task 0B: Migrace existujících service souborů na repository [DEPENDS ON 0A] - -Soubory k úpravě (nahradit `_read_json`/`_write_json` za repository volání): -- `webapp/sync_settings_service.py` řádky 40-62 -- `webapp/corporate_memory_service.py` — 31 JSON operací -- `webapp/telegram_service.py` řádky 22-45 -- `src/data_sync.py` — třída `SyncState` řádky 37-138 -- `src/table_registry.py` — `_load`, `_atomic_write_json` -- `src/profiler.py` — uložení profilů -- `services/corporate_memory/collector.py` — čtení/zápis knowledge -- `services/telegram_bot/storage.py` — 15 JSON operací - -Pattern: dual-write (JSON + DuckDB) po přechodnou dobu → ověřit → smazat JSON zápisy. - -### Task 0C: Migrační skript [DEPENDS ON 0A] - -- `scripts/migrate_json_to_duckdb.py` — načte všechny JSON, vloží do DuckDB -- Idempotentní (safe to run multiple times) -- Validace po migraci (count porovnání) - -### Co se NEMĚNÍ v Fázi 0 -- Flask routes v `webapp/app.py` -- HTML šablony -- Konektory (`connectors/keboola/`, `connectors/bigquery/`, `connectors/jira/`) -- `src/config.py` (čte `data_description.md` — konfigurace, ne stav) -- `config/loader.py` (čte `instance.yaml`) -- `src/parquet_manager.py` - ---- - -## Fáze 1: API vrstva (FastAPI) - -**Cíl:** REST API pro CLI. Všechny operace co dnes vyžadují SSH. - -### Task 1A: FastAPI základ + auth [INDEPENDENT od 0B, DEPENDS ON 0A] - -Nové soubory: -``` -api/ - __init__.py - app.py # FastAPI app, middleware, CORS - auth.py # JWT vydávání + validace - dependencies.py # DI pro DuckDB session, current_user -``` - -Auth flow: -1. `POST /api/auth/login` — přijme OAuth token z webappu, vydá JWT -2. `POST /api/auth/token` — přijme API key, vydá JWT -3. JWT obsahuje: user_id, email, role, expiry -4. Middleware validuje JWT na všech /api/* endpoints - -### Task 1B: Sync + Data endpointy [DEPENDS ON 1A, 0A] - -``` -api/routers/ - sync.py # GET /api/sync/manifest, POST /api/sync/trigger - data.py # GET /api/data/{table}/download (parquet stream) -``` - -- `/api/sync/manifest` — vrátí hashe všech parquetů, docs, rules, profilů (filtrované per-user dle subscription) -- `/api/data/{table}/download` — streaming parquet souboru s ETag/If-None-Match -- `/api/sync/trigger` — spustí DataSyncManager (reuse `src/data_sync.py`) - -### Task 1C: Query + Scripts endpointy [DEPENDS ON 1A, 0A] - -``` -api/routers/ - query.py # POST /api/query (remote query) - scripts.py # POST /api/scripts/run, /deploy, /list -``` - -- `/api/query` — reuse `src/remote_query.py`, výsledek jako JSON/parquet -- `/api/scripts/run` — spustí Python skript v sandboxu na serveru -- `/api/scripts/deploy` — nahraje skript + registruje v scheduleru -- `/api/scripts/list` — deployed skripty s jejich schedules - -### Task 1D: User management + Corporate memory endpointy [DEPENDS ON 1A, 0A] - -``` -api/routers/ - users.py # CRUD uživatelů, role, permissions - settings.py # GET/PUT sync settings per user - memory.py # Corporate memory CRUD, voting, governance - health.py # GET /api/health (strukturovaná diagnostika) - upload.py # POST sessions, artifacts, CLAUDE.local.md -``` - -### Task 1E: Odstranění SSH/sudo závislostí [DEPENDS ON 1B, 1D] - -Smazat/přepsat: -- `webapp/sync_settings_service.py` řádky 128-240 (sudo/rsync-filter kód) -- `webapp/user_service.py` — Linux user management (`pwd.getpwnam`, `sudo add-analyst`) -- SSH key validace workflow -- `server/sudoers-webapp`, `server/sudoers-deploy` -- `server/bin/add-analyst` - ---- - -## Fáze 2: CLI nástroj (`da`) - -**Cíl:** Jediné rozhraní pro AI agenty. Nahrazuje SSH+skripty. `uv tool install`. - -### Task 2A: CLI základ + auth [INDEPENDENT od 1B-1E, DEPENDS ON 1A] - -``` -cli/ - __init__.py - main.py # Typer app, global options (--server, --json) - config.py # ~/.config/da/ management - client.py # HTTP client wrapper (auth, retry, streaming) - commands/ - auth.py # da login, da logout, da whoami -``` - -- `da login` → otevře browser pro OAuth → server vydá JWT → uloží do `~/.config/da/token.json` -- `da --json` flag na všech příkazech pro strukturovaný output -- `da --server URL` override (default z config.yaml) - -### Task 2B: Sync příkazy [DEPENDS ON 2A, 1B] - -``` -cli/commands/ - sync.py # da sync, da sync --table X, da sync --upload-only -``` - -Flow: -1. `GET /api/sync/manifest` → porovnej s `~/.config/da/sync_state.json` -2. Download změněné parquety (HTTP streaming s progress barem) -3. Download docs, rules, profily -4. Upload sessions, artifacts, CLAUDE.local.md -5. Rebuild DuckDB views (DROP views, CREATE VIEW per tabulka, zachovej user tabulky) -6. Update lokální manifest - -Přepíše funkci `scripts/sync_data.sh` (475 řádků). - -### Task 2C: Query + Scripts příkazy [DEPENDS ON 2A, 1C] - -``` -cli/commands/ - query.py # da query "SQL" [--remote] [--json] - scripts.py # da scripts list/run/deploy/undeploy - explore.py # da explore {table} — profil tabulky -``` - -- `da query` — lokální DuckDB default, `--remote` přes server API -- `da scripts run X` — lokálně default, `--remote` přes server -- `da scripts deploy X --schedule "cron"` — upload + registrace na serveru -- `da explore orders` — profil z lokálních dat (nebo `--remote` ze serveru) - -### Task 2D: Admin + Server příkazy [DEPENDS ON 2A, 1D] - -``` -cli/commands/ - admin.py # da admin add-user/remove-user/list-users - status.py # da status [--local] — zdraví systému - server.py # da server deploy/rollback/logs/status - diagnose.py # da diagnose — AI-friendly diagnostika -``` - -- `da status` — strukturovaný health report (tabulky, sync stav, služby) -- `da status --local` — offline: kdy jsem synkoval, kolik dat mám -- `da diagnose` — projde logy, sync stav, konektivitu → root cause -- `da server deploy` — wrapper kolem `kamal deploy` -- `da server logs webapp` — wrapper kolem `kamal app logs` - -### Task 2E: PyPI distribuce [DEPENDS ON 2A] - -- `pyproject.toml` pro CLI balíček -- `uv tool install data-analyst` nebo `uv pip install data-analyst` -- Entry point: `[project.scripts] da = "cli.main:app"` -- Minimální dependencies: typer, httpx, duckdb, rich (progress bars) - ---- - -## Fáze 3: Docker + Kamal - -**Cíl:** `docker compose up` pro dev, `kamal deploy` pro produkci. Nahrazuje 10+ manuálních kroků. - -### Task 3A: Dockerfile + docker-compose.yml [INDEPENDENT] - -``` -Dockerfile # python:3.13-slim, uv install, jeden image -docker-compose.yml # webapp, api, scheduler, telegram-bot, ws-gateway -docker-compose.test.yml # api + test-runner pro integrační testy -``` - -- Jeden image, různý CMD per služba -- Volume `/data` sdílený mezi kontejnery -- `profiles: ["full"]` pro volitelné služby (telegram, ws-gateway) -- `uv sync` místo `pip install` v Dockerfile - -### Task 3B: Scheduler služba [DEPENDS ON 0A] - -Nový soubor: `services/scheduler/__main__.py` -- APScheduler (nebo jednoduchý custom) nahrazuje 7 systemd timerů: - -| Timer | Schedule | Funkce | -|---|---|---| -| data-refresh | 15 min | `DataSyncManager.sync_scheduled()` | -| catalog-refresh | 15 min | Catalog refresh | -| corporate-memory | 30 min | Knowledge collector | -| session-collector | 6h | Session collection (z uploaded dat) | -| user-scripts | per-script cron | Script runner | -| profiler | po data-refresh | Auto-profile nových dat | - -### Task 3C: Kamal konfigurace [DEPENDS ON 3A] - -``` -config/ - deploy.yml # produkční Kamal config - deploy.staging.yml # staging override -``` - -- Kamal Proxy pro auto-SSL (Let's Encrypt) -- Healthcheck na `/api/health` -- Zero-downtime deploy -- Accessories: scheduler, telegram-bot, ws-gateway, script-runner -- Environment secrets přes Kamal env management - -### Task 3D: GitHub Actions CI/CD [DEPENDS ON 3A, 3C] - -``` -.github/workflows/ - ci.yml # test + build na každém push - deploy.yml # staging na PR, production na merge do main -``` - -Flow: push → pytest → integrační testy (docker compose) → build image → push GHCR → kamal deploy staging (PR) / production (merge) - -### Task 3E: Smazání starého server infra [DEPENDS ON 3A-3D, ověřeno že nové funguje] - -Smazat: -- `server/setup.sh` (103 řádků) -- `server/webapp-setup.sh` (171 řádků) -- `server/deploy.sh` (395 řádků) -- `server/migrate-to-v2.sh` (146 řádků) -- Všechny systemd unit soubory (`services/*/systemd/`) -- `server/sudoers-*` -- `server/bin/add-analyst` a related skripty -- `scripts/sync_data.sh` (475 řádků) -- `server/webapp.service`, `server/webapp-nginx.conf` - ---- - -## Fáze 4: RBAC + bezpečnost - -**Cíl:** Aplikační RBAC místo Linux skupin. Audit trail. Script sandboxing. - -### Task 4A: Role + permissions v DuckDB [DEPENDS ON 0A] - -Nový soubor: `src/rbac.py` - -```python -class Role(Enum): - VIEWER = "viewer" # Katalog, čtení dat - ANALYST = "analyst" # Sync, queries, voting, skripty - ADMIN = "admin" # Správa uživatelů, schvalování knowledge - KM_ADMIN = "km_admin" # Corporate memory governance -``` - -- Dataset-level permissions (kdo má přístup ke kterým datům) -- Přepsat `webapp/auth.py` řádky 37-65 (admin_required/km_admin_required) -- Přepsat `webapp/user_service.py` celý — DB místo `pwd.getpwnam()` + `sudo` - -### Task 4B: Audit trail [DEPENDS ON 0A] - -- Každý API call logován do `audit_log` tabulky -- Struktura: timestamp, user_id, action, resource, params, result, duration -- Agent může: `da query "SELECT * FROM system.audit_log WHERE action='sync_trigger' ORDER BY timestamp DESC LIMIT 10"` - -### Task 4C: Script sandboxing [DEPENDS ON 3A] - -- Script-runner jako izolovaný Docker kontejner -- Read-only přístup k DuckDB -- Omezená paměť (512MB), čas (5min), žádný network (kromě notification dispatch) -- Explicitní whitelist Python balíčků (pandas, duckdb, matplotlib) - -### Task 4D: Corporate memory push model [DEPENDS ON 1D] - -- Uživatelé pushují CLAUDE.local.md přes `da sync --upload-only` -- Server nikdy nečte `/home/*/` jako root -- Corporate memory collector zpracovává uploaded data z DB - ---- - -## Dependency graf pro multi-agenty - -``` -Fáze 0: - 0A (DuckDB schema) ─────────────────────┐ - 0C (migrační skript) ← závisí na 0A │ - 0B (migrace services) ← závisí na 0A │ - │ -Fáze 1: │ - 1A (FastAPI základ) ← závisí na 0A ─────┤ - 1B (sync/data EP) ← závisí na 1A, 0A │ - 1C (query/scripts EP) ← závisí na 1A │ - 1D (users/memory EP) ← závisí na 1A │ - 1E (remove SSH) ← závisí na 1B, 1D │ - │ -Fáze 2: │ - 2A (CLI základ) ← závisí na 1A │ - 2B (sync cmd) ← závisí na 2A, 1B │ - 2C (query/scripts cmd) ← závisí na 2A │ - 2D (admin/server cmd) ← závisí na 2A │ - 2E (PyPI) ← závisí na 2A │ - │ -Fáze 3: │ - 3A (Dockerfile) ← INDEPENDENT ──────────┘ - 3B (scheduler) ← závisí na 0A - 3C (Kamal) ← závisí na 3A - 3D (CI/CD) ← závisí na 3A, 3C - 3E (cleanup) ← závisí na 3A-3D verified - -Fáze 4: - 4A (RBAC) ← závisí na 0A - 4B (audit) ← závisí na 0A - 4C (sandbox) ← závisí na 3A - 4D (push model) ← závisí na 1D -``` - -### Paralelní agenty — optimální rozložení - -``` -AGENT 1: DuckDB + Repositories AGENT 2: FastAPI AGENT 3: Docker + Kamal -───────────────────────────── ───────────────── ────────────────────── -0A: DuckDB schema (čeká na 0A) 3A: Dockerfile + compose -0C: migrační skript 1A: FastAPI základ 3B: scheduler služba -0B: migrace services 1B: sync/data EP 3C: Kamal konfigurace -4A: RBAC 1C: query/scripts EP 3D: CI/CD workflow -4B: audit trail 1D: users/memory EP 4C: script sandbox - 1E: remove SSH deps - -AGENT 4: CLI + Skills AGENT 5: Integrace + Cleanup -───────────────────── ─────────────────────────── -(čeká na 1A) (čeká na agents 1-4) -2A: CLI základ + auth End-to-end testování -2B: sync příkazy 3E: smazání starého infra -2C: query/scripts příkazy 4D: corporate memory push -2D: admin/server příkazy 5A: CLAUDE.md template update -2E: PyPI distribuce Dokumentace update -5B: CLI skills (help/docs) -5C: da setup (interactive) -5D: da diagnose -5E: da infra (multi-customer) -``` - ---- - -## Znovupoužité vs. přepsané soubory - -### Beze změny (business logika zachována) -- `src/config.py` — TableConfig, Config parsing (625 řádků) -- `src/parquet_manager.py` — Parquet conversion engine -- `connectors/keboola/adapter.py` + `client.py` -- `connectors/bigquery/adapter.py` + `client.py` -- `connectors/jira/` — celý connector -- `connectors/llm/` — LLM abstrakce -- `connectors/openmetadata/` — katalog enrichment -- `webapp/config.py`, `config/loader.py` -- `webapp/templates/` — všechny HTML šablony -- `src/remote_query.py` — query logika (zabalená API) -- `src/profiler.py` — profiling logika (output do DuckDB) - -### Přepojené na DuckDB (logika zachována, I/O vrstva vyměněna) -- `webapp/corporate_memory_service.py` -- `webapp/sync_settings_service.py` -- `webapp/telegram_service.py` -- `src/data_sync.py` (SyncState třída) -- `src/table_registry.py` -- `services/corporate_memory/collector.py` -- `services/telegram_bot/storage.py` - -### Přepsané -- `webapp/user_service.py` — DB místo Linux users -- `webapp/auth.py` řádky 37-65 — RBAC místo Linux skupin - -### Nové -- `src/db.py`, `src/repositories/`, `src/rbac.py` -- `api/` — celý FastAPI server -- `cli/` — celý CLI nástroj -- `Dockerfile`, `docker-compose*.yml`, `config/deploy*.yml` -- `services/scheduler/__main__.py` -- `.github/workflows/ci.yml`, `.github/workflows/deploy.yml` - -### Smazané -- `server/setup.sh`, `server/webapp-setup.sh`, `server/deploy.sh` -- `server/migrate-to-v2.sh` -- `server/sudoers-*`, `server/bin/add-analyst` -- `scripts/sync_data.sh` -- Všechny `services/*/systemd/` soubory -- `server/webapp.service`, `server/webapp-nginx.conf` - ---- - -## Fáze 5: Agent Skills (CLAUDE.md + CLI skills) - -**Cíl:** AI agent má vestavěné znalosti pro nasazení, administraci, diagnostiku a vývoj. Nemusí nic googlit — vše je v skills. - -### Task 5A: CLAUDE.md template pro analytiky [INDEPENDENT] - -Aktualizovat `docs/setup/claude_md_template.md`: -- Instrukce pro `da` CLI místo SSH/rsync -- `da sync` jako povinný start session -- Jak pracovat s lokálním DuckDB -- Jak vytvářet a deployovat skripty -- Jak používat corporate memory -- Notifikační vzory (lokální vs serverové) - -### Task 5B: Admin/Deploy skills v CLI [DEPENDS ON 2D] - -`da` CLI bude obsahovat vestavěné skills — dlouhé help texty s domain knowledge, které AI agent přečte přes `da --help` nebo `da skills `: - -```bash -da skills list # seznam všech dostupných skills -da skills setup # kompletní průvodce setup nové instance -da skills troubleshoot # diagnostické postupy -da skills connectors # jak přidat nový data source -da skills notifications # jak fungují notifikace -da skills corporate-memory # governance, approval flow -da skills security # RBAC, permissions, audit -da skills backup-restore # disaster recovery -da skills upgrade # jak upgradovat verzi -``` - -Každý skill = markdown soubor v `cli/skills/` který se zobrazí přes `da skills `. - -### Task 5C: Interaktivní setup skill [DEPENDS ON 2D, 1A] - -```bash -da setup # AI agent spustí interaktivní setup -``` - -Flow (agent řídí): -1. `da setup init` → vygeneruje `instance.yaml` z konverzace s uživatelem -2. `da setup test-connection` → ověří credentials (Keboola/BigQuery) -3. `da setup deploy` → `docker compose up` nebo `kamal deploy` -4. `da setup first-sync` → triggeruje první data sync -5. `da setup verify` → healthcheck, počet tabulek, sample query -6. `da setup add-user` → přidá prvního analytika - -Každý krok vrací strukturovaný JSON → agent ví co dělat dál. - -### Task 5D: Diagnose skill [DEPENDS ON 2D, 1D] - -```bash -da diagnose # kompletní diagnostika -da diagnose --symptom "data not updating" # cílená diagnostika -da diagnose --component scheduler # diagnostika jedné služby -``` - -Output (strukturovaný pro agenta): -```json -{ - "overall": "degraded", - "checks": [ - {"name": "api", "status": "ok", "latency_ms": 12}, - {"name": "scheduler", "status": "ok", "last_run": "2026-03-27T08:00"}, - {"name": "data_freshness", "status": "warning", - "detail": "table 'orders' last synced 26h ago, expected 15min", - "suggested_action": "da server logs scheduler | grep orders"}, - {"name": "disk", "status": "ok", "usage": "45%"}, - {"name": "duckdb", "status": "ok", "tables": 47, "total_rows": "12.3M"} - ], - "suggested_actions": [ - "Check scheduler logs for 'orders' sync failures", - "Run: da server logs scheduler --since 24h | grep -i error" - ] -} -``` - -### Task 5E: Operační skills pro multi-customer [DEPENDS ON 3C] - -```bash -da infra list # seznam zákaznických instancí -da infra provision --customer acme --cloud gcp --region europe-west1 -da infra status acme # zdraví zákaznické instance -da infra deploy acme # deploy na zákaznický server -da infra backup acme # snapshot dat -``` - -Budoucí rozšíření — Terraform pod kapotou pro provision, Kamal pro deploy. - ---- - -## Verifikace - -### Per-fáze -1. **Fáze 0:** `pytest tests/` zelený, webapp funguje identicky s DuckDB backendem -2. **Fáze 1:** `curl /api/health` → ok, `curl /api/sync/manifest` → manifest, parquet download funguje -3. **Fáze 2:** `da login && da sync` vytvoří identickou strukturu jako `sync_data.sh`, `da query` funguje offline -4. **Fáze 3:** `docker compose up` → všechny služby běží, `kamal deploy -d staging` → staging funguje -5. **Fáze 4:** viewer nemůže triggerovat sync, admin může spravovat uživatele, skripty běží v sandboxu - -### End-to-end test (celý flow) -1. `docker compose up -d` (nebo `kamal deploy`) -2. Přes webapp: přihlásit se, vybrat datasety -3. `da login && da sync` → parquety lokálně -4. `da query "SELECT count(*) FROM orders"` → výsledek offline -5. `da scripts run sales_alert` → lokální exekuce -6. `da scripts deploy sales_alert --schedule "0 8 * * MON"` → serverová exekuce -7. `da sync --upload-only` → sessions/artifacts na serveru -8. Corporate memory: knowledge items viditelné ve webappu -9. Telegram notifikace doručeny -10. `da diagnose` → strukturovaný health report diff --git a/docs/architecture.md b/docs/architecture.md index 23c55c4..d603ff8 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -1,511 +1,426 @@ # Architecture — Detailed Reference -Comprehensive architectural overview of the OSS AI Data Analyst platform. -For a concise summary, see [../ARCHITECTURE.md](../ARCHITECTURE.md). +Comprehensive architectural overview of the AI Data Analyst platform (v2). ## Top-Level Module Map ``` -oss-ai-data-analyst/ -├── src/ Core engine (config, sync, parquet, profiling) -├── connectors/ Pluggable data connectors (keboola, jira) -├── auth/ Pluggable auth providers (google, password, desktop) -├── services/ Standalone background services -├── webapp/ Flask web portal (dashboard, catalog, API) -├── server/ Server deployment (setup, deploy, nginx, systemd) -├── scripts/ Analyst-side utility scripts (sync, DuckDB, dev server) -├── config/ Instance configuration (loader, templates) -├── examples/ Example notification scripts +ai-data-analyst/ +├── src/ Core engine (db, orchestrator, rbac, profiler, repositories) +├── connectors/ Pluggable data connectors (keboola, bigquery, jira, llm, openmetadata) +├── app/ FastAPI application (API + web UI) +│ ├── api/ REST API routers +│ ├── auth/ Auth providers (JWT, Google OAuth, email magic link, password) +│ └── web/ HTML dashboard routes +├── services/ Standalone background services (scheduler, telegram_bot, ws_gateway, …) +├── cli/ CLI tool (da sync, da query, da admin) +├── scripts/ Utility and migration scripts +├── config/ Instance configuration templates ├── tests/ Test suite -├── dev_docs/ Internal development documentation └── docs/ User-facing documentation ``` -## Block Diagram +## System Overview ``` -┌─────────────────────────────────────────────────────────────────────────────┐ -│ EXTERNAL DATA SOURCES │ -│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ -│ │ Keboola │ │ Jira │ │ CSV │ │ BigQuery │ │ -│ │ Storage │ │ Cloud │ │ (plan) │ │ (plan) │ │ -│ └────┬─────┘ └────┬─────┘ └──────────┘ └──────────┘ │ -└────────┼──────────────┼────────────────────────────────────────────────────┘ - │ │ - ▼ ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CONNECTORS (connectors/) auto-discovery via importlib │ -│ │ -│ ┌──────────────────────────┐ ┌─────────────────────────────────────┐ │ -│ │ connectors/keboola/ │ │ connectors/jira/ │ │ -│ │ │ │ │ │ -│ │ adapter.py │ │ webhook.py Flask blueprint │ │ -│ │ KeboolaDataSource (ABC) │ │ service.py Jira REST API client │ │ -│ │ full/incr/partitioned │ │ transform.py JSON -> 6 Parquet tbl│ │ -│ │ │ │ incremental_transform.py realtime │ │ -│ │ client.py │ │ file_lock.py POSIX advisory locks │ │ -│ │ Keboola Storage API │ │ │ │ -│ │ type mapping + cache │ │ scripts/ backfill, SLA poll, │ │ -│ │ │ │ consistency check │ │ -│ │ tests/ │ │ systemd/ jira-sla-poll, │ │ -│ └──────────────────────────┘ │ jira-consistency │ │ -│ │ tests/ │ │ -│ Registry: src/data_sync.py └─────────────────────────────────────┘ │ -│ create_data_source(type) -> │ -│ importlib("connectors.{type}.adapter") │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - ▼ Parquet files -┌─────────────────────────────────────────────────────────────────────────────┐ -│ CORE ENGINE (src/) │ -│ │ -│ ┌─────────────────────┐ ┌──────────────────┐ ┌──────────────────────┐ │ -│ │ data_sync.py │ │ config.py │ │ profiler.py │ │ -│ │ DataSource ABC │ │ data_description │ │ Parquet -> stats │ │ -│ │ SyncState (JSON) │ │ .md parser │ │ alerts, sampling │ │ -│ │ DataSyncManager │ │ TableConfig │ │ -> profiles.json │ │ -│ │ create_data_source()│ │ WhereFilter │ └──────────────────────┘ │ -│ └─────────────────────┘ │ ForeignKey │ │ -│ │ get_config() │ ┌──────────────────────┐ │ -│ └──────────────────┘ │ parquet_manager.py │ │ -│ │ CSV->Parquet, merge │ │ -│ │ upsert, schema │ │ -│ └──────────────────────┘ │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ /data/src_data/parquet/ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ AUTH PROVIDERS (auth/) auto-discovery via scan │ -│ │ -│ ┌────────────────┐ ┌────────────────┐ ┌──────────────────────┐ │ -│ │ auth/google/ │ │ auth/password/ │ │ auth/desktop/ │ │ -│ │ │ │ │ │ │ │ -│ │ Google OAuth │ │ Email+password │ │ JWT for desktop app │ │ -│ │ SSO (Authlib) │ │ Argon2 hash │ │ visible=False │ │ -│ │ domain restrict │ │ SendGrid email │ │ (API-only, not login) │ │ -│ │ order=10 │ │ order=20 │ │ order=100 │ │ -│ └────────────────┘ └────────────────┘ └──────────────────────┘ │ -│ │ -│ ABC: AuthProvider (get_name, get_blueprint, get_login_button, is_avail.) │ -│ Discovery: discover_providers() -> scans auth/*/provider.py │ -│ Contract: all providers set session["user"] = {email, name, picture} │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ Blueprints registered in Flask app - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ WEB PORTAL (webapp/) │ -│ │ -│ ┌───────────────────┐ ┌──────────────────────────────────────────────┐ │ -│ │ app.py (Flask) │ │ Pages │ │ -│ │ - discover auth │ │ /dashboard - account, stats, setup │ │ -│ │ providers │ │ /catalog - data catalog + profiles │ │ -│ │ - register │ │ /corporate-memory - knowledge + voting │ │ -│ │ blueprints │ │ /activity-center - intelligence overview │ │ -│ │ - inject_config() │ └──────────────────────────────────────────────┘ │ -│ │ - routes │ │ -│ └───────────────────┘ ┌──────────────────────────────────────────────┐ │ -│ │ API Endpoints │ │ -│ ┌───────────────────┐ │ /webhooks/jira (HMAC, -> jira connector)│ │ -│ │ webapp services │ │ /api/telegram/* (link/unlink/status) │ │ -│ │ user_service │ │ /api/desktop/* (JWT, scripts, run) │ │ -│ │ account_service │ │ /api/sync-settings (GET/POST) │ │ -│ │ sync_settings_svc │ │ /api/corporate-memory/* (CRUD, votes) │ │ -│ │ telegram_service │ │ /api/catalog/profile/ │ │ -│ │ email_service │ │ /health (service health) │ │ -│ │ health_service │ └──────────────────────────────────────────────┘ │ -│ │ corporate_memory │ │ -│ └───────────────────┘ Config chain: instance.yaml -> loader -> Config -> │ -│ inject_config() -> {{ config.X }} in Jinja │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ BACKGROUND SERVICES (services/) each = __main__.py + systemd │ -│ │ -│ ┌────────────────────────┐ ┌─────────────────────────────────────────┐ │ -│ │ services/telegram_bot/ │ │ services/ws_gateway/ │ │ -│ │ │ │ │ │ -│ │ bot.py polling + │ │ gateway.py WebSocket TCP:8765 │ │ -│ │ HTTP socket │ │ + HTTP dispatch socket │ │ -│ │ runner.py script exec │ │ auth.py JWT validation │ │ -│ │ sender.py msg dispatch │ │ config.py gateway config │ │ -│ │ dispatch.py -> WS gw │ │ │ │ -│ │ storage.py JSON state │ │ Heartbeat: ping/pong, 3 miss = drop │ │ -│ │ status.py /status cmd │ │ Per-user connection limit (5) │ │ -│ │ │ │ │ │ -│ │ Always running (systemd)│ │ Always running (systemd) │ │ -│ └────────────────────────┘ └─────────────────────────────────────────┘ │ -│ │ -│ ┌────────────────────────┐ ┌─────────────────────────────────────────┐ │ -│ │ services/ │ │ services/ │ │ -│ │ corporate_memory/ │ │ session_collector/ │ │ -│ │ │ │ │ │ -│ │ collector.py │ │ collector.py │ │ -│ │ Scans CLAUDE.local.md │ │ Copies .jsonl from user homes │ │ -│ │ -> Claude Haiku -> JSON│ │ to /data/user_sessions/ │ │ -│ │ MD5 change detection │ │ Idempotent, atomic writes │ │ -│ │ prompts.py │ │ │ │ -│ │ LLM prompts for │ │ Timer: every 6 hours │ │ -│ │ knowledge extraction │ │ │ │ -│ │ │ │ │ │ -│ │ Timer: every 30 min │ │ │ │ -│ └────────────────────────┘ └─────────────────────────────────────────┘ │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ Unix sockets + /data/ filesystem - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ SERVER INFRASTRUCTURE (server/) │ -│ │ -│ ┌──────────────────┐ ┌────────────────────┐ ┌───────────────────────┐ │ -│ │ Deployment │ │ User Management │ │ Web Server │ │ -│ │ setup.sh │ │ bin/add-analyst │ │ webapp-nginx.conf │ │ -│ │ deploy.sh (CI/CD) │ │ bin/list-analysts │ │ webapp.service │ │ -│ │ webapp-setup.sh │ │ bin/notify-runner │ │ SSL (Let's Encrypt) │ │ -│ │ sudoers rules │ │ bin/notify-scripts │ │ Gunicorn + Unix sock │ │ -│ └──────────────────┘ └────────────────────┘ └───────────────────────┘ │ -│ │ -│ Groups: dataread (analysts) | data-private (privileged) | data-ops (admin) │ -│ │ -│ /data/ │ -│ ├── src_data/parquet/ shared data (readonly for analysts) │ -│ ├── src_data/metadata/ sync_state.json, profiles.json │ -│ ├── src_data/raw/jira/ webhook JSON, attachments │ -│ ├── docs/ , scripts/ documentation, helper scripts │ -│ ├── notifications/ telegram_users, desktop_users, codes │ -│ ├── corporate-memory/ knowledge.json, votes.json │ -│ └── user_sessions/ centralized Claude Code transcripts │ -└─────────────────────────────────────────────────────────────────────────────┘ - │ - │ rsync (SSH) - scripts/sync_data.sh (bi-directional) - ▼ -┌─────────────────────────────────────────────────────────────────────────────┐ -│ ANALYST WORKSTATION (local) │ -│ │ -│ server/ (read-only, rsynced from broker) │ -│ ├── parquet/, docs/, scripts/, metadata/ │ -│ │ -│ user/ (writable workspace, backed up to server) │ -│ ├── duckdb/analytics.duckdb SQL views over parquet │ -│ ├── notifications/*.py custom notification scripts │ -│ ├── sessions/ Claude Code transcripts │ -│ └── artifacts/ analysis outputs │ -│ │ -│ .claude/rules/ corporate memory knowledge rules │ -│ │ -│ Claude Code <- local analysis over DuckDB + Parquet │ -└─────────────────────────────────────────────────────────────────────────────┘ +┌─────────────────────────────────────────────────────────────────┐ +│ EXTERNAL DATA SOURCES │ +│ Keboola Storage │ BigQuery │ Jira Cloud │ CSV/files │ +└──────────┬────────┴─────┬──────┴──────┬────────┴────────────────┘ + │ │ │ + ▼ ▼ ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ CONNECTORS (connectors/) │ +│ extractor.py per source → extract.duckdb contract │ +└──────────────────────────┬──────────────────────────────────────┘ + │ /data/extracts/{source}/extract.duckdb + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ SYNC ORCHESTRATOR (src/orchestrator.py) │ +│ Scans extracts/, ATTACHes each extract.duckdb, │ +│ creates master views in analytics.duckdb (atomic swap) │ +└──────────────────────────┬──────────────────────────────────────┘ + │ + ┌────────────┼────────────┐ + ▼ ▼ ▼ + ┌──────────────┐ ┌────────┐ ┌──────────────┐ + │ FastAPI app │ │ CLI │ │ Scheduler │ + │ port 8000 │ │ `da` │ │ sidecar │ + └──────────────┘ └────────┘ └──────────────┘ + │ + ┌─────────┴──────────┐ + ▼ ▼ +system.duckdb analytics.duckdb +(state/registry) (master views) ``` -## Auto-Discovery Patterns +**Deployment:** Docker Compose. The `app` service runs Uvicorn. The `scheduler` sidecar triggers +sync jobs via the app's REST API. Optional `full` profile adds telegram-bot, ws-gateway, +corporate-memory, session-collector. -The platform uses three symmetrical auto-discovery mechanisms. Adding a new -connector, auth method, or service requires no changes to existing code. - -### 1. Connector Discovery (`src/data_sync.py`) - -``` -config/instance.yaml -> data_source.type: "keboola" - -> importlib.import_module("connectors.keboola.adapter") - -> KeboolaDataSource (implements DataSource ABC) +```bash +docker compose up # app + scheduler +docker compose --profile full up # all services +docker compose --profile extract run extract # one-shot extraction ``` -- Factory: `create_data_source(type)` in `src/data_sync.py` -- Connectors live in `connectors/{name}/adapter.py` -- Must export a `DataSource` subclass or a `create_data_source()` factory function -- Keboola is hard-coded for ImportError handling; all others use dynamic import +--- -### 2. Auth Provider Discovery (`auth/__init__.py`) +## extract.duckdb Contract + +Every connector writes to the same directory layout: ``` -startup -> scan auth/*/provider.py - -> import `provider` instance - -> filter by is_available() (checks env vars) - -> register blueprint + login button in Flask +/data/extracts/{source_name}/ +├── extract.duckdb ← _meta table + views over parquet files +└── data/ ← parquet files (local connectors only) + ├── table_a.parquet + └── table_b.parquet ``` -- ABC: `AuthProvider` with methods `get_name()`, `get_blueprint()`, `get_login_button()`, `is_available()`, `init_app()` -- Session contract: all providers set `session["user"] = {email, name, picture}` -- Login page renders buttons dynamically, sorted by `order` field +### _meta table -### 3. Service Pattern (`services/*/__main__.py`) +Required in every `extract.duckdb`: -``` -python -m services. # entry point -services//systemd/ # unit files -deploy.sh auto-discovers # systemd/* in each service dir +```sql +CREATE TABLE _meta ( + table_name VARCHAR NOT NULL, + description TEXT, + rows BIGINT, + size_bytes BIGINT, + extracted_at TIMESTAMP, + query_mode VARCHAR -- 'local' or 'remote' +); ``` -- Each service is self-contained: code, systemd units, and config in one directory -- `deploy.sh` scans `services/*/systemd/*.service` and `connectors/*/systemd/*.service` -- Long-running services (telegram_bot, ws_gateway) use async dual-server model -- Periodic services (corporate_memory, session_collector) are systemd timer oneshots +The orchestrator reads `_meta` to know which tables exist and creates a corresponding +view in `analytics.duckdb` for each row. -## Data Flows +### _remote_attach table (optional) -### Pull Sync (Keboola) +Connectors whose views reference an external DuckDB extension (e.g. Keboola, BigQuery) +must include this table so the orchestrator can re-ATTACH the external source at rebuild time: -``` -Keboola Storage API - -> connectors/keboola/client.py (export CSV with filters) - -> src/parquet_manager.py (convert to typed Parquet) - -> /data/src_data/parquet/ (stored on broker) - -> rsync to analyst (scripts/sync_data.sh) - -> DuckDB views (scripts/setup_views.sh) +```sql +CREATE TABLE _remote_attach ( + alias VARCHAR, -- DuckDB alias for the attached source, e.g. 'kbc' + extension VARCHAR, -- Extension name, e.g. 'keboola' + url VARCHAR, -- Connection URL + token_env VARCHAR -- Name of the env var holding the auth token +); ``` -Sync strategies: `full_refresh`, `incremental`, `partitioned`, `chunked_initial_load`. +The orchestrator installs/loads the extension, reads the token from the environment, and +ATTACHes the external source so remote views resolve correctly. This mechanism is generic — +any connector can use it. Auth credentials are never stored in `extract.duckdb`. -### Push Sync (Jira) +--- + +## SyncOrchestrator + +`src/orchestrator.py` — thread-safe via `_rebuild_lock`. + +### rebuild() + +1. Open a **temporary** DuckDB file (`analytics.duckdb.tmp`). +2. Scan `/data/extracts/*/extract.duckdb` (sorted, skips non-directories and missing files). +3. Validate each directory name as a safe SQL identifier (`^[a-zA-Z_][a-zA-Z0-9_]{0,63}$`). +4. For each source: `ATTACH '{db_file}' AS {source_name} (READ_ONLY)`. +5. Handle `_remote_attach` — install extension, read token from env, ATTACH external source. +6. Read `_meta`, validate each `table_name` identifier, create `CREATE OR REPLACE VIEW`. +7. Update `sync_state` in `system.duckdb` (mtime-based hash, no full file read). +8. `CHECKPOINT` and close the temp connection. +9. **Atomic swap**: `shutil.move(tmp_path, target_path)` — replaces `analytics.duckdb` in-place. + +### rebuild_source(source_name) + +Convenience wrapper that calls `rebuild()` in full (partial rebuild is not possible because +`analytics.duckdb` is written fresh from scratch each time). Used after Jira webhooks. + +### Identifier validation + +Both `source_name` and `table_name` are checked against `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$` +before being interpolated into SQL. Invalid names are skipped with a warning. + +--- + +## Data Sources + +### Keboola — Batch Pull + +`connectors/keboola/extractor.py` + +- Uses the DuckDB Keboola community extension to download tables directly to parquet. +- Fallback path: `connectors/keboola/client.py` (Keboola Storage API wrapper). +- Sync strategies: `full_refresh`, `incremental`, `partitioned`. +- Writes `extract.duckdb` + `data/*.parquet` under `/data/extracts/keboola/`. +- For tables with `query_mode='remote'`, populates `_remote_attach` so views proxy queries + to Keboola rather than downloading data locally. + +Sync trigger flow: + +``` +POST /api/sync/trigger (admin) + → BackgroundTask: _run_sync() + → Read table_registry from system.duckdb (main process) + → Serialize configs as JSON, spawn subprocess (no DuckDB lock conflict) + → Subprocess: connectors/keboola/extractor.run() → extract.duckdb + → SyncOrchestrator().rebuild() → analytics.duckdb + → Profiler: profile each synced parquet → table_profiles +``` + +### BigQuery — Remote Attach + +`connectors/bigquery/extractor.py` + +- Uses the DuckDB BigQuery community extension. +- No data download — views proxy all queries directly to BigQuery. +- Auth via `GOOGLE_APPLICATION_CREDENTIALS` (service account JSON) or ADC. +- Populates `_remote_attach` with `extension='bigquery'` and no `token_env` (env-based auth). + +### Jira — Real-Time Push + +`connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py` ``` Jira Cloud webhook (issue created/updated/deleted) - -> connectors/jira/webhook.py (HMAC-SHA256 verification) - -> connectors/jira/service.py (fetch full issue + attachments) - -> /data/src_data/raw/jira/issues/ (atomic JSON write) - -> connectors/jira/incremental_transform.py (update monthly Parquet) - -> /data/src_data/parquet/jira/ (6 tables: issues, comments, - attachments, changelog, - issuelinks, remote_links) + → POST /api/jira/webhook (HMAC-SHA256 verification) + → connectors/jira/webhook.py (validate, persist raw JSON) + → connectors/jira/incremental_transform.py (update monthly parquet shards) + → extract_init.py (update _meta) + → SyncOrchestrator().rebuild_source('jira') ``` -Background jobs supplement the webhook pipeline: -- `jira-sla-poll` (every 5 min): refreshes SLA fields for open tickets -- `jira-consistency` (every 6h): detects and backfills missing issues +Output tables (6): `issues`, `comments`, `attachments`, `changelog`, `issuelinks`, `remote_links`. -### Notification Pipeline +Background supplements: +- `jira-sla-poll` — refreshes SLA fields for open tickets every 5 min. +- `jira-consistency` — detects and backfills missing issues every 6 h. + +Files NOT to modify: `connectors/jira/file_lock.py`, `connectors/jira/transform.py`. + +--- + +## DuckDB Schema + +### system.duckdb — `{DATA_DIR}/state/system.duckdb` + +Current schema version: **3** (auto-migrated from v1/v2 on startup). + +| Table | Purpose | +|-------|---------| +| `schema_version` | Tracks applied migration version | +| `users` | Registered users: id, email, name, role, password_hash, setup/reset tokens | +| `sync_state` | Per-table sync status: last_sync, rows, file_size_bytes, hash, status | +| `sync_history` | Historical sync runs with duration and error | +| `user_sync_settings` | Per-user dataset enable/disable preferences | +| `table_registry` | Registered tables: source_type, bucket, source_table, query_mode, sync_schedule | +| `table_profiles` | JSON data profiles (stats, nulls, cardinality) per table | +| `dataset_permissions` | Per-user per-dataset access grants | +| `access_requests` | Self-service access request workflow | +| `knowledge_items` | Corporate memory knowledge entries | +| `knowledge_votes` | Up/down votes on knowledge items | +| `audit_log` | API action log: user, action, resource, duration | +| `telegram_links` | Telegram chat_id linked to user_id | +| `pending_codes` | Telegram link confirmation codes | +| `script_registry` | Deployed Python notification scripts | + +Connections: `get_system_db()` returns a cursor on a **single shared connection** per +`DATA_DIR` (protected by `threading.Lock`). Callers `close()` the cursor, not the +underlying connection. This avoids DuckDB write-lock conflicts in the multi-threaded +FastAPI process. + +### analytics.duckdb — `{DATA_DIR}/analytics/server.duckdb` + +Read-only views over all ATTACHed `extract.duckdb` sources. Rebuilt atomically by +`SyncOrchestrator.rebuild()`. Query endpoints open this file via `get_analytics_db_readonly()` +which ATTACHes all `extract.duckdb` files in read-only mode so remote views resolve correctly. + +--- + +## Authentication + +All auth flows issue a **JWT** (`app/auth/jwt.py`) stored as a cookie (`access_token`) or +passed as a `Bearer` token in the `Authorization` header. The `get_current_user` dependency +validates the JWT and loads the user from `users` in `system.duckdb`. + +### Providers (`app/auth/providers/`) + +| Provider | Available when | Flow | +|----------|---------------|------| +| `google.py` | `GOOGLE_CLIENT_ID` + `GOOGLE_CLIENT_SECRET` set | Google OAuth 2.0 / OIDC (Authlib). Domain restriction via `allowed_domains` in `instance.yaml`. Callback issues JWT cookie. | +| `email.py` | `SMTP_HOST` or `SENDGRID_API_KEY` set | Magic link: `POST /auth/email/send-link` generates a token stored in `users.setup_token`; `POST /auth/email/verify` exchanges it for a JWT. | +| `password.py` | Always registered | Email + password with hashed credentials. | + +### RBAC + +`src/rbac.py` defines four roles in ascending order: ``` -~/user/notifications/*.py analyst's custom scripts - -> server/bin/notify-runner (cron, executes with timeout) - -> cooldown check (~/.notifications/state/) - ├-> services/telegram_bot/ (Unix socket /run/notify-bot/bot.sock) - │ -> Telegram chat message (text or photo) - └-> services/ws_gateway/ (Unix socket /run/ws-gateway/ws.sock) - -> WebSocket push to desktop app +viewer < analyst < km_admin < admin ``` -Script output format: -```json -{ - "notify": true, - "title": "Revenue dropped 25%", - "message": "Details...", - "cooldown": "6h", - "image_path": "/tmp/chart.png" -} -``` +Stored in `users.role`. The `require_role(Role.ADMIN)` FastAPI dependency factory enforces +minimum role. Table-level access is checked via `can_access_table()`: -### Knowledge Loop (Corporate Memory) +1. Role `admin` → always allowed. +2. `table_registry.is_public = true` → allowed. +3. Explicit row in `dataset_permissions` → allowed. +4. Wildcard bucket permission (`in.c-finance.*`) → allowed. -``` -Analyst writes CLAUDE.local.md (insights, patterns, tips) - -> scripts/sync_data.sh (uploads to server) - -> services/corporate_memory/ (timer, every 30 min) - -> MD5 change detection - -> Claude Haiku extracts knowledge items - -> /data/corporate-memory/knowledge.json - -> webapp /corporate-memory (voting UI: upvote/downvote) - -> scripts/sync_data.sh (downloads to analyst) - -> .claude/rules/ (rules for Claude Code) - -> Claude Code uses rules in next session -``` +--- -## Module Reference +## API Layer -### Core Engine (`src/`) +All routes are FastAPI `APIRouter` instances registered in `app/main.py`. -| File | Lines | Responsibility | -|------|-------|----------------| -| `data_sync.py` | ~1400 | `DataSource` ABC, `SyncState`, `DataSyncManager`, connector factory | -| `config.py` | ~600 | Parse `data_description.md` YAML blocks, `TableConfig`, `WhereFilter`, `ForeignKey` | -| `parquet_manager.py` | ~750 | CSV-to-Parquet conversion, merge, upsert, schema enforcement | -| `profiler.py` | ~1200 | Data profiling: stats, alerts, type classification -> `profiles.json` | +### REST API (`app/api/`) -### Connectors (`connectors/`) +| Router | Prefix | Key endpoints | +|--------|--------|---------------| +| `sync` | `/api/sync` | `GET /manifest` (hash manifest, per-user filtered), `POST /trigger` (admin), `GET/POST /settings`, `GET/POST /table-subscriptions` | +| `data` | `/api/data` | Download parquet files for synced tables | +| `query` | `/api/query` | `POST /` — execute a SELECT against `analytics.duckdb` (sandbox enforced) | +| `admin` | `/api/admin` | `GET /discover-tables`, `GET /registry`, `POST /register-table`, `PUT /registry/{id}`, `DELETE /registry/{id}` | +| `catalog` | `/api/catalog` | Data catalog: table list, profiles, metric definitions | +| `users` | `/api/users` | User CRUD (admin), self-service profile | +| `permissions` | `/api/permissions` | Dataset permission grants (admin) | +| `access_requests` | `/api/access-requests` | Request + review workflow | +| `scripts` | `/api/scripts` | Deploy, list, run, delete Python notification scripts | +| `settings` | `/api/settings` | Instance and user settings | +| `memory` | `/api/memory` | Corporate memory CRUD and voting | +| `upload` | `/api/upload` | File upload (CSV, parquet) | +| `telegram` | `/api/telegram` | Telegram account link/unlink | +| `jira_webhooks` | `/api/jira` | Jira webhook receiver (HMAC-SHA256 verified) | +| `health` | `/api/health` | Service health, sync status, disk | -| Module | Files | Sync Model | Description | -|--------|-------|------------|-------------| -| `keboola/` | adapter.py, client.py, tests/ | Pull (DataSource ABC) | Keboola Storage API, type mapping, metadata caching (24h TTL) | -| `jira/` | webhook.py, service.py, transform.py, incremental_transform.py, file_lock.py, scripts/, systemd/, tests/ | Push (webhook) | Real-time webhook pipeline, SLA polling, consistency monitoring, 6 output Parquet tables | +### Auth routes (`app/auth/`) -### Auth Providers (`auth/`) +`POST /auth/token`, `GET /auth/me`, `POST /auth/logout`, +`GET /auth/google/login`, `GET /auth/google/callback`, +`POST /auth/email/send-link`, `POST /auth/email/verify`, +`POST /auth/password/login` -| Provider | Available when | Login UI | Order | Description | -|----------|---------------|----------|-------|-------------| -| `google/` | `GOOGLE_CLIENT_ID` set | Yes | 10 | Google OAuth SSO with domain restriction | -| `password/` | `SENDGRID_API_KEY` set | Yes | 20 | Email + password for external users (Argon2, rate limiting) | -| `desktop/` | `DESKTOP_JWT_SECRET` set | No (API-only) | 100 | JWT tokens for native desktop app | +### Web UI (`app/web/`) -### Background Services (`services/`) +HTML dashboard routes served by Jinja2 templates. Registered last (catch-all). -| Service | Type | Schedule | Description | -|---------|------|----------|-------------| -| `telegram_bot/` | Long-running | Always on | Telegram polling + HTTP dispatch socket, script execution, /status /test commands | -| `ws_gateway/` | Long-running | Always on | WebSocket TCP:8765 + HTTP dispatch socket, JWT auth, heartbeat | -| `corporate_memory/` | Timer oneshot | Every 30 min | AI knowledge extraction from CLAUDE.local.md via Claude Haiku | -| `session_collector/` | Timer oneshot | Every 6 hours | Copy session .jsonl from user homes to central storage | +--- -### Web Portal (`webapp/`) +## Services -| File | Responsibility | -|------|----------------| -| `app.py` | Flask factory, blueprint registration, route definitions, context processors | -| `config.py` | Load `instance.yaml`, expose `Config` to templates | -| `auth.py` | Core auth infrastructure: `login_required`, `validate_email_domain`, `/login`, `/logout` | -| `user_service.py` | Username derivation, SSH key validation, system account creation | -| `account_service.py` | Dashboard account widget data, cron info, sync status | -| `sync_settings_service.py` | Per-user dataset sync preferences | -| `telegram_service.py` | Telegram account linking/unlinking | -| `desktop_auth.py` | JWT generation/validation, desktop app link state | -| `password_auth.py` | Password auth implementation (Argon2, rate limiting, token workflow) | -| `email_service.py` | SendGrid integration for setup/reset emails | -| `corporate_memory_service.py` | Knowledge CRUD, voting, user rules regeneration | -| `health_service.py` | System health checks (services, timers, disk, load, webhooks) | -| `notification_images.py` | Serve chart PNGs generated by notification runner | -| `utils/metric_parser.py` | Parse business metric YAML definitions for catalog UI | +Each service is a self-contained Python package (`services//__main__.py`) run as a +Docker Compose service. -### Server Infrastructure (`server/`) +| Service | Profile | Schedule / Mode | Description | +|---------|---------|-----------------|-------------| +| `scheduler` | default | Always-on; polls every N seconds | Lightweight sidecar that triggers jobs via the app's REST API (`POST /api/sync/trigger` every 15 min, `GET /api/health` every 5 min). Auth via `SCHEDULER_API_TOKEN` or auto-fetch from `/auth/token`. | +| `telegram_bot` | `full` | Always-on (long-poll) | Telegram bot: polling + HTTP dispatch, `/status` command, notification script execution. | +| `ws_gateway` | `full` | Always-on | WebSocket gateway (TCP 8765) + HTTP dispatch socket. JWT auth. Per-user connection limit (5). Heartbeat ping/pong. | +| `corporate_memory` | `full` | Periodic (every 30 min) | Scans `CLAUDE.local.md` files, extracts knowledge via LLM (Claude Haiku), writes to `knowledge_items` in system.duckdb. | +| `session_collector` | `full` | Periodic (every 6 h) | Copies Claude Code `.jsonl` session transcripts to central storage. | -| File | Responsibility | -|------|----------------| -| `setup.sh` | Initial server bootstrap (groups, users, directories, venv) | -| `deploy.sh` | CI/CD deployment (git pull, deps, scripts, services, ACLs) | -| `webapp-setup.sh` | Nginx + SSL + Gunicorn setup | -| `webapp-nginx.conf` | Nginx reverse proxy config (HTTPS, WebSocket upgrade) | -| `webapp.service` | Systemd unit for Gunicorn | -| `sudoers-deploy` | Sudo rules for deploy user (least-privilege) | -| `sudoers-webapp` | Sudo rules for www-data | -| `bin/add-analyst` | Create analyst user with workspace structure | -| `bin/list-analysts` | List registered analysts | -| `bin/notify-runner` | Execute user notification scripts, dispatch to bot + gateway | -| `bin/notify-scripts` | List/run notification scripts for a user | +Files NOT to modify: `services/ws_gateway/` (stable WebSocket infrastructure). -### Analyst Scripts (`scripts/`) +--- -| File | Responsibility | -|------|----------------| -| `sync_data.sh` | Bi-directional rsync: download data, upload workspace, refresh DuckDB | -| `setup_views.sh` | Create/replace DuckDB views over all Parquet files | -| `duckdb_manager.py` | DuckDB setup utility | -| `dev_run.py` | Development server with auth bypass | -| `collect_session.py` | Session transcript collector (used by service) | -| `generate_user_sync_configs.py` | Generate per-user sync config files | +## Security -## Analyst Workspace Layout +### Query Sandbox (`app/api/query.py`) -Created by `server/bin/add-analyst` for each registered user: +The `/api/query` endpoint enforces a strict SQL allowlist: -``` -/home/{username}/ -├── server/ read-only symlinks to shared data -│ ├── parquet/ -> /data/src_data/parquet -│ ├── docs/ -> /data/docs -│ ├── scripts/ -> /data/scripts -│ ├── metadata/ -> /data/src_data/metadata -│ └── jira_attachments/ -> /data/src_data/raw/jira/attachments -├── user/ writable workspace (backed up to server) -│ ├── duckdb/ local DuckDB database -│ ├── notifications/ custom notification scripts (*.py) -│ ├── artifacts/ analysis outputs -│ ├── scripts/ user helper scripts -│ ├── parquet/ user Parquet files -│ └── sessions/ Claude Code session transcripts -├── .notifications/ notification runner state -│ ├── state/ cooldown tracking (JSON per script) -│ └── logs/ runner logs -└── .claude/ - └── rules/ corporate memory knowledge rules (auto-synced) -``` +- Only `SELECT` and `WITH` queries accepted. +- Blocklist of ~30 keywords/functions: `DROP`, `DELETE`, `INSERT`, `UPDATE`, `ALTER`, + `CREATE`, `ATTACH`, `DETACH`, `LOAD`, `INSTALL`, `COPY`, `PRAGMA`, file functions + (`read_parquet`, `read_csv`, `glob`, etc.), URL schemes (`s3://`, `gcs://`, `http://`), + and multi-statement separator (`;`). +- Table-level RBAC: forbidden views are detected by word-boundary regex match against + the SQL text. Query is rejected if user lacks access to any referenced table. +- Analytics DB opened in `read_only=True` mode per request. -## Security Model +### Script Sandbox (`app/api/scripts.py`) -### System Groups +Deployed and ad-hoc Python scripts are checked against a pattern blocklist before execution: -| Group | Access | -|-------|--------| -| `data-ops` | Full admin access to all server resources | -| `dataread` | Read access to public Parquet data | -| `data-private` | Read access to sensitive/restricted data | +- Blocked: `subprocess`, `shutil`, `ctypes`, `importlib`, `socket`, `requests`, `httpx`, + `urllib`, `os`, `sys`, `signal`, `open(`, `pathlib`, `exec(`, `eval(`, `compile(`, + `__import__`, and others. +- Scripts run in a subprocess with a configurable timeout (`SCRIPT_TIMEOUT`, default 300 s) + and capped output (`SCRIPT_MAX_OUTPUT`, default 64 KB). + +### Identifier Validation (`src/orchestrator.py`, `src/db.py`) + +All dynamic SQL identifiers (source names, table names, extension aliases) are validated +against `^[a-zA-Z_][a-zA-Z0-9_]{0,63}$` before interpolation. Invalid identifiers are +skipped with a log warning, never executed. ### Authentication Layers -| Layer | Mechanism | Scope | -|-------|-----------|-------| -| Web portal | Google OAuth / email+password | Browser sessions | -| Desktop app | JWT Bearer tokens | API endpoints (`/api/desktop/*`) | -| Jira webhook | HMAC-SHA256 signature | Webhook endpoint | -| SSH access | Key-based auth only | Data sync (rsync) | -| Inter-service | Unix socket permissions | Bot, gateway, webapp | +| Layer | Mechanism | +|-------|-----------| +| Web UI / API | JWT Bearer token or `access_token` cookie | +| Google OAuth | Authlib OIDC + domain allowlist | +| Email magic link | `secrets.token_urlsafe(32)` stored in `users.setup_token`, 1-hour expiry | +| Jira webhook | HMAC-SHA256 signature verification | +| Inter-service (scheduler) | `SCHEDULER_API_TOKEN` env var or auto-fetched JWT | -### Permission Boundaries +--- -- Analysts cannot access other users' home directories -- Webapp (www-data) uses sudoers-whitelisted commands for user operations -- Deploy user has explicit sudo rules for service management -- Staging directory (`/tmp/data_analyst_staging`) uses setgid for group ownership -- All JSON state files written atomically: `tempfile.mkstemp()` + `os.fchmod()` + `os.replace()` - -## Configuration Chain +## Configuration ``` -config/instance.yaml (instance-specific, not committed) - | loaded by config/loader.py - | ${ENV_VAR} references resolved from .env / environment - v -webapp/config.py (Flask Config class) - | _load_instance_config() at module level - | _get(config, *keys) for safe nested access - v -inject_config() context processor (exposes Config to templates) - v -{{ config.INSTANCE_NAME }} in Jinja2 (all templates have access) +config/instance.yaml (instance-specific, not committed) + │ loaded by config/loader.py + │ ${ENV_VAR} references resolved from .env + ▼ +app/instance_config.py (exposes get_data_source_type(), get_allowed_domains(), get_value()) + ▼ +FastAPI dependency injection (passed to API routers as needed) ``` -Validation: `config/loader.py` checks required fields at startup (`instance.name`, -`auth.allowed_domain`, `server.host`, `server.hostname`, `auth.webapp_secret_key`). -Missing required fields cause immediate startup failure with a clear error message. +Table configuration lives in `table_registry` inside `system.duckdb`, not in static files. +Use `POST /api/admin/register-table` or the web UI admin panel to register tables. -## Server Filesystem Layout +Required env vars: `DATA_DIR`, `JWT_SECRET_KEY`. Source-specific vars (`KEBOOLA_STORAGE_TOKEN`, +`GOOGLE_CLIENT_ID`, `GOOGLE_CLIENT_SECRET`, `SMTP_HOST` / `SENDGRID_API_KEY`, etc.) are +optional and gate the relevant connectors/providers. + +--- + +## Data Filesystem Layout ``` -/opt/data-analyst/ -├── repo/ git repository (deployed via CI/CD) -├── .venv/ Python virtual environment -├── logs/ application logs -└── .env secrets (mode 0640) - /data/ -├── src_data/ -│ ├── parquet/ shared Parquet files (readonly for analysts) -│ ├── metadata/ sync_state.json, profiles.json, table_metadata.json -│ └── raw/jira/ webhook JSON files, attachments -├── docs/ documentation and schema -├── scripts/ helper scripts synced to analysts -├── notifications/ telegram_users.json, desktop_users.json, pending_codes.json -├── corporate-memory/ knowledge.json, votes.json, user_hashes.json -└── user_sessions/ centralized Claude Code session transcripts - -/run/ -├── notify-bot/bot.sock Telegram bot HTTP socket -├── ws-gateway/ws.sock WebSocket gateway HTTP socket -└── webapp/webapp.sock Gunicorn WSGI socket +├── state/ +│ └── system.duckdb user registry, sync state, table_registry, audit log +├── analytics/ +│ └── server.duckdb master analytics DB (views over all extracts) +└── extracts/ + ├── keboola/ + │ ├── extract.duckdb _meta + views + │ └── data/*.parquet + ├── bigquery/ + │ └── extract.duckdb _meta + _remote_attach + remote views + └── jira/ + ├── extract.duckdb _meta + views + └── data/*.parquet ``` -## CI/CD +--- -### Deploy Guard (`.github/workflows/deploy-guard.yml`) +## Extending the Platform -Runs on every pull request: -1. `pytest tests/test_deploy_guard.py` - validates deploy.sh/sudoers/systemd consistency -2. `pytest tests/test_sync_data.py -m "not live"` - validates sync script reliability -3. `visudo -cf server/sudoers-*` - validates sudoers syntax in Docker +### New Data Source -### Deployment (`.github/workflows/deploy.yml.example`) +1. Create `connectors//extractor.py`. +2. Write `extract.duckdb` with `_meta` table and views/tables. +3. Add `data/*.parquet` for local sources. +4. Add `_remote_attach` row if views reference an external DuckDB extension. +5. `SyncOrchestrator` picks it up automatically on next `rebuild()`. -Runs on push to main (or manual trigger): -1. SSH into server -2. Execute `server/deploy.sh` (git pull, deps, scripts, services, ACLs) +### New Auth Provider + +1. Add `app/auth/providers/.py` exporting a FastAPI `APIRouter`. +2. Register the router in `app/main.py`. +3. All providers must issue a JWT and set the `access_token` cookie on success. diff --git a/docs/auto-install.md b/docs/auto-install.md index 736bb79..43ca72c 100644 --- a/docs/auto-install.md +++ b/docs/auto-install.md @@ -1,662 +1,5 @@ -# Automated Installation Guide +# Auto-Install Guide -Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM. +For deployment instructions, see [DEPLOYMENT.md](DEPLOYMENT.md). -Two repos are involved: -- **OSS repo** (public/private): application code (`keboola/agnes-the-ai-analyst`) -- **Instance repo** (private): your config, secrets template, data schema (your private instance config repo) - -## Architecture on Server - -``` -/opt/data-analyst/ -├── repo/ # OSS repo clone -│ ├── config/ -│ │ └── instance.yaml -> ../../instance/config/instance.yaml (symlink) -│ ├── webapp/ -│ ├── server/ -│ └── ... -├── instance/ # Private instance repo clone -│ ├── config/ -│ │ ├── instance.yaml # Branding, auth domains, data source -│ │ └── data_description.md # Data schema (when configured) -│ ├── docs/setup/ # Custom CLAUDE.md template, etc. -│ ├── .env.example # Secrets template -│ └── README.md -├── .env # Secrets (not in git, from .env.example) -├── .venv/ # Python virtual environment -└── logs/ # Application logs -``` - -Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them. - -## Prerequisites - -1. **DigitalOcean API token** with `ssh_key` scope (or any Ubuntu 24.04 VM) -2. **Two GitHub repos**: one for OSS code, one for private instance config -3. **SSH key** on your local machine for server access - -### Known Issues - -- `python3-venv` must be installed before `server/setup.sh` (Ubuntu 24.04 omits it) -- `webapp-setup.sh` generates SSL nginx config - use HTTP-only for IP-only deployments -- DigitalOcean cloud-init cannot override password expiry; must use `ssh_keys` API field - -## Step 0: Create Repos - -```bash -# Push OSS code to GitHub -git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git -git push -u origin main - -# Create private instance config repo on GitHub (empty, private) -# We'll populate it from the server after setup -``` - -## Step 1: Provision VM - -### 1a: Create Droplet (DigitalOcean) - -```bash -# Register SSH key (requires ssh_key scope on API token) -curl -s -X POST -H 'Content-Type: application/json' \ - -H "Authorization: Bearer $DO_TOKEN" \ - -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \ - "https://api.digitalocean.com/v2/account/keys" - -# Create droplet with SSH key -curl -s -X POST -H 'Content-Type: application/json' \ - -H "Authorization: Bearer $DO_TOKEN" \ - -d '{ - "name":"data-analyst-1", - "size":"s-1vcpu-2gb", - "region":"ams3", - "image":"ubuntu-24-04-x64", - "ssh_keys":["KEY_ID_OR_FINGERPRINT"] - }' \ - "https://api.digitalocean.com/v2/droplets" -``` - -### 1b: Install Prerequisites - -```bash -ssh root@DROPLET_IP - -# Wait for apt lock (auto-updates run on first boot) -apt update && apt install -y python3.12-venv python3-pip -``` - -### 1c: Generate Deploy Keys - -Two separate keys - one per repo, for security isolation: - -```bash -# Key for OSS repo -ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)" - -# Key for private instance config repo -ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)" -``` - -Add each public key as a **deploy key** on its respective GitHub repo: -- `deploy_key.pub` -> OSS repo Settings > Deploy Keys -- `instance_key.pub` -> Instance repo Settings > Deploy Keys - -Configure SSH to use the right key per repo: - -```bash -cat > /root/.ssh/config << 'EOF' -# OSS application repo -Host github-oss - HostName github.com - IdentityFile /root/.ssh/deploy_key - StrictHostKeyChecking no - -# Instance config repo (private) -Host github-cfg - HostName github.com - IdentityFile /root/.ssh/instance_key - StrictHostKeyChecking no -EOF -chmod 600 /root/.ssh/config -``` - -### 1d: Clone OSS Repo & Run Setup - -```bash -git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo -cd /opt/data-analyst/repo -REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh -``` - -### Step 1 Checklist - -| # | Check | Expected | -|---|-------|----------| -| 1.1 | Groups | data-ops, dataread, data-private exist | -| 1.2 | Deploy user | uid deploy, groups: deploy, data-ops | -| 1.3 | Directories | /opt/data-analyst/{repo,.venv,logs} | -| 1.4 | Python venv | Flask loads in .venv | -| 1.5 | Scripts | add-analyst, list-analysts in /usr/local/bin | - -## Step 2: Webapp Setup - -### 2a: Run webapp-setup.sh - -```bash -export SERVER_HOSTNAME="your-domain-or-ip" -bash server/webapp-setup.sh -``` - -For IP-only (no SSL), replace nginx config: - -```bash -cat > /etc/nginx/sites-available/webapp << 'NGINX' -server { - listen 80; - server_name _; - location / { - proxy_pass http://unix:/run/webapp/webapp.sock; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; - proxy_set_header X-Forwarded-Proto $scheme; - proxy_http_version 1.1; - proxy_set_header Upgrade $http_upgrade; - proxy_set_header Connection "upgrade"; - } - location /static/ { - alias /opt/data-analyst/repo/webapp/static/; - expires 1d; - } - location /health { - proxy_pass http://unix:/run/webapp/webapp.sock; - proxy_set_header Host $host; - access_log off; - } -} -NGINX -rm -f /etc/nginx/sites-enabled/default -nginx -t && systemctl restart nginx -``` - -### 2b: Create .env - -```bash -SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))') - -cat > /opt/data-analyst/.env << EOF -WEBAPP_SECRET_KEY="${SECRET_KEY}" -SERVER_HOST="YOUR_IP" -SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN" -GOOGLE_CLIENT_ID="placeholder" -GOOGLE_CLIENT_SECRET="placeholder" -DATA_SOURCE="local" -DATA_DIR="/data/src_data" -EOF - -chown root:data-ops /opt/data-analyst/.env -chmod 640 /opt/data-analyst/.env -``` - -### 2c: Create Data Directories & Start - -```bash -mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts -chown -R root:data-ops /data -chmod -R 2775 /data - -mkdir -p /run/webapp -chown www-data:www-data /run/webapp - -systemctl daemon-reload -systemctl start webapp -systemctl enable webapp -``` - -### Step 2 Checklist - -| # | Check | Expected | -|---|-------|----------| -| 2.1 | Nginx | active, port 80 | -| 2.2 | Webapp | active (gunicorn) | -| 2.3 | Health | `curl http://IP/health` returns JSON | -| 2.4 | Login page | HTTP 200 at /login | - -## Step 3: Instance Configuration (Private Repo) - -### 3a: Clone Instance Repo - -```bash -git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance -chown -R root:data-ops /opt/data-analyst/instance -chmod -R 770 /opt/data-analyst/instance -``` - -### 3b: Initialize Instance Config (if empty repo) - -If this is a fresh instance repo, create the initial config: - -```bash -cd /opt/data-analyst/instance -mkdir -p config docs/setup - -cat > config/instance.yaml << 'YAML' -instance: - name: "My Data Analyst" - subtitle: "My Organization" - copyright: "My Org" - -server: - hostname: "YOUR_IP_OR_DOMAIN" - host: "YOUR_IP" - app_dir: "/opt/data-analyst" - -auth: - allowed_domain: "mycompany.com" - webapp_secret_key: "${WEBAPP_SECRET_KEY}" - -data_source: - type: "local" - -catalog: - categories: {} -YAML - -# Create .env.example as a template for future deployments -cat > .env.example << 'ENV' -WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'" -SERVER_HOST="server-ip" -SERVER_HOSTNAME="server-ip-or-domain" -GOOGLE_CLIENT_ID="placeholder" -GOOGLE_CLIENT_SECRET="placeholder" -DATA_SOURCE="local" -DATA_DIR="/data/src_data" -ENV - -cat > .gitignore << 'GI' -.env -.env.local -*.swp -*~ -.DS_Store -GI - -git add -A && git commit -m "Initial instance config" && git push origin main -``` - -### 3c: Symlink Config into OSS Repo - -```bash -# Remove any existing instance.yaml (from manual setup) and symlink -rm -f /opt/data-analyst/repo/config/instance.yaml -ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml - -# Symlink data_description.md (for Data Catalog - add when ready in Step 6) -ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md - -systemctl restart webapp -``` - -### Step 3 Checklist - -| # | Check | Expected | -|---|-------|----------| -| 3.1 | Instance repo | /opt/data-analyst/instance/ exists | -| 3.2 | Symlink | config/instance.yaml -> ../../instance/config/instance.yaml | -| 3.3 | Webapp loads | Instance name shown on login page | - -## Step 4: Authentication - -Email magic link works without any external service. - -1. Login page shows "Sign in with Email" -2. User enters email with allowed domain -3. Without SMTP: magic link shown in browser (dev mode) -4. With SMTP: link sent via email -5. Click link -> logged in -> dashboard - -Optional: add Google OAuth by setting real `GOOGLE_CLIENT_ID`/`GOOGLE_CLIENT_SECRET`. - -### Step 4 Checklist - -| # | Check | Expected | -|---|-------|----------| -| 4.1 | Email auth | "Sign in with Email" on login page | -| 4.2 | Magic link | Generated for valid domain email | -| 4.3 | Domain check | Rejects wrong domains | -| 4.4 | Login flow | Magic link -> dashboard with session | - -## Step 5: Onboarding Flow (End-User) - -After server is set up, analysts self-onboard via the webapp: - -1. Visit `http://YOUR_SERVER/login` and sign in with email -2. Dashboard shows "Get Started" with 4 steps: - - Create project folder (`mkdir -p data-analyst && cd data-analyst`) - - Generate SSH key (`ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N ''`) - - Copy public key (`cat ~/.ssh/data_analyst_server.pub`) - - Paste key into form, click "Create Account" -3. After account creation, dashboard shows "Set up your local environment" -4. User runs `claude` in their project folder, pastes setup instructions -5. Claude Code configures SSH, rsyncs data, sets up Python + DuckDB - -## Step 6: Sample Data (Try Without a Data Adapter) - -Before connecting a real data source, you can load sample data to verify the full pipeline -(Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis). - -### How the Data Catalog & Profiler Pipeline Works - -``` -Instance repo Server filesystem Webapp -───────────── ──────────────── ────── -config/data_description.md ──symlink──> repo/docs/data_description.md - (tables, folder_mapping, │ - foreign_keys) │ - ▼ -config/instance.yaml ────────symlink──> repo/config/instance.yaml - (catalog.categories, │ - labels, icons, order) │ - ▼ - /data/src_data/parquet/*.parquet - │ - ┌─────────┴──────────┐ - ▼ ▼ - python -m src.profiler _load_catalog_data() - │ │ - ▼ ▼ - /data/src_data/metadata/ /catalog page - profiles.json (categories + tables) - │ - ┌──────────┴──────────┐ - ▼ ▼ - /api/catalog/profile/ _load_data_stats() - (per-table stats, (header: "9 tables, - columns, alerts, ~217K rows total") - relationships, - used_by_metrics) - -docs/metrics/*/*.yml ──────────────> _load_metrics_data() - (metric definitions, │ - SQL examples, ▼ - dimensions) /catalog "Business Metrics" card - /api/metrics/ (modal detail) -``` - -Key files and their roles: - -| File | Location | Purpose | -|------|----------|---------| -| `data_description.md` | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys | -| `instance.yaml` | Instance repo | Catalog category labels, icons, display order | -| `*.parquet` | `/data/src_data/parquet/` | Actual data files (flat or in subfolders) | -| `profiles.json` | `/data/src_data/metadata/` | Profiler output: statistics, alerts, relationships per table | -| `sync_state.json` | `/data/src_data/metadata/` | Sync process stats (optional; profiler provides fallback) | -| `docs/metrics/*/*.yml` | OSS repo (sample) or instance repo (production) | Business metric definitions with SQL examples | - -**Folder mapping** serves dual purpose: maps table IDs to catalog categories for the UI, -and maps to filesystem paths for the profiler. The profiler auto-detects flat layouts -(all parquet files in one directory) vs subfolder layouts (Keboola-style `parquet//
.parquet`). - -### 6a: Generate Parquet Files - -```bash -cd /opt/data-analyst/repo - -# Install generator dependency -/opt/data-analyst/.venv/bin/pip install faker - -# Generate Parquet files directly (uses project's ParquetManager -# for snappy compression, proper types, and metadata embedding) -/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \ - --size m --format parquet --output /data/src_data/parquet --seed 42 - -# Set correct permissions -chown -R root:data-ops /data/src_data/parquet -chmod -R 2775 /data/src_data/parquet -``` - -Available sizes: `xs` (50 customers, ~1 MB), `s` (500, ~15 MB), `m` (5K, ~150 MB), `l` (50K, ~1.5 GB). -See `docs/sample-data.md` for the full data model and built-in analytical patterns. - -### 6b: Configure Data Catalog - -The Data Catalog reads from two files in the **instance repo**: - -1. **`config/data_description.md`** - YAML block with `folder_mapping`, `tables` (id, name, description, primary_key, sync_strategy, foreign_keys) -2. **`config/instance.yaml`** - `catalog.categories` with label, icon per category + `catalog.order` - -The `folder_mapping` maps bucket prefixes from table IDs to category names. Example: -table ID `sample.sales.orders` → bucket `sample.sales` → folder `sales` → category "Sales & Orders". - -Tables with `foreign_keys` will show interactive relationship diagrams in the profiler modal. - -Add `data_description.md` to the instance repo with the sample tables: - -```bash -cd /opt/data-analyst/instance - -# Create data_description.md (see config/data_description.md.example in OSS repo) -# Must contain a ```yaml block with: -# folder_mapping: { "bucket.prefix": "category_key", ... } -# tables: list of table definitions -# -# Each table needs: id, name, description, primary_key, sync_strategy -# Optional: foreign_keys (for profiler Relationships tab) -# -# Example foreign_keys: -# foreign_keys: -# - column: "customer_id" -# references: "customers.customer_id" -# description: "Ordering customer" - -# Add catalog categories to instance.yaml: -cat >> config/instance.yaml << 'YAML' - -catalog: - categories: - customers: - label: "Customers" - icon: "users" - products: - label: "Product Catalog" - icon: "package" - marketing: - label: "Marketing & Campaigns" - icon: "megaphone" - web: - label: "Web Analytics" - icon: "globe" - sales: - label: "Sales & Orders" - icon: "shopping-cart" - support: - label: "Support & Tickets" - icon: "help-circle" - order: [customers, products, marketing, web, sales, support] -YAML - -git add -A && git commit -m "Add sample data catalog" && git push origin main -``` - -Then symlink and restart: - -```bash -# Symlink data_description.md into OSS repo (if not already done) -ln -sf /opt/data-analyst/instance/config/data_description.md \ - /opt/data-analyst/repo/docs/data_description.md - -systemctl restart webapp -``` - -### 6c: Business Metrics - -The Data Catalog includes a **Business Metrics** card that dynamically renders metric -definitions from YAML files. The OSS repo ships with 10 sample e-commerce metrics in -`docs/metrics/` (4 categories: revenue, customers, marketing, support) that align with -the sample data generator tables. - -**How it works:** -- Webapp scans `docs/metrics/*/*.yml` (production: `/data/docs/metrics/`) -- Each YAML file defines one metric with SQL examples, dimensions, and notes -- The profiler links metrics to tables via `used_by_metrics` in `profiles.json` -- Clicking a metric opens a modal with Overview, How to Use, SQL Examples, and Technical tabs - -**For sample data:** metrics work out of the box - the OSS repo includes sample definitions. - -**For production:** create metric YAMLs in the **instance repo** and deploy them to -`/data/docs/metrics/` on the server. The production path takes precedence over the OSS repo. - -```bash -# Instance repo: create metric definitions -mkdir -p /opt/data-analyst/instance/docs/metrics/{revenue,operations} -# ... add your .yml files ... - -# Deploy metrics to server -cp -r /opt/data-analyst/instance/docs/metrics/ /data/docs/metrics/ -chown -R root:data-ops /data/docs/metrics -chmod -R 2775 /data/docs/metrics -``` - -Each metric YAML file follows this structure (list with one dict): - -```yaml -- name: metric_name - display_name: Human Readable Name - category: revenue # must match parent directory name - type: sum # sum, average, count_distinct, ratio - unit: USD - grain: monthly - time_column: order_date - table: orders # primary table - tables: [orders, customers] # optional: all referenced tables - expression: "SUM(total_amount)" - description: "What this metric measures..." - dimensions: [channel, region] - notes: ["Important context..."] - synonyms: [alias1, alias2] - sql: | - SELECT ... FROM ... GROUP BY ... - sql_by_channel: | # any sql_* key is auto-discovered - SELECT ... GROUP BY channel -``` - -### 6d: Run Data Profiler - -The profiler reads parquet files + `data_description.md` and generates `profiles.json` -with per-table statistics, column analysis, data quality alerts, and relationship maps. - -```bash -cd /opt/data-analyst/repo -/opt/data-analyst/.venv/bin/python -m src.profiler -``` - -Output: `/data/src_data/metadata/profiles.json` (auto-created, readable by webapp). - -The profiler provides: -- **Overview**: row count, column count, file size, date coverage, missing cell % -- **Columns**: type distribution, top values, histograms for numeric columns -- **Insights**: data quality alerts (high missing %, imbalanced categories, high cardinality) -- **Relationships**: FK diagram built from `foreign_keys` in `data_description.md`, plus linked Business Metrics -- **Used by Metrics**: shows which metric definitions reference this table (from `docs/metrics/`) -- **Sample**: first 5 rows of the table - -Without `sync_state.json` (no data adapter running), the profiler computes file sizes -directly from parquet files, and the catalog header derives table/row counts from `profiles.json`. - -To re-run after data changes: - -```bash -cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler -# No webapp restart needed - profiles.json is read on each request -``` - -### Step 6 Checklist - -| # | Check | Expected | -|---|-------|----------| -| 6.1 | Parquet files | `ls /data/src_data/parquet/*.parquet` shows 9 files | -| 6.2 | Permissions | Files owned by root:data-ops, group-readable | -| 6.3 | Data Catalog | `/catalog` page shows 6 categories with 9 tables | -| 6.4 | Catalog header | "9 tables, ~217K+ rows total" (from profiles.json) | -| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights | -| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links | -| 6.7 | Used by Metrics | Orders overview → shows total_revenue, campaign_roi, etc. badges | -| 6.8 | Business Metrics | `/catalog` shows "Business Metrics" card with 4 categories, 10 metrics | -| 6.9 | Metric modal | Click any metric → modal with SQL examples, dimensions, notes | -| 6.10 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) | -| 6.11 | Analyst sync | Analyst can rsync parquet files to local machine | -| 6.12 | DuckDB loads | `SELECT count(*) FROM read_parquet('orders.parquet')` returns rows | - -## Step 7: Real Data Source (Production) - -When ready, replace sample data with a real data source adapter in `instance/config/instance.yaml`: - -```yaml -data_source: - type: "keboola" - keboola: - storage_token: "${KEBOOLA_STORAGE_TOKEN}" - stack_url: "https://connection.keboola.com" - project_id: "12345" -``` - -Add the token to `.env` and create `config/data_description.md` with table schemas. - -Other planned adapters: BigQuery, CSV import. - -## Deployment Workflow (Ongoing) - -### Update OSS code -```bash -cd /opt/data-analyst/repo && git pull -bash server/deploy.sh # restarts services, syncs scripts/docs -``` - -### Update instance config -```bash -cd /opt/data-analyst/instance && git pull -systemctl restart webapp # picks up new instance.yaml via symlink -``` - -### Both at once -```bash -cd /opt/data-analyst/repo && git pull -cd /opt/data-analyst/instance && git pull -bash server/deploy.sh -``` - -## Server Layout Summary - -``` -/opt/data-analyst/ -├── repo/ -> git@github-oss:ORG/OSS_REPO.git -├── instance/ -> git@github-cfg:ORG/INSTANCE_REPO.git -├── .env # Secrets (not in git) -├── .venv/ # Python -└── logs/ # App logs - -/root/.ssh/ -├── deploy_key # For OSS repo (github-oss alias) -├── instance_key # For instance repo (github-cfg alias) -└── config # Maps aliases to keys - -Symlinks: - repo/config/instance.yaml -> instance/config/instance.yaml - repo/docs/data_description.md -> instance/config/data_description.md (optional) -``` - -## Quick Verification - -```bash -# Health check -curl http://YOUR_IP/health | python3 -m json.tool - -# Login page -curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login -# Expected: 200 - -# Instance config loaded -curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME' -``` +For local development setup, see the [README](../README.md#development). diff --git a/docs/superpowers/plans/2026-03-27-01-duckdb-state-layer.md b/docs/superpowers/plans/2026-03-27-01-duckdb-state-layer.md index 6c0cb03..21c9c43 100644 --- a/docs/superpowers/plans/2026-03-27-01-duckdb-state-layer.md +++ b/docs/superpowers/plans/2026-03-27-01-duckdb-state-layer.md @@ -103,7 +103,7 @@ def test_schema_version_tracked(): - [ ] **Step 2: Run test to verify it fails** -Run: `cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" && python -m pytest tests/test_db.py -v` +Run: `python -m pytest tests/test_db.py -v` Expected: FAIL — `ModuleNotFoundError: No module named 'src.db'` - [ ] **Step 3: Implement src/db.py** @@ -317,7 +317,7 @@ def get_schema_version(conn: duckdb.DuckDBPyConnection) -> int: - [ ] **Step 4: Run test to verify it passes** -Run: `cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" && python -m pytest tests/test_db.py -v` +Run: `python -m pytest tests/test_db.py -v` Expected: 3 tests PASS - [ ] **Step 5: Commit**