docs: add refactoring plan, design spec, and gitignore updates

This commit is contained in:
ZdenekSrotyr 2026-03-27 15:42:57 +01:00
parent e0ce91ddb9
commit 07b396bfe2
4 changed files with 2677 additions and 0 deletions

3
.gitignore vendored

@@ -132,3 +132,6 @@ docs/datasets/*/schema.yml
# Agent-generated reports (not part of codebase)
.audit/
docs/AGENT-REPORTS/
# Internal transcripts and meeting notes
docs/ZS_PADAK_*

576
docs/REFACTORING_PLAN.md Normal file

@@ -0,0 +1,576 @@
# AI Data Analyst Refactoring: Final Plan
## Context
The platform grew iteratively as an internal Keboola tool and is now meant to become a product for customers (Groupon and others). Key problems from the ZS+Padák transcript: fragile filesystem state (JSON files, permission conflicts), no API (everything goes through SSH and scripts), security via Linux groups, and a complex installation (10+ steps). The system is designed for AI agents: a human discusses with the AI, and the AI handles everything (user, admin, and dev operations).
**The UX stays the same.** Tooling: `uv` everywhere instead of pip. Docker + Kamal for the server. The CLI (`da`) is the primary interface for AI agents.
---
## Architecture: target state
```
SERVER (Docker + Kamal):
├── webapp Flask UI (catalog, login, corporate memory)
├── api FastAPI (CLI backend, sync manifest, data download)
├── scheduler APScheduler (replaces 7 systemd timers)
├── telegram-bot Telegram notifications
├── ws-gateway WebSocket for the desktop app
└── script-runner Sandboxed runner for user scripts
LOCAL (analyst):
├── da CLI Python package (uv tool install)
├── DuckDB Embedded (analytics.duckdb → views on parquet files)
└── Parquet files Downloaded from the server via da sync
TWO DuckDB DATABASES ON THE SERVER:
├── /data/state/system.duckdb System state (users, sync_state, knowledge...)
└── /data/analytics/server.duckdb Views → /data/parquet/** (profiler, remote query, scripts)
ONE DuckDB DATABASE LOCALLY:
└── user/duckdb/analytics.duckdb Views → server/parquet/** + user tables
```
---
## Phase 0: Foundation (DuckDB state + repository layer)
**Goal:** Replace 10+ JSON files with a DuckDB database. Eliminate the #1 source of outages (file permission conflicts).
**Why DuckDB:** It is already in the stack, an agent can join state with analytical data, and it is better suited than SQLite to analytical queries over state.
### Task 0A: DuckDB schema + repository layer [INDEPENDENT]
New files:
- `src/db.py`: DuckDB connection management, schema creation, migration system
- `src/repositories/__init__.py`
- `src/repositories/sync_state.py`: CRUD for sync state
- `src/repositories/users.py`: CRUD for users + roles
- `src/repositories/knowledge.py`: CRUD for corporate memory
- `src/repositories/table_registry.py`: CRUD for the table registry
- `src/repositories/audit.py`: audit log
- `src/repositories/notifications.py`: telegram links, pending codes, script registry
Schema tables (mapping from JSON):
| Current JSON | DuckDB table | Source file |
|---|---|---|
| `sync_state.json` | `sync_state` | `src/data_sync.py:37-138` |
| `sync_settings.json` | `user_sync_settings` | `webapp/sync_settings_service.py:20` |
| `knowledge.json` | `knowledge_items` | `webapp/corporate_memory_service.py` |
| `votes.json` | `knowledge_votes` | `webapp/corporate_memory_service.py` |
| `audit.jsonl` | `audit_log` | `webapp/corporate_memory_service.py` |
| `telegram_users.json` | `telegram_links` | `services/telegram_bot/storage.py` |
| `pending_codes.json` | `pending_codes` | `services/telegram_bot/storage.py` |
| `password_users.json` | `users` | `webapp/password_auth.py` |
| `table_registry.json` | `table_registry` | `src/table_registry.py` |
| `profiles.json` | `table_profiles` | `src/profiler.py` |
Also add: `sync_history` (the last 10 syncs per table, not just the latest) and `script_registry` (deployed scripts).
### Task 0B: Migrate existing service files to repositories [DEPENDS ON 0A]
Files to modify (replace `_read_json`/`_write_json` with repository calls):
- `webapp/sync_settings_service.py`, lines 40-62
- `webapp/corporate_memory_service.py`: 31 JSON operations
- `webapp/telegram_service.py`, lines 22-45
- `src/data_sync.py`: class `SyncState`, lines 37-138
- `src/table_registry.py`: `_load`, `_atomic_write_json`
- `src/profiler.py`: saving profiles
- `services/corporate_memory/collector.py`: reading/writing knowledge
- `services/telegram_bot/storage.py`: 15 JSON operations
Pattern: dual-write (JSON + DuckDB) for a transition period → verify → delete the JSON writes.
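The dual-write pattern can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class names (`StateStore`, `JsonStore`, `MemoryStore`, `DualWriteStore`) are hypothetical, and `MemoryStore` is only a stand-in for the DuckDB-backed repository so that the sketch runs with the standard library alone.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Protocol


class StateStore(Protocol):
    """Minimal interface both backends implement (hypothetical)."""

    def save(self, key: str, value: dict[str, Any]) -> None: ...

    def load_all(self) -> dict[str, dict[str, Any]]: ...


class JsonStore:
    """Legacy backend: one JSON file, rewritten atomically on every save."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def load_all(self) -> dict[str, dict[str, Any]]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def save(self, key: str, value: dict[str, Any]) -> None:
        data = self.load_all()
        data[key] = value
        # Write to a temp file in the same directory, then rename over the target.
        fd, tmp_name = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        os.replace(tmp_name, self.path)


class MemoryStore:
    """Stand-in for the DuckDB-backed repository (assumption for this sketch)."""

    def __init__(self) -> None:
        self.rows: dict[str, dict[str, Any]] = {}

    def load_all(self) -> dict[str, dict[str, Any]]:
        return dict(self.rows)

    def save(self, key: str, value: dict[str, Any]) -> None:
        self.rows[key] = value


class DualWriteStore:
    """Writes go to both backends; reads still come from the legacy one."""

    def __init__(self, legacy: StateStore, new: StateStore) -> None:
        self.legacy = legacy
        self.new = new

    def save(self, key: str, value: dict[str, Any]) -> None:
        self.legacy.save(key, value)
        self.new.save(key, value)

    def load_all(self) -> dict[str, dict[str, Any]]:
        return self.legacy.load_all()

    def in_sync(self) -> bool:
        """The 'verify' step: both backends must hold identical data."""
        return self.legacy.load_all() == self.new.load_all()
```

Once `in_sync()` has held for the whole transition period, reads can be switched to the new backend and the legacy writes removed.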
### Task 0C: Migration script [DEPENDS ON 0A]
- `scripts/migrate_json_to_duckdb.py`: loads all JSON files and inserts them into DuckDB
- Idempotent (safe to run multiple times)
- Validation after migration (row count comparison)
### What does NOT change in Phase 0
- Flask routes in `webapp/app.py`
- HTML templates
- Connectors (`connectors/keboola/`, `connectors/bigquery/`, `connectors/jira/`)
- `src/config.py` (reads `data_description.md`, which is configuration, not state)
- `config/loader.py` (reads `instance.yaml`)
- `src/parquet_manager.py`
---
## Phase 1: API layer (FastAPI)
**Goal:** A REST API for the CLI, covering every operation that requires SSH today.
### Task 1A: FastAPI base + auth [INDEPENDENT of 0B, DEPENDS ON 0A]
New files:
```
api/
__init__.py
app.py # FastAPI app, middleware, CORS
auth.py # JWT issuing + validation
dependencies.py # DI for DuckDB session, current_user
```
Auth flow:
1. `POST /api/auth/login`: accepts an OAuth token from the webapp, issues a JWT
2. `POST /api/auth/token`: accepts an API key, issues a JWT
3. The JWT contains: user_id, email, role, expiry
4. Middleware validates the JWT on all /api/* endpoints
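To make steps 3 and 4 concrete, here is a standard-library sketch of issuing and validating an HS256 token that carries those claims. It is illustrative only: the function names and the `now` parameter (added for testability) are assumptions, and a real implementation would more likely rely on a maintained JWT library than on hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_jwt(secret: bytes, user_id: str, email: str, role: str,
              ttl_s: int = 3600, now: Optional[float] = None) -> str:
    """Create a token with the claims from step 3: user_id, email, role, expiry."""
    issued_at = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"user_id": user_id, "email": email, "role": role, "exp": issued_at + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_sign(secret, signing_input))}"


def verify_jwt(secret: bytes, token: str, now: Optional[float] = None) -> dict:
    """Return the claims, or raise ValueError on a bad signature or expiry."""
    try:
        header_b64, claims_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unsupported algorithm")
    expected = _sign(secret, f"{header_b64}.{claims_b64}")
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(claims_b64))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims
```

A middleware would call something like `verify_jwt` on the `Authorization: Bearer ...` header of every `/api/*` request and reject the request when it raises.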
### Task 1B: Sync + data endpoints [DEPENDS ON 1A, 0A]
```
api/routers/
sync.py # GET /api/sync/manifest, POST /api/sync/trigger
data.py # GET /api/data/{table}/download (parquet stream)
```
- `/api/sync/manifest`: returns hashes of all parquet files, docs, rules, and profiles (filtered per user by subscription)
- `/api/data/{table}/download`: streams the parquet file with ETag/If-None-Match
- `/api/sync/trigger`: starts DataSyncManager (reuses `src/data_sync.py`)
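A hash-based manifest and the ETag check could look roughly like the sketch below. The helper names are hypothetical, and two details are assumptions rather than anything the plan specifies: SHA-256 as the hash, and the table name being the parquet file's stem.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquet files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(parquet_root: Path, allowed_tables: set[str]) -> dict[str, dict]:
    """Map table name -> {hash, size}, limited to the user's subscribed tables."""
    manifest: dict[str, dict] = {}
    for path in sorted(parquet_root.rglob("*.parquet")):
        table = path.stem  # assumption: file stem is the table name
        if table in allowed_tables:
            manifest[table] = {"hash": file_sha256(path), "size": path.stat().st_size}
    return manifest


def is_not_modified(if_none_match: Optional[str], current_hash: str) -> bool:
    """True when the client's If-None-Match header matches, i.e. reply 304."""
    if not if_none_match:
        return False
    tags = set()
    for raw in if_none_match.split(","):
        tag = raw.strip()
        if tag.startswith("W/"):  # weak validators compare equal for this purpose
            tag = tag[2:]
        tags.add(tag.strip('"'))
    return "*" in tags or current_hash in tags


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sales").mkdir()
        (root / "sales" / "orders.parquet").write_bytes(b"placeholder bytes")
        print(build_manifest(root, {"orders"}))
```

Using the file hash as the ETag means the manifest and the download endpoint agree on what "changed" means, so the client can skip a download with either mechanism.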
### Task 1C: Query + scripts endpoints [DEPENDS ON 1A, 0A]
```
api/routers/
query.py # POST /api/query (remote query)
scripts.py # POST /api/scripts/run, /deploy, /list
```
- `/api/query`: reuses `src/remote_query.py`, returns the result as JSON/parquet
- `/api/scripts/run`: runs a Python script in a sandbox on the server
- `/api/scripts/deploy`: uploads a script and registers it in the scheduler
- `/api/scripts/list`: deployed scripts with their schedules
### Task 1D: User management + corporate memory endpoints [DEPENDS ON 1A, 0A]
```
api/routers/
users.py # User CRUD, roles, permissions
settings.py # GET/PUT sync settings per user
memory.py # Corporate memory CRUD, voting, governance
health.py # GET /api/health (structured diagnostics)
upload.py # POST sessions, artifacts, CLAUDE.local.md
```
### Task 1E: Remove SSH/sudo dependencies [DEPENDS ON 1B, 1D]
Delete/rewrite:
- `webapp/sync_settings_service.py`, lines 128-240 (sudo/rsync-filter code)
- `webapp/user_service.py`: Linux user management (`pwd.getpwnam`, `sudo add-analyst`)
- SSH key validation workflow
- `server/sudoers-webapp`, `server/sudoers-deploy`
- `server/bin/add-analyst`
---
## Phase 2: CLI tool (`da`)
**Goal:** A single interface for AI agents that replaces SSH + scripts, installed with `uv tool install`.
### Task 2A: CLI base + auth [INDEPENDENT of 1B-1E, DEPENDS ON 1A]
```
cli/
__init__.py
main.py # Typer app, global options (--server, --json)
config.py # ~/.config/da/ management
client.py # HTTP client wrapper (auth, retry, streaming)
commands/
auth.py # da login, da logout, da whoami
```
- `da login` → opens a browser for OAuth → the server issues a JWT → stored in `~/.config/da/token.json`
- `da --json` flag on all commands for structured output
- `da --server URL` override (default from config.yaml)
### Task 2B: Sync commands [DEPENDS ON 2A, 1B]
```
cli/commands/
sync.py # da sync, da sync --table X, da sync --upload-only
```
Flow:
1. `GET /api/sync/manifest` → compare with `~/.config/da/sync_state.json`
2. Download changed parquet files (HTTP streaming with a progress bar)
3. Download docs, rules, and profiles
4. Upload sessions, artifacts, and CLAUDE.local.md
5. Rebuild DuckDB views (DROP views, CREATE VIEW per table, keep user tables)
6. Update the local manifest
Replaces the functionality of `scripts/sync_data.sh` (475 lines).
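Step 1 of the flow, comparing the server manifest with the local sync state, reduces to a diff of two `{table: hash}` maps. The sketch below is one way to express it; `SyncPlan` and `plan_sync` are hypothetical names, and the `remove` list (tables that disappeared from the manifest) is an assumption the plan does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    download: list[str]   # new on the server, or hash differs from the local copy
    remove: list[str]     # present locally but no longer in the manifest
    unchanged: list[str]  # hash matches, nothing to transfer


def plan_sync(remote: dict[str, str], local: dict[str, str]) -> SyncPlan:
    """Compare {table: hash} maps from the server manifest and the local state."""
    download = sorted(t for t, digest in remote.items() if local.get(t) != digest)
    unchanged = sorted(t for t, digest in remote.items() if local.get(t) == digest)
    remove = sorted(t for t in local if t not in remote)
    return SyncPlan(download=download, remove=remove, unchanged=unchanged)


if __name__ == "__main__":
    remote_manifest = {"orders": "a1", "customers": "b2", "items": "c3"}
    local_state = {"orders": "a1", "customers": "old", "legacy": "zz"}
    print(plan_sync(remote_manifest, local_state))
```

Keeping the planning step a pure function makes it easy to unit-test separately from the HTTP downloads in steps 2 to 4.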
### Task 2C: Query + scripts commands [DEPENDS ON 2A, 1C]
```
cli/commands/
query.py # da query "SQL" [--remote] [--json]
scripts.py # da scripts list/run/deploy/undeploy
explore.py # da explore {table}: table profile
```
- `da query`: local DuckDB by default, `--remote` goes through the server API
- `da scripts run X`: local by default, `--remote` runs on the server
- `da scripts deploy X --schedule "cron"`: upload + registration on the server
- `da explore orders`: profile from local data (or from the server with `--remote`)
### Task 2D: Admin + server commands [DEPENDS ON 2A, 1D]
```
cli/commands/
admin.py # da admin add-user/remove-user/list-users
status.py # da status [--local]: system health
server.py # da server deploy/rollback/logs/status
diagnose.py # da diagnose: AI-friendly diagnostics
```
- `da status`: structured health report (tables, sync state, services)
- `da status --local`: offline view of when the last sync ran and how much data is available locally
- `da diagnose`: walks through logs, sync state, and connectivity → root cause
- `da server deploy`: wrapper around `kamal deploy`
- `da server logs webapp`: wrapper around `kamal app logs`
### Task 2E: PyPI distribution [DEPENDS ON 2A]
- `pyproject.toml` for the CLI package
- `uv tool install data-analyst` or `uv pip install data-analyst`
- Entry point: `[project.scripts] da = "cli.main:app"`
- Minimal dependencies: typer, httpx, duckdb, rich (progress bars)
---
## Phase 3: Docker + Kamal
**Goal:** `docker compose up` for dev, `kamal deploy` for production. Replaces 10+ manual steps.
### Task 3A: Dockerfile + docker-compose.yml [INDEPENDENT]
```
Dockerfile # python:3.13-slim, uv install, single image
docker-compose.yml # webapp, api, scheduler, telegram-bot, ws-gateway
docker-compose.test.yml # api + test-runner for integration tests
```
- One image, different CMD per service
- Volume `/data` shared between containers
- `profiles: ["full"]` for optional services (telegram, ws-gateway)
- `uv sync` instead of `pip install` in the Dockerfile
### Task 3B: Scheduler service [DEPENDS ON 0A]
New file: `services/scheduler/__main__.py`
- APScheduler (or a simple custom scheduler) replaces 7 systemd timers:
| Timer | Schedule | Function |
|---|---|---|
| data-refresh | 15 min | `DataSyncManager.sync_scheduled()` |
| catalog-refresh | 15 min | Catalog refresh |
| corporate-memory | 30 min | Knowledge collector |
| session-collector | 6h | Session collection (from uploaded data) |
| user-scripts | per-script cron | Script runner |
| profiler | after data-refresh | Auto-profile new data |
### Task 3C: Kamal configuration [DEPENDS ON 3A]
```
config/
deploy.yml # production Kamal config
deploy.staging.yml # staging override
```
- Kamal Proxy for auto-SSL (Let's Encrypt)
- Healthcheck on `/api/health`
- Zero-downtime deploy
- Accessories: scheduler, telegram-bot, ws-gateway, script-runner
- Environment secrets via Kamal env management
### Task 3D: GitHub Actions CI/CD [DEPENDS ON 3A, 3C]
```
.github/workflows/
ci.yml # test + build on every push
deploy.yml # staging on PR, production on merge to main
```
Flow: push → pytest → integration tests (docker compose) → build image → push to GHCR → kamal deploy staging (PR) / production (merge)
### Task 3E: Delete old server infra [DEPENDS ON 3A-3D, after verifying the new setup works]
Delete:
- `server/setup.sh` (103 lines)
- `server/webapp-setup.sh` (171 lines)
- `server/deploy.sh` (395 lines)
- `server/migrate-to-v2.sh` (146 lines)
- All systemd unit files (`services/*/systemd/`)
- `server/sudoers-*`
- `server/bin/add-analyst` and related scripts
- `scripts/sync_data.sh` (475 lines)
- `server/webapp.service`, `server/webapp-nginx.conf`
---
## Phase 4: RBAC + security
**Goal:** Application-level RBAC instead of Linux groups. Audit trail. Script sandboxing.
### Task 4A: Roles + permissions in DuckDB [DEPENDS ON 0A]
New file: `src/rbac.py`
```python
from enum import Enum


class Role(Enum):
    VIEWER = "viewer"      # Catalog, reading data
    ANALYST = "analyst"    # Sync, queries, voting, scripts
    ADMIN = "admin"        # User management, knowledge approval
    KM_ADMIN = "km_admin"  # Corporate memory governance
```
- Dataset-level permissions (who has access to which data)
- Rewrite `webapp/auth.py` lines 37-65 (admin_required/km_admin_required)
- Rewrite all of `webapp/user_service.py`: DB instead of `pwd.getpwnam()` + `sudo`
### Task 4B: Audit trail [DEPENDS ON 0A]
- Every API call is logged to the `audit_log` table
- Structure: timestamp, user_id, action, resource, params, result, duration
- An agent can run: `da query "SELECT * FROM system.audit_log WHERE action='sync_trigger' ORDER BY timestamp DESC LIMIT 10"`
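One way to record these fields is a decorator around each handler. The sketch below is illustrative only: `audited`, `AUDIT_LOG`, and `trigger_sync` are hypothetical names, the in-memory list stands in for the `audit_log` table, and the duration is stored in milliseconds as an assumption.

```python
import functools
import time
from datetime import datetime, timezone
from typing import Any, Callable

# Stand-in for the audit_log table so the sketch runs without a database.
AUDIT_LOG: list[dict[str, Any]] = []


def audited(action: str, resource: str) -> Callable:
    """Wrap a handler so every call appends one audit record, even on failure."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(user_id: str, **params: Any) -> Any:
            started = time.perf_counter()
            result = "error"  # overwritten only if the handler returns normally
            try:
                value = func(user_id, **params)
                result = "ok"
                return value
            finally:
                AUDIT_LOG.append({
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "user_id": user_id,
                    "action": action,
                    "resource": resource,
                    "params": params,
                    "result": result,
                    "duration_ms": int((time.perf_counter() - started) * 1000),
                })

        return wrapper

    return decorator


@audited(action="sync_trigger", resource="sync")
def trigger_sync(user_id: str, table: str) -> str:
    """Hypothetical handler; a real one would start the sync job."""
    return f"sync queued for {table}"
```

Writing the record in `finally` means failed calls are audited too, which is what makes the log useful for diagnosing why a sync did not happen.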
### Task 4C: Script sandboxing [DEPENDS ON 3A]
- Script-runner as an isolated Docker container
- Read-only access to DuckDB
- Limited memory (512MB), time (5min), and no network (except notification dispatch)
- Explicit whitelist of Python packages (pandas, duckdb, matplotlib)
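As a rough illustration of these limits, the sketch below builds a `docker run` command line with a memory cap, no network, a read-only root filesystem, and read-only mounts, and applies the time limit from the calling side. The image name, mount paths, and function names are assumptions. The sketch disables networking entirely, so the notification-dispatch exception would need a separate mechanism.

```python
import subprocess


def sandbox_command(image: str, script_path: str, data_dir: str,
                    memory: str = "512m") -> list[str]:
    """Build the docker argv for one sandboxed script run (paths are hypothetical)."""
    return [
        "docker", "run", "--rm",
        "--memory", memory,          # hard memory cap
        "--network", "none",         # no network access at all in this sketch
        "--read-only",               # read-only root filesystem
        "--volume", f"{data_dir}:/data:ro",                  # DuckDB + parquet, read-only
        "--volume", f"{script_path}:/sandbox/script.py:ro",  # the user script
        image, "python", "/sandbox/script.py",
    ]


def run_sandboxed(argv: list[str], timeout_s: int = 300) -> subprocess.CompletedProcess:
    """Run the command with a 5-minute wall-clock limit.

    Note: on timeout this stops the docker client process; a production runner
    would also need to make sure the container itself is stopped.
    """
    return subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)


if __name__ == "__main__":
    print(" ".join(sandbox_command("script-runner:latest", "/srv/scripts/sales_alert.py", "/data")))
```

The package whitelist is not shown here; it would typically be enforced by what is installed in the runner image.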
### Task 4D: Corporate memory push model [DEPENDS ON 1D]
- Users push CLAUDE.local.md via `da sync --upload-only`
- The server never reads `/home/*/` as root
- The corporate memory collector processes uploaded data from the DB
---
## Dependency graph for multi-agent work
```
Phase 0:
0A (DuckDB schema) ─────────────────────┐
0C (migration script) ← depends on 0A │
0B (service migration) ← depends on 0A │
Phase 1: │
1A (FastAPI base) ← depends on 0A ─────┤
1B (sync/data EP) ← depends on 1A, 0A │
1C (query/scripts EP) ← depends on 1A, 0A │
1D (users/memory EP) ← depends on 1A, 0A │
1E (remove SSH) ← depends on 1B, 1D │
Phase 2: │
2A (CLI base) ← depends on 1A │
2B (sync cmd) ← depends on 2A, 1B │
2C (query/scripts cmd) ← depends on 2A, 1C │
2D (admin/server cmd) ← depends on 2A, 1D │
2E (PyPI) ← depends on 2A │
Phase 3: │
3A (Dockerfile) ← INDEPENDENT ──────────┘
3B (scheduler) ← depends on 0A
3C (Kamal) ← depends on 3A
3D (CI/CD) ← depends on 3A, 3C
3E (cleanup) ← depends on 3A-3D verified
Phase 4:
4A (RBAC) ← depends on 0A
4B (audit) ← depends on 0A
4C (sandbox) ← depends on 3A
4D (push model) ← depends on 1D
```
### Parallel agents: optimal distribution
```
AGENT 1: DuckDB + Repositories AGENT 2: FastAPI AGENT 3: Docker + Kamal
───────────────────────────── ───────────────── ──────────────────────
0A: DuckDB schema (waits for 0A) 3A: Dockerfile + compose
0C: migration script 1A: FastAPI base 3B: scheduler service
0B: service migration 1B: sync/data EP 3C: Kamal configuration
4A: RBAC 1C: query/scripts EP 3D: CI/CD workflow
4B: audit trail 1D: users/memory EP 4C: script sandbox
1E: remove SSH deps
AGENT 4: CLI + Skills AGENT 5: Integration + Cleanup
───────────────────── ───────────────────────────
(waits for 1A) (waits for agents 1-4)
2A: CLI base + auth End-to-end testing
2B: sync commands 3E: delete old infra
2C: query/scripts commands 4D: corporate memory push
2D: admin/server commands 5A: CLAUDE.md template update
2E: PyPI distribution Documentation update
5B: CLI skills (help/docs)
5C: da setup (interactive)
5D: da diagnose
5E: da infra (multi-customer)
```
---
## Reused vs. rewritten files
### Unchanged (business logic preserved)
- `src/config.py`: TableConfig, Config parsing (625 lines)
- `src/parquet_manager.py`: Parquet conversion engine
- `connectors/keboola/adapter.py` + `client.py`
- `connectors/bigquery/adapter.py` + `client.py`
- `connectors/jira/`: the entire connector
- `connectors/llm/`: LLM abstraction
- `connectors/openmetadata/`: catalog enrichment
- `webapp/config.py`, `config/loader.py`
- `webapp/templates/`: all HTML templates
- `src/remote_query.py`: query logic (wrapped by the API)
- `src/profiler.py`: profiling logic (output to DuckDB)
### Rewired to DuckDB (logic preserved, I/O layer swapped)
- `webapp/corporate_memory_service.py`
- `webapp/sync_settings_service.py`
- `webapp/telegram_service.py`
- `src/data_sync.py` (SyncState class)
- `src/table_registry.py`
- `services/corporate_memory/collector.py`
- `services/telegram_bot/storage.py`
### Rewritten
- `webapp/user_service.py`: DB instead of Linux users
- `webapp/auth.py` lines 37-65: RBAC instead of Linux groups
### New
- `src/db.py`, `src/repositories/`, `src/rbac.py`
- `api/`: the entire FastAPI server
- `cli/`: the entire CLI tool
- `Dockerfile`, `docker-compose*.yml`, `config/deploy*.yml`
- `services/scheduler/__main__.py`
- `.github/workflows/ci.yml`, `.github/workflows/deploy.yml`
### Deleted
- `server/setup.sh`, `server/webapp-setup.sh`, `server/deploy.sh`
- `server/migrate-to-v2.sh`
- `server/sudoers-*`, `server/bin/add-analyst`
- `scripts/sync_data.sh`
- All `services/*/systemd/` files
- `server/webapp.service`, `server/webapp-nginx.conf`
---
## Phase 5: Agent Skills (CLAUDE.md + CLI skills)
**Goal:** The AI agent has built-in knowledge for deployment, administration, diagnostics, and development. It does not need to search the web for anything; everything is in the skills.
### Task 5A: CLAUDE.md template for analysts [INDEPENDENT]
Update `docs/setup/claude_md_template.md`:
- Instructions for the `da` CLI instead of SSH/rsync
- `da sync` as the mandatory start of a session
- How to work with the local DuckDB
- How to create and deploy scripts
- How to use corporate memory
- Notification patterns (local vs. server-side)
### Task 5B: Admin/deploy skills in the CLI [DEPENDS ON 2D]
The `da` CLI will include built-in skills: long help texts with domain knowledge that an AI agent reads via `da <command> --help` or `da skills <topic>`:
```bash
da skills list # list of all available skills
da skills setup # complete guide to setting up a new instance
da skills troubleshoot # diagnostic procedures
da skills connectors # how to add a new data source
da skills notifications # how notifications work
da skills corporate-memory # governance, approval flow
da skills security # RBAC, permissions, audit
da skills backup-restore # disaster recovery
da skills upgrade # how to upgrade to a new version
```
Each skill is a markdown file in `cli/skills/` that is displayed via `da skills <name>`.
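Since a skill is just a markdown file, the lookup behind `da skills list` and `da skills <name>` can stay very small. The following is a sketch under that assumption; `list_skills` and `show_skill` are hypothetical helper names, not the CLI's actual functions.

```python
import tempfile
from pathlib import Path


def list_skills(skills_dir: Path) -> list[str]:
    """Skill names are the stems of the markdown files in the skills directory."""
    return sorted(path.stem for path in skills_dir.glob("*.md"))


def show_skill(skills_dir: Path, name: str) -> str:
    """Return the markdown body of one skill.

    Checking the name against the listing also rejects path tricks such as
    '../secrets', because only stems of existing *.md files are accepted.
    """
    if name not in list_skills(skills_dir):
        raise KeyError(f"unknown skill: {name}")
    return (skills_dir / f"{name}.md").read_text(encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        skills = Path(tmp)
        (skills / "setup.md").write_text("# Setup\nStep-by-step guide.", encoding="utf-8")
        print(list_skills(skills))
        print(show_skill(skills, "setup"))
```

Returning plain text keeps the output equally readable for a human in a terminal and for an agent that pipes it into its context.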
### Task 5C: Interactive setup skill [DEPENDS ON 2D, 1A]
```bash
da setup # the AI agent starts the interactive setup
```
Flow (driven by the agent):
1. `da setup init` → generates `instance.yaml` from the conversation with the user
2. `da setup test-connection` → verifies credentials (Keboola/BigQuery)
3. `da setup deploy` → `docker compose up` or `kamal deploy`
4. `da setup first-sync` → triggers the first data sync
5. `da setup verify` → healthcheck, table count, sample query
6. `da setup add-user` → adds the first analyst
Each step returns structured JSON, so the agent knows what to do next.
### Task 5D: Diagnose skill [DEPENDS ON 2D, 1D]
```bash
da diagnose # full diagnostics
da diagnose --symptom "data not updating" # targeted diagnostics
da diagnose --component scheduler # diagnostics for a single service
```
Output (structured for the agent):
```json
{
"overall": "degraded",
"checks": [
{"name": "api", "status": "ok", "latency_ms": 12},
{"name": "scheduler", "status": "ok", "last_run": "2026-03-27T08:00"},
{"name": "data_freshness", "status": "warning",
"detail": "table 'orders' last synced 26h ago, expected 15min",
"suggested_action": "da server logs scheduler | grep orders"},
{"name": "disk", "status": "ok", "usage": "45%"},
{"name": "duckdb", "status": "ok", "tables": 47, "total_rows": "12.3M"}
],
"suggested_actions": [
"Check scheduler logs for 'orders' sync failures",
"Run: da server logs scheduler --since 24h | grep -i error"
]
}
```
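The `overall` field can be derived from the individual checks. The sketch below shows one possible rule: the worst check status wins. Only `ok`, `warning`, and `degraded` appear in the example output; the `error` status and the `down` overall value are assumptions added for illustration, as is collecting the per-check `suggested_action` values into the top-level list.

```python
# Severity ranking of check statuses ("error" is an assumed extra status).
SEVERITY = {"ok": 0, "warning": 1, "error": 2}
# Overall label for the worst severity seen ("down" is an assumed label).
OVERALL = {0: "ok", 1: "degraded", 2: "down"}


def summarize(checks: list[dict]) -> dict:
    """Build the report: worst status decides 'overall', actions are collected."""
    worst = max((SEVERITY[check["status"]] for check in checks), default=0)
    actions = [check["suggested_action"] for check in checks if "suggested_action" in check]
    return {"overall": OVERALL[worst], "checks": checks, "suggested_actions": actions}


if __name__ == "__main__":
    import json

    report = summarize([
        {"name": "api", "status": "ok", "latency_ms": 12},
        {"name": "data_freshness", "status": "warning",
         "suggested_action": "da server logs scheduler | grep orders"},
    ])
    print(json.dumps(report, indent=2))
```

A single deterministic rule for `overall` lets an agent branch on one field before it reads the individual checks.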
### Task 5E: Operational skills for multi-customer [DEPENDS ON 3C]
```bash
da infra list # list of customer instances
da infra provision --customer acme --cloud gcp --region europe-west1
da infra status acme # health of a customer instance
da infra deploy acme # deploy to the customer's server
da infra backup acme # data snapshot
```
Future extension: Terraform under the hood for provisioning, Kamal for deploys.
---
## Verification
### Per phase
1. **Phase 0:** `pytest tests/` is green, the webapp behaves identically with the DuckDB backend
2. **Phase 1:** `curl /api/health` → ok, `curl /api/sync/manifest` → manifest, parquet download works
3. **Phase 2:** `da login && da sync` creates the same structure as `sync_data.sh`, `da query` works offline
4. **Phase 3:** `docker compose up` → all services running, `kamal deploy -d staging` → staging works
5. **Phase 4:** a viewer cannot trigger a sync, an admin can manage users, scripts run in the sandbox
### End-to-end test (full flow)
1. `docker compose up -d` (or `kamal deploy`)
2. Via the webapp: log in, select datasets
3. `da login && da sync` → parquet files available locally
4. `da query "SELECT count(*) FROM orders"` → result offline
5. `da scripts run sales_alert` → local execution
6. `da scripts deploy sales_alert --schedule "0 8 * * MON"` → server-side execution
7. `da sync --upload-only` → sessions/artifacts on the server
8. Corporate memory: knowledge items visible in the webapp
9. Telegram notifications delivered
10. `da diagnose` → structured health report

File diff suppressed because it is too large


@@ -0,0 +1,524 @@
# AI Data Analyst — Refactoring Design Spec
**Date:** 2026-03-27
**Status:** Draft
**Target:** Greenfield demo with Keboola internal data
## 1. Problem Statement
The platform was built iteratively as an internal tool and needs to become a product for external customers (Groupon, others). Key problems:
1. **Fragile filesystem state** — 10+ JSON files, permission conflicts between processes (www-data, deploy, root, user) cause outages
2. **No API** — all operations via SSH + bash scripts, no programmatic control
3. **Security via Linux groups** — no real RBAC, SSH keys visible in `ps aux`, root reads user homes
4. **Complex installation** — 10+ manual steps, specific OS requirements, dual-repo pattern with symlinks
5. **Operations nightmare** — scattered scripts, no unified logging/monitoring, creator calls it "duct tape solution"
The system is designed for AI agents — humans discuss with AI, AI handles everything (user, admin, dev operations).
**Constraint:** UX must remain identical. Web catalog, data sync, offline Claude Code analysis, Telegram notifications, corporate memory — all preserved.
## 2. Architecture
### Target State
```
SERVER (Docker + Kamal):
┌──────────────────────────────────────────────────┐
│ FastAPI Main App (1 process) │
│ ├── Web UI (Jinja2 templates) │
│ ├── REST API (/api/*) │
│ ├── WebSocket (/ws/notifications) │
│ └── Auth (JWT + pluggable providers) │
└──────────────────────────────────────────────────┘
┌─────────────────┐ ┌──────────────────────────────┐
│ Scheduler sidecar│ │ Telegram bot (optional) │
│ Calls /api/ │ │ Long-running daemon │
└─────────────────┘ └──────────────────────────────┘
/data/state/system.duckdb ← system state (users, sync, knowledge, audit)
/data/analytics/server.duckdb ← views on parquet files
/data/parquet/** ← data files
LOCAL (analyst):
┌──────────────────────────────────────────────────┐
│ da CLI (uv tool install data-analyst-cli) │
│ user/duckdb/analytics.duckdb ← views + user tbls│
│ server/parquet/** ← downloaded via da sync │
│ Claude Code ← works offline with DuckDB │
└──────────────────────────────────────────────────┘
```
### Key Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Web framework | FastAPI only (no Flask) | One framework, OpenAPI auto-schema, async native, Jinja2 support |
| State storage | DuckDB | Already in stack, agent can join state with analytics, better than SQLite for analytical queries |
| CLI tool | `da` via `uv tool install` | AI-agent native interface, no Docker dependency locally |
| Server deploy | Docker + Kamal | Zero-downtime deploys, auto-SSL, simple config |
| Architecture | Hybrid (main app + scheduler sidecar + optional telegram) | 3 containers max, WebSocket in main app |
| Auth providers | All 3 (Google OAuth + Email magic link + Password) | Full compatibility with existing users |
| LLM provider | Configurable in instance.yaml | User chooses: local Ollama, Anthropic, OpenAI, AI Gateway |
| Python tooling | uv everywhere (no pip) | Faster, deterministic, modern |
## 3. Data Layer
### Server DuckDB: system.duckdb
```sql
-- Users & RBAC
CREATE TABLE users (
id VARCHAR PRIMARY KEY,
email VARCHAR UNIQUE NOT NULL,
name VARCHAR,
role VARCHAR DEFAULT 'analyst', -- viewer, analyst, admin, km_admin
password_hash VARCHAR,
setup_token VARCHAR,
reset_token VARCHAR,
created_at TIMESTAMP DEFAULT current_timestamp,
updated_at TIMESTAMP
);
CREATE TABLE user_sync_settings (
user_id VARCHAR REFERENCES users(id),
dataset VARCHAR NOT NULL,
enabled BOOLEAN DEFAULT false,
table_mode VARCHAR DEFAULT 'all', -- all, explicit
tables JSON,
updated_at TIMESTAMP,
PRIMARY KEY (user_id, dataset)
);
CREATE TABLE dataset_permissions (
user_id VARCHAR REFERENCES users(id),
dataset VARCHAR NOT NULL,
access VARCHAR DEFAULT 'read', -- read, none
PRIMARY KEY (user_id, dataset)
);
-- Sync state + history
CREATE TABLE sync_state (
table_id VARCHAR PRIMARY KEY,
last_sync TIMESTAMP,
rows BIGINT,
file_size_bytes BIGINT,
uncompressed_size_bytes BIGINT,
columns INTEGER,
hash VARCHAR,
status VARCHAR DEFAULT 'ok',
error TEXT
);
CREATE TABLE sync_history (
id VARCHAR PRIMARY KEY,
table_id VARCHAR NOT NULL,
synced_at TIMESTAMP NOT NULL,
rows BIGINT,
duration_ms INTEGER,
status VARCHAR,
error TEXT
);
-- Corporate memory
CREATE TABLE knowledge_items (
id VARCHAR PRIMARY KEY,
title VARCHAR NOT NULL,
content TEXT,
category VARCHAR,
tags TEXT[],
status VARCHAR DEFAULT 'pending', -- pending, approved, mandatory, rejected
contributors TEXT[],
source_user VARCHAR,
audience VARCHAR,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE TABLE knowledge_votes (
item_id VARCHAR REFERENCES knowledge_items(id),
user_id VARCHAR REFERENCES users(id),
vote INTEGER, -- 1 or -1
voted_at TIMESTAMP,
PRIMARY KEY (item_id, user_id)
);
-- Audit
CREATE TABLE audit_log (
id VARCHAR PRIMARY KEY,
timestamp TIMESTAMP NOT NULL,
user_id VARCHAR,
action VARCHAR NOT NULL,
resource VARCHAR,
params JSON,
result VARCHAR,
duration_ms INTEGER
);
-- Notifications
CREATE TABLE telegram_links (
user_id VARCHAR PRIMARY KEY REFERENCES users(id),
chat_id BIGINT NOT NULL,
linked_at TIMESTAMP
);
CREATE TABLE pending_codes (
code VARCHAR PRIMARY KEY,
chat_id BIGINT NOT NULL,
created_at TIMESTAMP
);
CREATE TABLE script_registry (
id VARCHAR PRIMARY KEY,
name VARCHAR NOT NULL,
owner VARCHAR REFERENCES users(id),
schedule VARCHAR, -- cron expression or null
source TEXT NOT NULL,
deployed_at TIMESTAMP,
last_run TIMESTAMP,
last_status VARCHAR
);
-- Table registry
CREATE TABLE table_registry (
id VARCHAR PRIMARY KEY,
name VARCHAR NOT NULL,
folder VARCHAR,
sync_strategy VARCHAR,
primary_key VARCHAR,
description TEXT,
registered_by VARCHAR,
registered_at TIMESTAMP
);
-- Profiles
CREATE TABLE table_profiles (
table_id VARCHAR PRIMARY KEY,
profile JSON NOT NULL,
profiled_at TIMESTAMP
);
```
### Server DuckDB: server.duckdb
Auto-generated views on parquet files:
```sql
CREATE VIEW orders AS SELECT * FROM read_parquet('/data/parquet/sales/orders.parquet');
CREATE VIEW customers AS SELECT * FROM read_parquet('/data/parquet/sales/customers.parquet');
-- Generated from schema.yml by profiler/sync
```
### Local DuckDB: analytics.duckdb
Views on local parquets (generated by `da sync`):
```sql
CREATE VIEW orders AS SELECT * FROM read_parquet('./server/parquet/sales/orders.parquet');
-- User-created tables survive da sync (rebuild drops only views, not tables)
```
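The "drop only views, not tables" rebuild can be expressed as a list of SQL statements generated from the current view names and the parquet files on disk. This is a sketch, not the tool's implementation: the function names are hypothetical, the view name is assumed to be the file stem, and the caller is assumed to supply the existing view names (for example from `information_schema.tables` filtered to views).

```python
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(text: str) -> str:
    """Quote an SQL string literal, doubling any embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def rebuild_view_statements(existing_views: list[str], parquet_files: list[Path]) -> list[str]:
    """Drop every existing view, then create one view per parquet file.

    User-created tables are never mentioned, so they survive the rebuild.
    """
    statements = [f"DROP VIEW IF EXISTS {quote_ident(view)};" for view in sorted(existing_views)]
    for path in sorted(parquet_files):
        statements.append(
            f"CREATE VIEW {quote_ident(path.stem)} AS "
            f"SELECT * FROM read_parquet({quote_literal(path.as_posix())});"
        )
    return statements


if __name__ == "__main__":
    for statement in rebuild_view_statements(
        ["orders"], [Path("server/parquet/sales/orders.parquet")]
    ):
        print(statement)
```

Generating the statements separately from executing them keeps the rebuild easy to test and to log before it touches the database.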
### Repository Pattern
```
src/repositories/
__init__.py # get_system_db(), get_analytics_db() factories
users.py # UserRepository (CRUD + role checks)
sync_state.py # SyncStateRepository (state + history)
knowledge.py # KnowledgeRepository (items + votes + governance)
audit.py # AuditRepository (append + query)
scripts.py # ScriptRepository (registry + scheduling)
table_registry.py # TableRegistryRepository
notifications.py # TelegramRepository + PendingCodeRepository
```
## 4. API Endpoints
### FastAPI Router Structure
```
app/
main.py # FastAPI app, lifespan events, middleware
auth/
router.py # POST /auth/login, /auth/token, /auth/logout
jwt.py # JWT create/verify (PyJWT)
providers/ # Pluggable: google/, email/, password/
dependencies.py # get_current_user, require_role(Role)
web/
router.py # Web UI: GET /, /catalog, /memory, /settings...
templates/ # Jinja2 (migrated from webapp/templates/)
static/ # CSS, JS, images
api/
sync.py # GET /api/sync/manifest, POST /api/sync/trigger
data.py # GET /api/data/{table}/download
query.py # POST /api/query
scripts.py # GET/POST /api/scripts, POST /api/scripts/{id}/run
users.py # CRUD /api/users
settings.py # GET/PUT /api/users/{id}/settings
memory.py # CRUD /api/memory, POST /api/memory/{id}/vote
health.py # GET /api/health
upload.py # POST /api/upload/sessions, /artifacts, /local-md
ws/
notifications.py # WebSocket /ws/notifications
```
### Key Endpoints
| Endpoint | Method | Auth | Purpose |
|----------|--------|------|---------|
| `/api/sync/manifest` | GET | JWT (analyst+) | Hash-based manifest of all synced data |
| `/api/sync/trigger` | POST | JWT (admin) | Trigger data sync from source |
| `/api/data/{table}/download` | GET | JWT (analyst+) | Stream parquet file (ETag support) |
| `/api/query` | POST | JWT (analyst+) | Execute SQL against server DuckDB |
| `/api/scripts` | GET/POST | JWT (analyst+) | List/deploy user scripts |
| `/api/scripts/{id}/run` | POST | JWT (analyst+) | Execute script in sandbox |
| `/api/users` | GET/POST/DELETE | JWT (admin) | User management |
| `/api/memory` | GET/POST/PUT | JWT (analyst+) | Corporate memory CRUD |
| `/api/health` | GET | none | Structured health check |
| `/api/upload/sessions` | POST | JWT (analyst+) | Upload Claude session transcripts |
| `/api/upload/local-md` | POST | JWT (analyst+) | Upload CLAUDE.local.md content |
### Sync Protocol
1. CLI calls `GET /api/sync/manifest` → receives hashes per table/asset
2. CLI compares with local `~/.config/da/sync_state.json`
3. For each changed table: `GET /api/data/{table}/download` → streaming to `./server/parquet/`
4. Download changed docs, rules, profiles, scripts
5. Upload new sessions, artifacts, CLAUDE.local.md content
6. Rebuild local DuckDB views (preserve user-created tables)
7. Update local sync manifest
## 5. CLI Tool (`da`)
### Structure
```
cli/
main.py # Typer app, --server/--json global options
config.py # ~/.config/da/ management (token, server URL, sync state)
client.py # httpx async client (JWT auth, retry, streaming, progress bars)
duckdb_local.py # Local DuckDB management (create views, query, explore)
commands/
auth.py # da login/logout/whoami
sync.py # da sync [--table X] [--upload-only] [--docs-only]
query.py # da query "SQL" [--remote] [--json] [--format csv/table/json]
scripts.py # da scripts list/run/deploy/undeploy
explore.py # da explore {table}
admin.py # da admin add-user/remove-user/list-users/set-role
status.py # da status [--local] [--json]
server.py # da server deploy/rollback/logs/status/backup
setup.py # da setup init/test-connection/deploy/first-sync/verify
diagnose.py # da diagnose [--symptom X] [--component Y]
skills.py # da skills list/show
infra.py # da infra provision/status/deploy (future)
skills/ # Markdown knowledge base for AI agents
setup.md
troubleshoot.md
connectors.md
notifications.md
corporate-memory.md
security.md
backup-restore.md
upgrade.md
```
### Distribution
```toml
[project]
name = "data-analyst-cli"
requires-python = ">=3.11"
dependencies = ["typer>=0.12", "httpx>=0.27", "duckdb>=1.1", "rich>=13", "pyjwt>=2.8"]
[project.scripts]
da = "cli.main:app"
```
Install: `uv tool install data-analyst-cli`
### Offline Capability
After `da sync`, everything works without network:
- `da query` → local DuckDB
- `da scripts run` → local Python execution
- `da explore` → local profile data
- `da status --local` → sync timestamps from local manifest
## 6. Deploy & Infrastructure
### Docker
```dockerfile
FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
CMD ["uv", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### Docker Compose (dev)
```yaml
services:
app:
build: .
ports: ["8000:8000"]
volumes: [".:/app", "data:/data"]
env_file: .env
command: uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
scheduler:
build: .
volumes: ["data:/data"]
env_file: .env
command: uv run python -m services.scheduler
telegram-bot:
build: .
volumes: ["data:/data"]
env_file: .env
command: uv run python -m services.telegram_bot
profiles: ["full"]
volumes:
data:
```
### Scheduler Sidecar
The scheduler is a lightweight process that triggers jobs by calling the main app's API:
```python
# services/scheduler/__main__.py
import os

import httpx
from apscheduler.schedulers.blocking import BlockingScheduler

API_URL = os.environ.get("API_URL", "http://app:8000")
API_TOKEN = os.environ.get("SCHEDULER_API_TOKEN")  # internal service token
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

scheduler = BlockingScheduler()

@scheduler.scheduled_job("interval", minutes=15)
def data_refresh():
    # Ask the main app to run a data sync; the scheduler holds no logic itself.
    httpx.post(f"{API_URL}/api/sync/trigger", headers=HEADERS)

@scheduler.scheduled_job("interval", minutes=30)
def corporate_memory():
    # Ask the main app to collect knowledge for corporate memory.
    httpx.post(f"{API_URL}/api/internal/collect-knowledge", headers=HEADERS)

# ... more jobs

scheduler.start()  # blocks; the container restarts the process if it exits
```
This keeps all business logic in the main app. The scheduler is stateless and restartable.
### Kamal (production)
- Auto-SSL via Kamal Proxy (Let's Encrypt)
- Zero-downtime deploy
- Healthcheck on `/api/health`
- Staging: `kamal deploy -d staging`
- Production: `kamal deploy`
- Rollback: `kamal rollback`
### CI/CD (GitHub Actions)
```
push → pytest (unit) → docker compose test (integration) → build+push GHCR
PR → kamal deploy staging
merge main → kamal deploy production
```
## 7. Security
### RBAC
| Role | Permissions |
|------|-------------|
| `viewer` | Read catalog, view profiles, browse corporate memory |
| `analyst` | + sync data, run queries, vote on knowledge, run/deploy scripts |
| `admin` | + manage users, approve knowledge, trigger sync, view audit |
| `km_admin` | + corporate memory governance (approve/reject/mandate) |

Dataset-level permissions restrict which datasets each user can access.
### Auth Flow
1. Web: user logs in via Google OAuth / Email magic link / Password
2. Server issues JWT (contains: user_id, email, role, exp)
3. CLI: `da login` → OAuth browser flow → JWT stored in `~/.config/da/token.json`
4. All API calls include JWT in Authorization header
5. FastAPI dependency validates JWT + checks role permissions
### Audit Trail
Every API call is logged to the `audit_log` table:
- timestamp, user_id, action, resource, params, result, duration_ms
- Queryable by agent: `da query "SELECT * FROM system.audit_log WHERE ..."`
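
One way to capture these fields uniformly is a decorator around each handler, sketched below. The `audited` name, the `"ok"`/`"error"` result values, and the list used as a sink are assumptions; in the real system the record would be written to the `audit_log` table instead.

```python
import functools
import time
from datetime import datetime, timezone


def audited(action: str, resource: str, sink: list):
    """Record one audit entry per call, whether it succeeds or raises."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(user_id, **params):
            started = time.perf_counter()
            result = "error"  # overwritten only if the call completes
            try:
                value = func(user_id, **params)
                result = "ok"
                return value
            finally:
                sink.append({
                    "timestamp": datetime.now(timezone.utc).isoformat(),
                    "user_id": user_id,
                    "action": action,
                    "resource": resource,
                    "params": params,
                    "result": result,
                    "duration_ms": round((time.perf_counter() - started) * 1000, 3),
                })

        return wrapper

    return decorator
```

Writing the entry in `finally` means failed calls are logged too, which matters for diagnosing incidents after the fact.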
### Script Sandboxing
User scripts run in an isolated Docker container:
- Read-only DuckDB access
- Memory limit: 512 MB, time limit: 5 minutes
- No network (except notification dispatch)
- Whitelisted Python packages: pandas, duckdb, matplotlib, numpy
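
These constraints map fairly directly onto `docker run` flags. The sketch below only builds the command line; the function name, the mount paths inside the container, and the image name are hypothetical. It assumes the image ships the coreutils `timeout` command (true for Debian-based Python images) and that notifications are dispatched by the host after the script exits, since the container itself has no network.

```python
def sandbox_command(
    image: str,
    script_path: str,
    duckdb_path: str,
    memory: str = "512m",
    timeout_seconds: int = 300,
) -> list[str]:
    """Build a `docker run` argv that applies the sandbox limits."""
    return [
        "docker", "run", "--rm",
        "--network", "none",        # no network inside the container
        "--memory", memory,         # hard memory limit
        "--read-only",              # read-only root filesystem
        # Mount the database and the script read-only (hypothetical paths).
        "--volume", f"{duckdb_path}:/data/analytics.duckdb:ro",
        "--volume", f"{script_path}:/sandbox/script.py:ro",
        image,
        # Enforce the time limit inside the container so the script itself is killed.
        "timeout", str(timeout_seconds), "python", "/sandbox/script.py",
    ]


cmd = sandbox_command(
    "script-runner:latest", "/scripts/report.py", "/data/analytics/server.duckdb"
)
```

The resulting list can be passed to `subprocess.run(cmd)`. Restricting packages to the whitelist is a property of how the image is built and is not shown here.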
## 8. Testing Strategy
```
tests/
unit/ # No I/O, mocked dependencies
test_repositories.py # In-memory DuckDB
test_sync_logic.py
test_auth.py
test_rbac.py
integration/ # Docker compose, real DuckDB + sample data
test_api_endpoints.py
test_sync_flow.py
test_cli_commands.py
fixtures/
sample_data/ # Small parquets for testing
instance.yaml # Test config
```
## 9. Migration Path
1. **Greenfield demo** — build new system from scratch with sample Keboola data
2. **Validate** — end-to-end: setup → sync → query → scripts → notifications
3. **Migrate internal** — point new system at Keboola internal, migrate users
4. **Migrate Groupon** — deploy new system for Groupon with their config
5. **Deprecate old** — remove old server infrastructure
## 10. Reused Code
| File | Status | Notes |
|------|--------|-------|
| `src/config.py` | Reused as-is | TableConfig, Config parsing |
| `src/parquet_manager.py` | Reused as-is | Parquet conversion |
| `connectors/keboola/` | Reused as-is | Keboola adapter + client |
| `connectors/bigquery/` | Reused as-is | BigQuery adapter + client |
| `connectors/jira/` | Reused as-is | Jira connector |
| `connectors/llm/` | Reused as-is | LLM abstraction |
| `connectors/openmetadata/` | Reused as-is | Catalog enrichment |
| `src/data_sync.py` | Rewired | SyncState → DuckDB repository |
| `src/remote_query.py` | Wrapped | Query logic wrapped by API endpoint |
| `src/profiler.py` | Rewired | Output to DuckDB instead of JSON |
| `src/table_registry.py` | Rewired | JSON → DuckDB repository |
| `webapp/corporate_memory_service.py` | Rewired | Business logic preserved, I/O swapped |
| `webapp/templates/` | Migrated | Jinja2 templates work in FastAPI |
| `auth/` | Migrated | Provider pattern preserved |
## 11. Deleted Code
| File | Reason |
|------|--------|
| `server/setup.sh` | Replaced by Docker |
| `server/webapp-setup.sh` | Replaced by Docker + Kamal |
| `server/deploy.sh` | Replaced by Kamal |
| `server/sudoers-*` | No more Linux user management |
| `server/bin/add-analyst` | Replaced by API + CLI |
| `scripts/sync_data.sh` | Replaced by `da sync` |
| `services/*/systemd/` | Replaced by Docker Compose |
| `webapp/user_service.py` | Rewritten for DB-based users |
| `webapp/sync_settings_service.py` (sudo parts) | Replaced by API |