ZdenekSrotyr 07b396bfe2 docs: add refactoring plan, design spec, and gitignore updates

2026-03-27 15:42:57 +01:00

AI Data Analyst — Refactoring Design Spec

Date: 2026-03-27
Status: Draft
Target: Greenfield demo with Keboola internal data

1. Problem Statement

The platform was built iteratively as an internal tool and needs to become a product for external customers (Groupon, others). Key problems:

Fragile filesystem state: 10+ JSON files, and permission conflicts between processes (www-data, deploy, root, user) cause outages
No API: all operations go through SSH and bash scripts, with no programmatic control
Security via Linux groups: no real RBAC, SSH keys are visible in ps aux, and root reads user home directories
Complex installation: 10+ manual steps, specific OS requirements, and a dual-repo pattern with symlinks
Operations nightmare: scattered scripts and no unified logging or monitoring; the creator calls it a "duct tape solution"

The system is designed for AI agents — humans discuss with AI, AI handles everything (user, admin, dev operations).

Constraint: UX must remain identical. Web catalog, data sync, offline Claude Code analysis, Telegram notifications, corporate memory — all preserved.

2. Architecture

Target State

SERVER (Docker + Kamal):
┌──────────────────────────────────────────────────┐
│          FastAPI Main App (1 process)             │
│  ├── Web UI (Jinja2 templates)                   │
│  ├── REST API (/api/*)                           │
│  ├── WebSocket (/ws/notifications)               │
│  └── Auth (JWT + pluggable providers)            │
└──────────────────────────────────────────────────┘
┌─────────────────┐  ┌──────────────────────────────┐
│ Scheduler sidecar│  │ Telegram bot (optional)      │
│ Calls /api/      │  │ Long-running daemon          │
└─────────────────┘  └──────────────────────────────┘

/data/state/system.duckdb     ← system state (users, sync, knowledge, audit)
/data/analytics/server.duckdb ← views on parquet files
/data/parquet/**              ← data files

LOCAL (analyst):
┌──────────────────────────────────────────────────┐
│  da CLI (uv tool install data-analyst-cli)       │
│  user/duckdb/analytics.duckdb ← views + user tbls│
│  server/parquet/** ← downloaded via da sync      │
│  Claude Code ← works offline with DuckDB         │
└──────────────────────────────────────────────────┘

Key Decisions

Decision	Choice	Rationale
Web framework	FastAPI only (no Flask)	One framework, OpenAPI auto-schema, async native, Jinja2 support
State storage	DuckDB	Already in stack, agent can join state with analytics, better than SQLite for analytical queries
CLI tool	`da` via `uv tool install`	AI-agent native interface, no Docker dependency locally
Server deploy	Docker + Kamal	Zero-downtime deploys, auto-SSL, simple config
Architecture	Hybrid (main app + scheduler sidecar + optional telegram)	3 containers max, WebSocket in main app
Auth providers	All 3 (Google OAuth + Email magic link + Password)	Full compatibility with existing users
LLM provider	Configurable in instance.yaml	User chooses: local Ollama, Anthropic, OpenAI, AI Gateway
Python tooling	uv everywhere (no pip)	Faster, deterministic, modern

3. Data Layer

Server DuckDB: system.duckdb

-- Users & RBAC
CREATE TABLE users (
  id VARCHAR PRIMARY KEY,
  email VARCHAR UNIQUE NOT NULL,
  name VARCHAR,
  role VARCHAR DEFAULT 'analyst',  -- viewer, analyst, admin, km_admin
  password_hash VARCHAR,
  setup_token VARCHAR,
  reset_token VARCHAR,
  created_at TIMESTAMP DEFAULT current_timestamp,
  updated_at TIMESTAMP
);

CREATE TABLE user_sync_settings (
  user_id VARCHAR REFERENCES users(id),
  dataset VARCHAR NOT NULL,
  enabled BOOLEAN DEFAULT false,
  table_mode VARCHAR DEFAULT 'all',  -- all, explicit
  tables JSON,
  updated_at TIMESTAMP,
  PRIMARY KEY (user_id, dataset)
);

CREATE TABLE dataset_permissions (
  user_id VARCHAR REFERENCES users(id),
  dataset VARCHAR NOT NULL,
  access VARCHAR DEFAULT 'read',  -- read, none
  PRIMARY KEY (user_id, dataset)
);

-- Sync state + history
CREATE TABLE sync_state (
  table_id VARCHAR PRIMARY KEY,
  last_sync TIMESTAMP,
  rows BIGINT,
  file_size_bytes BIGINT,
  uncompressed_size_bytes BIGINT,
  columns INTEGER,
  hash VARCHAR,
  status VARCHAR DEFAULT 'ok',
  error TEXT
);

CREATE TABLE sync_history (
  id VARCHAR PRIMARY KEY,
  table_id VARCHAR NOT NULL,
  synced_at TIMESTAMP NOT NULL,
  rows BIGINT,
  duration_ms INTEGER,
  status VARCHAR,
  error TEXT
);

-- Corporate memory
CREATE TABLE knowledge_items (
  id VARCHAR PRIMARY KEY,
  title VARCHAR NOT NULL,
  content TEXT,
  category VARCHAR,
  tags TEXT[],
  status VARCHAR DEFAULT 'pending',  -- pending, approved, mandatory, rejected
  contributors TEXT[],
  source_user VARCHAR,
  audience VARCHAR,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
);

CREATE TABLE knowledge_votes (
  item_id VARCHAR REFERENCES knowledge_items(id),
  user_id VARCHAR REFERENCES users(id),
  vote INTEGER,  -- 1 or -1
  voted_at TIMESTAMP,
  PRIMARY KEY (item_id, user_id)
);

-- Audit
CREATE TABLE audit_log (
  id VARCHAR PRIMARY KEY,
  timestamp TIMESTAMP NOT NULL,
  user_id VARCHAR,
  action VARCHAR NOT NULL,
  resource VARCHAR,
  params JSON,
  result VARCHAR,
  duration_ms INTEGER
);

-- Notifications
CREATE TABLE telegram_links (
  user_id VARCHAR PRIMARY KEY REFERENCES users(id),
  chat_id BIGINT NOT NULL,
  linked_at TIMESTAMP
);

CREATE TABLE pending_codes (
  code VARCHAR PRIMARY KEY,
  chat_id BIGINT NOT NULL,
  created_at TIMESTAMP
);

CREATE TABLE script_registry (
  id VARCHAR PRIMARY KEY,
  name VARCHAR NOT NULL,
  owner VARCHAR REFERENCES users(id),
  schedule VARCHAR,  -- cron expression or null
  source TEXT NOT NULL,
  deployed_at TIMESTAMP,
  last_run TIMESTAMP,
  last_status VARCHAR
);

-- Table registry
CREATE TABLE table_registry (
  id VARCHAR PRIMARY KEY,
  name VARCHAR NOT NULL,
  folder VARCHAR,
  sync_strategy VARCHAR,
  primary_key VARCHAR,
  description TEXT,
  registered_by VARCHAR,
  registered_at TIMESTAMP
);

-- Profiles
CREATE TABLE table_profiles (
  table_id VARCHAR PRIMARY KEY,
  profile JSON NOT NULL,
  profiled_at TIMESTAMP
);

Server DuckDB: server.duckdb

Auto-generated views on parquet files:

CREATE VIEW orders AS SELECT * FROM read_parquet('/data/parquet/sales/orders.parquet');
CREATE VIEW customers AS SELECT * FROM read_parquet('/data/parquet/sales/customers.parquet');
-- Generated from schema.yml by profiler/sync

Local DuckDB: analytics.duckdb

Views on local parquet files (generated by da sync):

CREATE VIEW orders AS SELECT * FROM read_parquet('./server/parquet/sales/orders.parquet');
-- User-created tables survive da sync (rebuild drops only views, not tables)
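
The view rebuild described above can be sketched as a pure function that turns a list of parquet paths into view statements. This is a minimal sketch, not the actual da sync implementation: the function name view_statements is hypothetical, view names are assumed to come from the file stem, and it uses CREATE OR REPLACE VIEW rather than an explicit drop, so views for files that no longer exist would need separate cleanup.

```python
from pathlib import Path
from typing import Iterable


def view_statements(parquet_files: Iterable[Path]) -> list[str]:
    """Return one CREATE OR REPLACE VIEW statement per parquet file.

    The view name is the file stem (sales/orders.parquet -> "orders").
    Name collisions between folders are not handled in this sketch.
    """
    statements = []
    for path in parquet_files:
        view = path.stem.replace('"', '""')          # escape for a quoted identifier
        source = path.as_posix().replace("'", "''")  # escape for a string literal
        statements.append(
            f'CREATE OR REPLACE VIEW "{view}" AS '
            f"SELECT * FROM read_parquet('{source}');"
        )
    return statements


# Example: statements for every parquet file under the local sync folder.
local_files = sorted(Path("server/parquet").rglob("*.parquet"))
for statement in view_statements(local_files):
    print(statement)
```

Because the statements only touch views, tables the analyst created in the same database file are left alone, which matches the intent stated in the comment above.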

Repository Pattern

src/repositories/
  __init__.py          # get_system_db(), get_analytics_db() factories
  users.py             # UserRepository (CRUD + role checks)
  sync_state.py        # SyncStateRepository (state + history)
  knowledge.py         # KnowledgeRepository (items + votes + governance)
  audit.py             # AuditRepository (append + query)
  scripts.py           # ScriptRepository (registry + scheduling)
  table_registry.py    # TableRegistryRepository
  notifications.py     # TelegramRepository + PendingCodeRepository
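
As an illustration of the repository layout above, here is a minimal sketch of what UserRepository might look like. The method names create and get_role are assumptions, not taken from the codebase. The class only relies on execute(sql, params) and fetchone(), which DuckDB connections provide; the demo uses an in-memory sqlite3 database purely as a stand-in so the sketch runs without DuckDB installed.

```python
import sqlite3
import uuid
from typing import Any, Optional


class UserRepository:
    """Thin data-access wrapper around the users table."""

    def __init__(self, conn: Any) -> None:
        # Any connection exposing execute(sql, params) works here.
        self.conn = conn

    def create(self, email: str, name: str, role: str = "analyst") -> str:
        """Insert a user and return the generated id."""
        user_id = uuid.uuid4().hex
        self.conn.execute(
            "INSERT INTO users (id, email, name, role) VALUES (?, ?, ?, ?)",
            [user_id, email, name, role],
        )
        return user_id

    def get_role(self, email: str) -> Optional[str]:
        """Return the user's role, or None if the email is unknown."""
        row = self.conn.execute(
            "SELECT role FROM users WHERE email = ?", [email]
        ).fetchone()
        return row[0] if row else None


# Stand-in database with a reduced version of the users schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id VARCHAR PRIMARY KEY, email VARCHAR UNIQUE NOT NULL,"
    " name VARCHAR, role VARCHAR DEFAULT 'analyst')"
)
repo = UserRepository(conn)
repo.create("ada@example.com", "Ada")
print(repo.get_role("ada@example.com"))
```

Keeping SQL inside repositories like this is what lets the unit tests in section 8 swap in an in-memory database.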

4. API Endpoints

FastAPI Router Structure

app/
  main.py                 # FastAPI app, lifespan events, middleware
  auth/
    router.py             # POST /auth/login, /auth/token, /auth/logout
    jwt.py                # JWT create/verify (PyJWT)
    providers/            # Pluggable: google/, email/, password/
    dependencies.py       # get_current_user, require_role(Role)
  web/
    router.py             # Web UI: GET /, /catalog, /memory, /settings...
    templates/            # Jinja2 (migrated from webapp/templates/)
    static/               # CSS, JS, images
  api/
    sync.py               # GET /api/sync/manifest, POST /api/sync/trigger
    data.py               # GET /api/data/{table}/download
    query.py              # POST /api/query
    scripts.py            # GET/POST /api/scripts, POST /api/scripts/{id}/run
    users.py              # CRUD /api/users
    settings.py           # GET/PUT /api/users/{id}/settings
    memory.py             # CRUD /api/memory, POST /api/memory/{id}/vote
    health.py             # GET /api/health
    upload.py             # POST /api/upload/sessions, /artifacts, /local-md
  ws/
    notifications.py      # WebSocket /ws/notifications

Key Endpoints

Endpoint	Method	Auth	Purpose
`/api/sync/manifest`	GET	JWT (analyst+)	Hash-based manifest of all synced data
`/api/sync/trigger`	POST	JWT (admin)	Trigger data sync from source
`/api/data/{table}/download`	GET	JWT (analyst+)	Stream parquet file (ETag support)
`/api/query`	POST	JWT (analyst+)	Execute SQL against server DuckDB
`/api/scripts`	GET/POST	JWT (analyst+)	List/deploy user scripts
`/api/scripts/{id}/run`	POST	JWT (analyst+)	Execute script in sandbox
`/api/users`	GET/POST/DELETE	JWT (admin)	User management
`/api/memory`	GET/POST/PUT	JWT (analyst+)	Corporate memory CRUD
`/api/health`	GET	none	Structured health check
`/api/upload/sessions`	POST	JWT (analyst+)	Upload Claude session transcripts
`/api/upload/local-md`	POST	JWT (analyst+)	Upload CLAUDE.local.md content

Sync Protocol

1. CLI calls GET /api/sync/manifest → receives hashes per table/asset
2. CLI compares with local ~/.config/da/sync_state.json
3. For each changed table: GET /api/data/{table}/download → streaming to ./server/parquet/
4. Download changed docs, rules, profiles, scripts
5. Upload new sessions, artifacts, CLAUDE.local.md content
6. Rebuild local DuckDB views (preserve user-created tables)
7. Update local sync manifest
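
The hash comparison at the start of this protocol can be sketched as follows. This is a hedged illustration: the function name plan_sync and the flat {table_id: hash} shape are assumptions, since the spec does not define the manifest format. The "stale" list is also an addition for illustration; the spec does not say what happens to local tables that disappear from the server.

```python
def plan_sync(remote: dict[str, str], local: dict[str, str]) -> dict[str, list[str]]:
    """Compare remote and local {table_id: hash} maps.

    download: tables that are new or whose hash changed on the server.
    stale:    tables present locally but no longer listed by the server.
    """
    download = sorted(t for t, h in remote.items() if local.get(t) != h)
    stale = sorted(t for t in local if t not in remote)
    return {"download": download, "stale": stale}


remote_manifest = {"orders": "a1", "customers": "b2", "products": "c3"}
local_state = {"orders": "a1", "customers": "old", "legacy": "zz"}

print(plan_sync(remote_manifest, local_state))
# {'download': ['customers', 'products'], 'stale': ['legacy']}
```

Only tables in the download list need a request to /api/data/{table}/download, so an unchanged dataset costs a single manifest call.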

5. CLI Tool (`da`)

Structure

cli/
  main.py             # Typer app, --server/--json global options
  config.py           # ~/.config/da/ management (token, server URL, sync state)
  client.py           # httpx async client (JWT auth, retry, streaming, progress bars)
  duckdb_local.py     # Local DuckDB management (create views, query, explore)
  commands/
    auth.py           # da login/logout/whoami
    sync.py           # da sync [--table X] [--upload-only] [--docs-only]
    query.py          # da query "SQL" [--remote] [--json] [--format csv/table/json]
    scripts.py        # da scripts list/run/deploy/undeploy
    explore.py        # da explore {table}
    admin.py          # da admin add-user/remove-user/list-users/set-role
    status.py         # da status [--local] [--json]
    server.py         # da server deploy/rollback/logs/status/backup
    setup.py          # da setup init/test-connection/deploy/first-sync/verify
    diagnose.py       # da diagnose [--symptom X] [--component Y]
    skills.py         # da skills list/show
    infra.py          # da infra provision/status/deploy (future)
  skills/             # Markdown knowledge base for AI agents
    setup.md
    troubleshoot.md
    connectors.md
    notifications.md
    corporate-memory.md
    security.md
    backup-restore.md
    upgrade.md

Distribution

[project]
name = "data-analyst-cli"
requires-python = ">=3.11"
dependencies = ["typer>=0.12", "httpx>=0.27", "duckdb>=1.1", "rich>=13", "pyjwt>=2.8"]

[project.scripts]
da = "cli.main:app"

Install: uv tool install data-analyst-cli

Offline Capability

After da sync, everything works without network:

da query → local DuckDB
da scripts run → local Python execution
da explore → local profile data
da status --local → sync timestamps from local manifest

6. Deploy & Infrastructure

Docker

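FROM python:3.13-slim
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
# Install dependencies first so this layer stays cached until the lockfile changes
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-install-project
# Copy the source, then install the project itself
COPY . .
RUN uv sync --frozen --no-dev
CMD ["uv", "run", "uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]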

Docker Compose (dev)

services:
  app:
    build: .
    ports: ["8000:8000"]
    volumes: [".:/app", "data:/data"]
    env_file: .env
    command: uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

  scheduler:
    build: .
    volumes: ["data:/data"]
    env_file: .env
    command: uv run python -m services.scheduler

  telegram-bot:
    build: .
    volumes: ["data:/data"]
    env_file: .env
    command: uv run python -m services.telegram_bot
    profiles: ["full"]

volumes:
  data:

Scheduler Sidecar

The scheduler is a lightweight process that triggers jobs by calling the main app's API:

# services/scheduler/__main__.py
import os

import httpx
from apscheduler.schedulers.blocking import BlockingScheduler

API_URL = os.environ.get("API_URL", "http://app:8000")
API_TOKEN = os.environ.get("SCHEDULER_API_TOKEN")  # internal service token
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

scheduler = BlockingScheduler()

@scheduler.scheduled_job("interval", minutes=15)
def data_refresh():
    # raise_for_status() makes a failed trigger show up in the scheduler log
    httpx.post(f"{API_URL}/api/sync/trigger", headers=HEADERS).raise_for_status()

@scheduler.scheduled_job("interval", minutes=30)
def corporate_memory():
    httpx.post(f"{API_URL}/api/internal/collect-knowledge", headers=HEADERS).raise_for_status()

# ... more jobs
scheduler.start()

This keeps all business logic in the main app. The scheduler is stateless and restartable.

Kamal (production)

Auto-SSL via Kamal Proxy (Let's Encrypt)
Zero-downtime deploy
Healthcheck on /api/health
Staging: kamal deploy -d staging
Production: kamal deploy
Rollback: kamal rollback

CI/CD (GitHub Actions)

push → pytest (unit) → docker compose test (integration) → build+push GHCR
PR → kamal deploy staging
merge main → kamal deploy production

7. Security

RBAC

Role	Permissions
`viewer`	Read catalog, view profiles, browse corporate memory
`analyst`	+ sync data, run queries, vote on knowledge, run/deploy scripts
`admin`	+ manage users, approve knowledge, trigger sync, view audit
`km_admin`	+ corporate memory governance (approve/reject/mandate)

Dataset-level permissions restrict which datasets each user can access.
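
A role check combined with dataset-level permissions might look like the sketch below. Two points are assumptions rather than statements from the spec: km_admin is ranked above admin (the table only says it adds governance permissions), and a user with no dataset_permissions row for a dataset is treated as having no access. The names Role, has_role and can_query are hypothetical.

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles ordered so a higher value includes every lower role's permissions."""

    VIEWER = 1
    ANALYST = 2
    ADMIN = 3
    KM_ADMIN = 4  # assumption: ranked above admin; the spec does not state this


def has_role(user_role: str, required: Role) -> bool:
    """True if user_role is a known role ranking at or above `required`."""
    role = Role.__members__.get(user_role.upper())
    return role is not None and role >= required


def can_query(user_role: str, dataset_access: dict[str, str], dataset: str) -> bool:
    """Require analyst-or-higher AND an explicit 'read' grant on the dataset."""
    # Assumption: a missing permission row means no access.
    return has_role(user_role, Role.ANALYST) and dataset_access.get(dataset) == "read"


access = {"sales": "read", "finance": "none"}
print(can_query("analyst", access, "sales"))    # True
print(can_query("analyst", access, "finance"))  # False
print(can_query("viewer", access, "sales"))     # False
```

In the FastAPI app, a check like has_role would sit behind the require_role(Role) dependency listed in section 4.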

Auth Flow

1. Web: user logs in via Google OAuth / Email magic link / Password
2. Server issues JWT (contains: user_id, email, role, exp)
3. CLI: da login → OAuth browser flow → JWT stored in ~/.config/da/token.json
4. All API calls include JWT in Authorization header
5. FastAPI dependency validates JWT + checks role permissions
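
To make the token contents and the validation step concrete, here is a dependency-free sketch of HS256 signing and verification. The spec names PyJWT for this job, so treat the code as an illustration of what happens under the hood, not as a replacement. The function names create_token and verify_token, the one-hour lifetime and the shared-secret setup are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url, restoring the stripped padding first."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def create_token(claims: dict, secret: bytes, ttl_seconds: int = 3600) -> str:
    """Build header.payload.signature with an exp claim added."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**claims, "exp": int(time.time()) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_token(token: str, secret: bytes) -> dict:
    """Return the payload if signature and expiry are valid, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    # Constant-time comparison avoids leaking signature bytes through timing.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("bad signature")
    payload = json.loads(_b64url_decode(payload_b64))
    if payload["exp"] < time.time():
        raise ValueError("token expired")
    return payload


token = create_token(
    {"user_id": "u1", "email": "ada@example.com", "role": "analyst"}, b"demo-secret"
)
print(verify_token(token, b"demo-secret")["role"])  # analyst
```

The role claim returned by verification is what a role check would then compare against the endpoint's required role.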

Audit Trail

Every API call logged to audit_log table:

timestamp, user_id, action, resource, params, result, duration_ms
Queryable by agent: da query "SELECT * FROM system.audit_log WHERE ..."
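
One way to capture these fields uniformly is a decorator around each handler. The sketch below is an assumption about how this could be done, not the planned implementation: the decorator name audited is hypothetical, and a plain list stands in for the audit_log table.

```python
import functools
import time
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # stand-in for the audit_log table


def audited(action: str):
    """Record one audit row per call, whether it succeeds or raises."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(user_id: str, resource: str, **params):
            started = time.perf_counter()
            result = "error"  # overwritten only if the call returns normally
            try:
                value = func(user_id, resource, **params)
                result = "ok"
                return value
            finally:
                AUDIT_LOG.append(
                    {
                        "timestamp": datetime.now(timezone.utc),
                        "user_id": user_id,
                        "action": action,
                        "resource": resource,
                        "params": params,
                        "result": result,
                        "duration_ms": int((time.perf_counter() - started) * 1000),
                    }
                )

        return wrapper

    return decorator


@audited("sync.trigger")
def trigger_sync(user_id: str, resource: str, **params) -> dict:
    return {"queued": resource}


trigger_sync("u1", "sales.orders", force=True)
print(AUDIT_LOG[-1]["action"], AUDIT_LOG[-1]["result"])  # sync.trigger ok
```

Writing the row in a finally block means failed calls are audited too, which matters when the log is used for diagnosis.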

Script Sandboxing

User scripts run in isolated Docker container:

Read-only DuckDB access
Memory limit: 512MB, time limit: 5min
No network (except notification dispatch)
Whitelisted Python packages: pandas, duckdb, matplotlib, numpy
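
These limits map fairly directly onto docker run flags. The sketch below builds such a command; it is illustrative only. The image name da-sandbox:latest and the container paths are hypothetical, the notification exception to the no-network rule is not modeled, and the package whitelist is assumed to be baked into the image.

```python
import subprocess


def sandbox_command(
    script_path: str, db_path: str, image: str = "da-sandbox:latest"
) -> list[str]:
    """Build a docker run command enforcing the sandbox limits."""
    return [
        "docker", "run", "--rm",
        "--network", "none",   # no network access
        "--memory", "512m",    # memory limit: 512MB
        "--volume", f"{db_path}:/data/analytics.duckdb:ro",   # read-only DuckDB file
        "--volume", f"{script_path}:/sandbox/script.py:ro",
        image,
        "python", "/sandbox/script.py",
    ]


def run_script(script_path: str, db_path: str) -> subprocess.CompletedProcess:
    """Run the script with a 5 minute wall-clock limit.

    The timeout stops the docker client process; a production version
    would also need to stop the container itself.
    """
    return subprocess.run(
        sandbox_command(script_path, db_path),
        capture_output=True,
        text=True,
        timeout=300,
    )


print(" ".join(sandbox_command("/srv/scripts/report.py", "/data/analytics/server.duckdb")))
```

Building the command as a list rather than a shell string avoids quoting problems with user-supplied script paths.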

8. Testing Strategy

tests/
  unit/                   # No I/O, mocked dependencies
    test_repositories.py  # In-memory DuckDB
    test_sync_logic.py
    test_auth.py
    test_rbac.py
  integration/            # Docker compose, real DuckDB + sample data
    test_api_endpoints.py
    test_sync_flow.py
    test_cli_commands.py
  fixtures/
    sample_data/          # Small parquets for testing
    instance.yaml         # Test config

9. Migration Path

Greenfield demo — build new system from scratch with sample Keboola data
Validate — end-to-end: setup → sync → query → scripts → notifications
Migrate internal — point new system at Keboola internal, migrate users
Migrate Groupon — deploy new system for Groupon with their config
Deprecate old — remove old server infrastructure

10. Reused Code

File	Status	Notes
`src/config.py`	Reused as-is	TableConfig, Config parsing
`src/parquet_manager.py`	Reused as-is	Parquet conversion
`connectors/keboola/`	Reused as-is	Keboola adapter + client
`connectors/bigquery/`	Reused as-is	BigQuery adapter + client
`connectors/jira/`	Reused as-is	Jira connector
`connectors/llm/`	Reused as-is	LLM abstraction
`connectors/openmetadata/`	Reused as-is	Catalog enrichment
`src/data_sync.py`	Rewired	SyncState → DuckDB repository
`src/remote_query.py`	Wrapped	Query logic wrapped by API endpoint
`src/profiler.py`	Rewired	Output to DuckDB instead of JSON
`src/table_registry.py`	Rewired	JSON → DuckDB repository
`webapp/corporate_memory_service.py`	Rewired	Business logic preserved, I/O swapped
`webapp/templates/`	Migrated	Jinja2 templates work in FastAPI
`auth/`	Migrated	Provider pattern preserved

11. Deleted Code

File	Reason
`server/setup.sh`	Replaced by Docker
`server/webapp-setup.sh`	Replaced by Docker + Kamal
`server/deploy.sh`	Replaced by Kamal
`server/sudoers-*`	No more Linux user management
`server/bin/add-analyst`	Replaced by API + CLI
`scripts/sync_data.sh`	Replaced by `da sync`
`services/*/systemd/`	Replaced by Docker Compose
`webapp/user_service.py`	Rewritten for DB-based users
`webapp/sync_settings_service.py` (sudo parts)	Replaced by API
