ZdenekSrotyr 9fef90a729 docs: rewrite CLAUDE.md for extract.duckdb architecture

Update project structure, architecture diagram, key implementation
details, development commands, and extensibility docs.
Add extract service to docker-compose.yml for one-shot extraction.

2026-03-31 07:52:44 +02:00

7.6 KiB


AI Data Analyst

Open-source data distribution platform for AI analytical systems. It extracts data from sources into DuckDB, serves it via FastAPI, and distributes Parquet files to analysts who use Claude Code for local analysis.

First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

Step 1: Gather Information

Ask the user for:

Company domain (e.g., "acme.com") - used for Google OAuth
Data source type: keboola / bigquery / csv
Instance name (e.g., "Acme Data Analyst")

Step 2: Generate Configuration

Copy config/instance.yaml.example to config/instance.yaml
Fill in values from Step 1
If Keboola: ask for Storage API token, stack URL, project ID
Create .env from config/.env.template
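
The copy steps above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: the helper name init_config is hypothetical, and it assumes the two template paths named above exist relative to the project root. Existing targets are left untouched so that filled-in values are never overwritten.

```python
import shutil
from pathlib import Path

# Template -> target pairs taken from the steps above.
TEMPLATES = {
    "config/instance.yaml.example": "config/instance.yaml",
    "config/.env.template": ".env",
}


def init_config(root: Path) -> list[str]:
    """Copy each template to its target unless the target already exists.

    Returns the relative paths of the files that were created.
    (Hypothetical helper, for illustration only.)
    """
    created = []
    for template, target in TEMPLATES.items():
        src, dst = root / template, root / target
        if src.exists() and not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dst)
            created.append(target)
    return created
```

Calling init_config(Path(".")) from the project root creates both files on the first run and returns an empty list on later runs.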

Step 3: Register Tables

Use the FastAPI admin API (POST /api/admin/tables/{id}) or webapp UI to register tables
Tables are stored in DuckDB table_registry with source_type, bucket, source_table, query_mode
For migration from old format: python scripts/migrate_registry_to_duckdb.py
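
A registration call might look like the sketch below, which uses only the Python standard library. The payload field names come from the table_registry columns listed above, but the exact request schema, the example table id, the field values, and any authentication header are assumptions; check the admin API before relying on them.

```python
import json
import urllib.request


def build_register_request(base_url: str, table_id: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/admin/tables/{id}."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/admin/tables/{table_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative placeholder values only; the body schema is an assumption.
request = build_register_request(
    "http://localhost:8000",
    "orders",
    {
        "source_type": "keboola",
        "bucket": "in.c-sales",
        "source_table": "orders",
        "query_mode": "local",
    },
)

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```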

Step 4: Docker Deployment

docker compose up          # Start app + scheduler
docker compose --profile full up  # Include telegram bot

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── auth/                   # Authentication providers (google, email, password, desktop)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── webapp/                 # Legacy Flask web portal
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (704 tests)

Architecture: extract.duckdb Contract

Every data source produces the same output:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The SyncOrchestrator scans /data/extracts/*/extract.duckdb, ATTACHes each into master analytics.duckdb, and creates views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI       Webapp
          (serve)    (da sync)   (dashboard)

Three source types:

Batch pull (Keboola): the DuckDB Keboola extension downloads tables to Parquet on a schedule
Remote attach (BigQuery): the DuckDB BQ extension exposes views with no download; queries run in BigQuery
Real-time push (Jira): webhooks update the Parquet files incrementally
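
The scan-and-attach step described in this section can be sketched as functions that generate the SQL an orchestrator would run. This is an illustration under stated assumptions, not the code in src/orchestrator.py: the function names, the use of the source directory name as the database alias, and the READ_ONLY option are hypothetical choices. The ATTACH and CREATE OR REPLACE VIEW statements themselves are standard DuckDB SQL.

```python
from pathlib import Path


def attach_statements(extracts_dir: Path) -> list[str]:
    """Return one ATTACH statement per {source_name}/extract.duckdb found.

    Assumption: the source directory name doubles as the database alias.
    """
    statements = []
    for db_path in sorted(extracts_dir.glob("*/extract.duckdb")):
        alias = db_path.parent.name
        statements.append(f"ATTACH '{db_path.as_posix()}' AS \"{alias}\" (READ_ONLY)")
    return statements


def view_statement(alias: str, table_name: str) -> str:
    """Return SQL for a master view that forwards to a table or view in an attached extract."""
    return (
        f'CREATE OR REPLACE VIEW "{table_name}" AS '
        f'SELECT * FROM "{alias}"."{table_name}"'
    )
```

Each statement could then be executed on a connection to the master database, with the table names passed to view_statement read from each extract's _meta table.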

Configuration

Instance-specific config: config/instance.yaml (see config/instance.yaml.example)
Environment variables: .env (never committed)
Table definitions: DuckDB table_registry table in system.duckdb

Development

# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run FastAPI locally
uvicorn app.main:app --reload

# Run legacy Flask webapp
flask --app webapp.app run --debug

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up

Extensibility

Data Sources (extract.duckdb contract)

To add a connector, create connectors/<name>/extractor.py that writes extract.duckdb (plus data/ for local sources) under /data/extracts/<name>/. The extract must contain a _meta table with the columns table_name, description, rows, size_bytes, extracted_at, and query_mode. The orchestrator ATTACHes it automatically.
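
The _meta requirement could be met with something like the sketch below. The column names come from the contract above; the column types, the function names, and the example values are assumptions. The functions accept any connection object with an execute(sql, params) method, such as the one returned by duckdb.connect().

```python
from datetime import datetime, timezone

# Column names follow the contract; the types are assumptions.
# "rows" is quoted because ROWS is also an SQL keyword.
META_DDL = """
CREATE TABLE IF NOT EXISTS _meta (
    table_name   VARCHAR,
    description  VARCHAR,
    "rows"       BIGINT,
    size_bytes   BIGINT,
    extracted_at TIMESTAMP,
    query_mode   VARCHAR
)
"""


def ensure_meta(con) -> None:
    """Create the _meta table if it does not exist yet."""
    con.execute(META_DDL)


def record_table(con, table_name, description, rows, size_bytes, query_mode) -> None:
    """Replace the _meta row for one table, stamped with the current UTC time."""
    # The timestamp is passed as text and relies on the database casting it.
    extracted_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    con.execute("DELETE FROM _meta WHERE table_name = ?", [table_name])
    con.execute(
        "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)",
        [table_name, description, rows, size_bytes, extracted_at, query_mode],
    )


# Example with DuckDB (requires the duckdb package; values are placeholders):
# import duckdb
# con = duckdb.connect("/data/extracts/my_source/extract.duckdb")
# ensure_meta(con)
# record_table(con, "orders", "Order headers", 1200, 48_000, "local")
```

Deleting before inserting keeps one row per table, so a repeated extraction updates the entry instead of duplicating it.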

Authentication

Pluggable auth providers in auth/:

Google (google): OAuth via Google
Email (email): Email magic link (itsdangerous token)
Password (password): Username/password
Desktop (desktop): JWT for API
New provider = auth/<name>/provider.py implementing AuthProvider

Key Implementation Details

DuckDB Schema (src/db.py)

Schema v2 with auto-migration from v1
table_registry: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
sync_state, sync_history: track extraction progress
users, dataset_permissions, audit_log: auth + RBAC
System DB at {DATA_DIR}/state/system.duckdb
Analytics DB at {DATA_DIR}/analytics/server.duckdb

SyncOrchestrator (src/orchestrator.py)

rebuild(): scans extracts dir, ATTACHes all, creates master views, updates sync_state
rebuild_source(name): single source (used after Jira webhooks)
Thread-safe via _rebuild_lock
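
The locking behaviour could look roughly like this. It is a sketch of the general pattern only: the class below is hypothetical, stands in for the real SyncOrchestrator, and replaces the actual ATTACH work with a placeholder. The counters exist purely to show that rebuilds never overlap.

```python
import threading


class RebuildGuard:
    """Hypothetical stand-in showing how a _rebuild_lock serialises rebuilds."""

    def __init__(self) -> None:
        self._rebuild_lock = threading.Lock()
        self.active = 0       # rebuilds currently inside the critical section
        self.max_active = 0   # highest concurrency ever observed
        self.completed = 0

    def rebuild(self) -> None:
        # Only one thread at a time may pass this point.
        with self._rebuild_lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            self._do_rebuild()
            self.active -= 1
            self.completed += 1

    def _do_rebuild(self) -> None:
        """Placeholder for scanning extracts, ATTACHing, and creating views."""


# Simulate several triggers (scheduler, webhooks) arriving at once.
guard = RebuildGuard()
threads = [threading.Thread(target=guard.rebuild) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, a webhook-triggered rebuild that arrives during a scheduled one waits for it to finish instead of running alongside it.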

Connector Pattern

Keboola: connectors/keboola/extractor.py uses DuckDB Keboola extension, fallback to client.py
BigQuery: connectors/bigquery/extractor.py uses DuckDB BQ extension (remote-only, no download)
Jira: connectors/jira/webhook.py → incremental_transform.py → extract_init.py updates _meta
connectors/keboola/client.py: legacy Keboola Storage API wrapper (kept as fallback)

Config Loading

config/loader.py loads instance.yaml
app/instance_config.py exposes get_data_source_type(), get_value()
Table config lives in DuckDB table_registry (not markdown files)

Files NOT to modify (stable infrastructure)

connectors/jira/file_lock.py - Advisory file locking
connectors/jira/transform.py - Core Jira transform logic
services/ws_gateway/ - WebSocket notification gateway

Git Commits & Pull Requests

Keep commit messages clean and concise
Do not include AI attribution in commits or PRs
