AI Data Analyst
Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.
First-Time Setup
When a user opens this project for the first time, guide them through interactive setup:
Step 1: Gather Information
Ask the user for:
- Company domain (e.g., "acme.com") - used for Google OAuth
- Data source type: keboola / bigquery / csv
- Instance name (e.g., "Acme Data Analyst")
Step 2: Generate Configuration
- Copy config/instance.yaml.example to config/instance.yaml
- Fill in values from Step 1
- If Keboola: ask for Storage API token, stack URL, project ID
- Create .env from config/.env.template
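The two copy steps above can be scripted. The following is a minimal sketch, assuming the templates live at the paths named in this step; the helper names `copy_if_missing` and `generate_config` are hypothetical and not part of the project. It never overwrites an existing file, so rerunning it after the values have been filled in is harmless.

```python
import shutil
from pathlib import Path


def copy_if_missing(template: Path, target: Path) -> bool:
    """Copy template to target unless target already exists.

    Returns True when a copy was made, False when target was left untouched.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, target)
    return True


def generate_config(project_root: Path) -> dict[str, bool]:
    """Create config/instance.yaml and .env from their templates.

    The returned dict reports, per target file, whether it was created.
    """
    config_dir = project_root / "config"
    return {
        "config/instance.yaml": copy_if_missing(
            config_dir / "instance.yaml.example", config_dir / "instance.yaml"
        ),
        ".env": copy_if_missing(config_dir / ".env.template", project_root / ".env"),
    }
```

Calling `generate_config(Path("."))` from the repository root creates both files; the values from Step 1 still have to be filled in afterwards.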
Step 3: Register Tables
- Use the FastAPI admin API (POST /api/admin/tables/{id}) or webapp UI to register tables
- Tables are stored in DuckDB table_registry with source_type, bucket, source_table, query_mode
- For migration from old format: python scripts/migrate_registry_to_duckdb.py
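A registration call against the admin endpoint might look like the sketch below. Only the route and the four field names come from this document; the JSON body shape, the bearer-token header, the example field values, and the helper name `build_register_request` are assumptions, so check the admin router before relying on them. The function only builds the request and does not send it.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def build_register_request(
    base_url: str, table_id: str, fields: dict, token: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/admin/tables/{id}."""
    # Percent-encode the id so it is safe to place in the URL path.
    safe_id = urllib.parse.quote(table_id, safe="")
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/admin/tables/{safe_id}",
        data=json.dumps(fields).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    if token:
        # Assumption: the admin API accepts a bearer token.
        request.add_header("Authorization", f"Bearer {token}")
    return request


# Hypothetical values; the field names match the table_registry columns.
example = build_register_request(
    "http://localhost:8000",
    "orders",
    {
        "source_type": "keboola",
        "bucket": "in.c-sales",
        "source_table": "orders",
        "query_mode": "local",
    },
)
```

Sending it is then a matter of `urllib.request.urlopen(example)` against a running instance.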
Step 4: Docker Deployment
docker compose up # Start app + scheduler
docker compose --profile full up # Include telegram bot
Project Structure
├── src/ # Core engine
│ ├── db.py # DuckDB schema (system.duckdb, analytics.duckdb)
│ ├── orchestrator.py # SyncOrchestrator — ATTACHes extract.duckdb files
│ ├── repositories/ # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│ ├── profiler.py # Data profiling
│ └── catalog_export.py # OpenMetadata catalog export
├── app/ # FastAPI application
│ ├── main.py # App setup, router registration
│ ├── api/ # REST API (sync, data, catalog, admin, auth)
│ └── web/ # HTML dashboard routes
├── connectors/ # Data source connectors (extract.duckdb contract)
│ ├── keboola/ # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│ ├── bigquery/ # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│ └── jira/ # Jira: webhook + incremental parquet → extract.duckdb
├── cli/ # CLI tool (`da sync`, `da query`, `da admin`)
├── auth/ # Authentication providers (google, email, password, desktop)
├── services/ # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── webapp/ # Legacy Flask web portal
├── server/ # Legacy deployment infrastructure
├── scripts/ # Utility + migration scripts
├── config/ # Configuration templates (instance.yaml.example)
├── docs/ # Documentation + metric YAML definitions
└── tests/ # Test suite (704 tests)
Architecture: extract.duckdb Contract
Every data source produces the same output:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
The SyncOrchestrator scans /data/extracts/*/extract.duckdb, ATTACHes each into master analytics.duckdb, and creates views.
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
extract.duckdb    extract.duckdb    extract.duckdb
+ data/*.parquet   (views → BQ)     + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
            SyncOrchestrator.rebuild()
     ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI       Webapp
           (serve)   (da sync)  (dashboard)
Three source types:
- Batch pull (Keboola): DuckDB extension downloads to parquet, scheduled
- Remote attach (BigQuery): DuckDB BQ extension, no download, queries go to BQ
- Real-time push (Jira): Webhooks update parquets incrementally
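The discovery step of the orchestrator can be illustrated in a few lines. This is a sketch of the scan described above, not the code in src/orchestrator.py: `find_extracts` and `attach_statements` are hypothetical helpers, and attaching read-only is an assumption. It only builds the ATTACH statements as strings, so it runs without DuckDB installed.

```python
from pathlib import Path


def find_extracts(extracts_dir: Path) -> dict[str, Path]:
    """Map each source name to the path of its extract.duckdb file.

    Directories without an extract.duckdb are skipped.
    """
    return {
        db_path.parent.name: db_path
        for db_path in sorted(extracts_dir.glob("*/extract.duckdb"))
    }


def attach_statements(extracts: dict[str, Path]) -> list[str]:
    """Build one DuckDB ATTACH statement per discovered source.

    Assumption: sources are attached read-only under their directory name.
    Paths containing a single quote would need escaping first.
    """
    return [
        f"ATTACH '{path.as_posix()}' AS \"{name}\" (READ_ONLY)"
        for name, path in extracts.items()
    ]
```

Each statement could then be executed on a connection to the master database before the per-table views are created.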
Configuration
Instance-specific config: config/instance.yaml (see example).
Environment variables: .env (never committed).
Table definitions: DuckDB table_registry table in system.duckdb.
Development
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Run FastAPI locally
uvicorn app.main:app --reload
# Run legacy Flask webapp
flask --app webapp.app run --debug
# Run tests
pytest tests/ -v
# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger
# Docker
docker compose up
Extensibility
Data Sources (extract.duckdb contract)
New connector = connectors/<name>/extractor.py producing extract.duckdb + data/.
Must create _meta table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode.
Orchestrator ATTACHes it automatically.
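A new connector can check its own output against the contract before the orchestrator sees it. The sketch below lists the required _meta columns from the line above; the column types in `META_DDL` and the helper `missing_meta_columns` are assumptions, since the contract names the columns but not their types.

```python
from typing import Iterable

# Column names required by the extract.duckdb contract.
REQUIRED_META_COLUMNS = (
    "table_name",
    "description",
    "rows",
    "size_bytes",
    "extracted_at",
    "query_mode",
)

# Assumed types; "rows" is quoted because ROWS is also an SQL keyword.
META_DDL = """
CREATE TABLE IF NOT EXISTS _meta (
    table_name   VARCHAR,
    description  VARCHAR,
    "rows"       BIGINT,
    size_bytes   BIGINT,
    extracted_at TIMESTAMP,
    query_mode   VARCHAR
)
"""


def missing_meta_columns(actual_columns: Iterable[str]) -> list[str]:
    """Return the required _meta columns absent from actual_columns."""
    present = set(actual_columns)
    return [column for column in REQUIRED_META_COLUMNS if column not in present]
```

An extractor would run `META_DDL` on its extract.duckdb connection, and a test can assert that `missing_meta_columns(...)` returns an empty list for the resulting table.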
Authentication
Pluggable auth providers in auth/:
- Google (google): OAuth via Google
- Email (email): Email magic link (itsdangerous token)
- Password (password): Username/password
- Desktop (desktop): JWT for API
- New provider = auth/<name>/provider.py implementing AuthProvider
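The shape of a new provider might resemble the sketch below. The real AuthProvider interface is not shown in this document, so the abstract method `authenticate`, its signature, and the example `StaticTokenProvider` are all hypothetical; read the existing providers in auth/ for the actual contract.

```python
import hmac
from abc import ABC, abstractmethod
from typing import Optional


class AuthProvider(ABC):
    """Hypothetical interface; the real base class in auth/ may differ."""

    name: str

    @abstractmethod
    def authenticate(self, credentials: dict) -> Optional[str]:
        """Return the user's email on success, or None on failure."""


class StaticTokenProvider(AuthProvider):
    """Illustrative provider that maps pre-shared tokens to user emails."""

    name = "static_token"

    def __init__(self, tokens: dict[str, str]):
        self._tokens = tokens

    def authenticate(self, credentials: dict) -> Optional[str]:
        supplied = str(credentials.get("token", "")).encode("utf-8")
        for token, email in self._tokens.items():
            # compare_digest avoids leaking token contents through timing.
            if hmac.compare_digest(supplied, token.encode("utf-8")):
                return email
        return None
```

Under these assumptions, such a class would live in auth/static_token/provider.py.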
Key Implementation Details
DuckDB Schema (src/db.py)
- Schema v2 with auto-migration from v1
- table_registry: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- sync_state, sync_history: track extraction progress
- users, dataset_permissions, audit_log: auth + RBAC
- System DB at {DATA_DIR}/state/system.duckdb
- Analytics DB at {DATA_DIR}/analytics/server.duckdb
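The two database locations above can be derived from a single data directory. In this sketch, reading DATA_DIR from the environment and defaulting it to /data are assumptions, and `db_paths` is a hypothetical helper rather than a function in src/db.py.

```python
import os
from pathlib import Path
from typing import Optional


def db_paths(data_dir: Optional[str] = None) -> dict[str, Path]:
    """Resolve the system and analytics database paths under DATA_DIR."""
    # Assumption: DATA_DIR comes from the environment and defaults to /data.
    root = Path(data_dir or os.environ.get("DATA_DIR", "/data"))
    return {
        "system": root / "state" / "system.duckdb",
        "analytics": root / "analytics" / "server.duckdb",
    }
```

Passing an explicit directory, for example a temporary one, keeps tests away from the real databases.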
SyncOrchestrator (src/orchestrator.py)
- rebuild(): scans extracts dir, ATTACHes all, creates master views, updates sync_state
- rebuild_source(name): single source (used after Jira webhooks)
- Thread-safe via _rebuild_lock
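The locking pattern behind _rebuild_lock can be sketched as follows. This is an illustration of serializing rebuilds with a mutex, assuming a plain `threading.Lock`; `RebuildCoordinator` is a hypothetical stand-in, and the real SyncOrchestrator may use a different lock type or share the lock between rebuild() and rebuild_source().

```python
import threading
from typing import Any, Callable


class RebuildCoordinator:
    """Hypothetical stand-in showing how a rebuild lock serializes callers."""

    def __init__(self, rebuild_fn: Callable[[], Any]):
        self._rebuild_fn = rebuild_fn
        self._rebuild_lock = threading.Lock()

    def rebuild(self) -> Any:
        # Only one rebuild runs at a time; concurrent callers block here
        # until the current rebuild has finished.
        with self._rebuild_lock:
            return self._rebuild_fn()
```

With this pattern, a scheduled sync and a webhook-triggered rebuild arriving at the same moment run one after the other instead of concurrently.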
Connector Pattern
- Keboola: connectors/keboola/extractor.py uses DuckDB Keboola extension, fallback to client.py
- BigQuery: connectors/bigquery/extractor.py uses DuckDB BQ extension (remote-only, no download)
- Jira: connectors/jira/webhook.py → incremental_transform.py → extract_init.py updates _meta
- connectors/keboola/client.py: legacy Keboola Storage API wrapper (kept as fallback)
Config Loading
- config/loader.py loads instance.yaml
- app/instance_config.py exposes get_data_source_type(), get_value()
- Table config lives in DuckDB table_registry (not markdown files)
Files NOT to modify (stable infrastructure)
- connectors/jira/file_lock.py - Advisory file locking
- connectors/jira/transform.py - Core Jira transform logic
- services/ws_gateway/ - WebSocket notification gateway
Git Commits & Pull Requests
- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs