# AI Data Analyst

Open-source data distribution platform for AI analytical systems. It extracts data from sources into DuckDB, serves it via FastAPI, and distributes Parquet files to analysts who use Claude Code for local analysis.

## First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

### Step 1: Gather Information
Ask the user for:
1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / bigquery / csv
3. Instance name (e.g., "Acme Data Analyst")

### Step 2: Generate Configuration
1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`

### Step 3: Register Tables
1. Use the FastAPI admin API (`POST /api/admin/tables/{id}`) or the webapp UI to register tables
2. Tables are stored in the DuckDB `table_registry` table with `source_type`, `bucket`, `source_table`, `query_mode`
3. To migrate from the old format: `python scripts/migrate_registry_to_duckdb.py`

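As a rough illustration of the admin API call in Step 3, the sketch below builds the `POST /api/admin/tables/{id}` request without sending it. The JSON body shape, the example values (`"orders"`, `"in.c-sales"`, `"local"`), the base URL, and the absence of auth headers are all assumptions for illustration; only the route and the four field names come from this document.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn default port


def build_register_request(
    table_id: str,
    source_type: str,
    bucket: str,
    source_table: str,
    query_mode: str,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/admin/tables/{id}."""
    # Assumed payload: the registry field names listed in Step 3.
    payload = {
        "source_type": source_type,
        "bucket": bucket,
        "source_table": source_table,
        "query_mode": query_mode,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/api/admin/tables/{table_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical example values, not taken from a real instance.
req = build_register_request("orders", "keboola", "in.c-sales", "orders", "local")
```

Once the server is running, the request could be sent with `urllib.request.urlopen(req)`, after adding whatever admin credentials the instance requires.
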
### Step 4: Docker Deployment
```bash
docker compose up                  # Start app + scheduler
docker compose --profile full up   # Include telegram bot
```

## Project Structure

```
├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator: ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── auth/                   # Authentication providers (google, email, password, desktop)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── webapp/                 # Legacy Flask web portal
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (704 tests)
```

## Architecture: extract.duckdb Contract

Every data source produces the same output:
```
/data/extracts/{source_name}/
├── extract.duckdb   ← _meta table + views
└── data/            ← parquet files (local sources only)
```

The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each file into the master `analytics.duckdb`, and creates views.

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
 extract.duckdb    extract.duckdb    extract.duckdb
 + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
            SyncOrchestrator.rebuild()
     ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI       Webapp
           (serve)   (da sync)  (dashboard)
```

Three source types:
- **Batch pull** (Keboola): DuckDB extension downloads to parquet on a schedule
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
- **Real-time push** (Jira): Webhooks update parquets incrementally

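The scan-and-ATTACH step described in this section can be sketched roughly as follows. This is not the code in `src/orchestrator.py`: the function name `build_attach_statements`, the use of the source directory name as the ATTACH alias, and the `READ_ONLY` option are assumptions. The sketch only generates the SQL so that it runs without a DuckDB installation.

```python
from pathlib import Path


def build_attach_statements(extracts_dir: Path) -> list[str]:
    """Return one DuckDB ATTACH statement per extract.duckdb under extracts_dir."""
    statements = []
    # Each source lives in its own subdirectory: {extracts_dir}/{source_name}/extract.duckdb
    for db_path in sorted(extracts_dir.glob("*/extract.duckdb")):
        source_name = db_path.parent.name  # assumption: directory name doubles as the alias
        statements.append(
            f"ATTACH '{db_path.as_posix()}' AS \"{source_name}\" (READ_ONLY)"
        )
    return statements
```

Each returned statement could then be executed on a connection to the master database, after which views pointing at the attached tables can be created.
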
## Configuration

Instance-specific config: `config/instance.yaml` (see example).
Environment variables: `.env` (never committed).
Table definitions: DuckDB `table_registry` table in `system.duckdb`.

## Development

```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Run FastAPI locally
uvicorn app.main:app --reload

# Run legacy Flask webapp
flask --app webapp.app run --debug

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up
```

## Extensibility

### Data Sources (extract.duckdb contract)
New connector = `connectors/<name>/extractor.py` producing `extract.duckdb` + `data/`.
It must create a `_meta` table with columns: `table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`.
The orchestrator ATTACHes it automatically.

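A minimal sketch of the `_meta` table a new connector would need to create is shown below. The column names come from the contract above; the column types, the helper names `meta_ddl` and `meta_insert_sql`, and the decision to quote identifiers (since `rows` can clash with an SQL keyword) are assumptions.

```python
# Column names follow the contract; the SQL types are assumptions.
META_COLUMNS = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}


def meta_ddl() -> str:
    """Return a CREATE TABLE statement for _meta with quoted column names."""
    cols = ", ".join(f'"{name}" {sql_type}' for name, sql_type in META_COLUMNS.items())
    return f"CREATE TABLE IF NOT EXISTS _meta ({cols})"


def meta_insert_sql() -> str:
    """Return a parameterized INSERT with one placeholder per _meta column."""
    placeholders = ", ".join("?" for _ in META_COLUMNS)
    return f"INSERT INTO _meta VALUES ({placeholders})"
```

An extractor could run both statements against its own `extract.duckdb`, passing one parameter tuple per extracted table.
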
### Authentication
Pluggable auth providers in `auth/`:
- **Google** (`google`): OAuth via Google
- **Email** (`email`): Email magic link (itsdangerous token)
- **Password** (`password`): Username/password
- **Desktop** (`desktop`): JWT for API
- New provider = `auth/<name>/provider.py` implementing `AuthProvider`

## Key Implementation Details

### DuckDB Schema (src/db.py)
- Schema v2 with auto-migration from v1
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- `sync_state`, `sync_history`: track extraction progress
- `users`, `dataset_permissions`, `audit_log`: auth + RBAC
- System DB at `{DATA_DIR}/state/system.duckdb`
- Analytics DB at `{DATA_DIR}/analytics/server.duckdb`

### SyncOrchestrator (src/orchestrator.py)
- `rebuild()`: scans the extracts dir, ATTACHes all sources, creates master views, updates `sync_state`
- `rebuild_source(name)`: rebuilds a single source (used after Jira webhooks)
- Thread-safe via `_rebuild_lock`

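The thread-safety note above can be illustrated with a small lock-guarded sketch. The class `OrchestratorSketch` and its `rebuild_count` counter are hypothetical stand-ins, and the choice of `threading.Lock` is an assumption; only the attribute name `_rebuild_lock` comes from this document.

```python
import threading


class OrchestratorSketch:
    """Hypothetical stand-in showing how a lock can serialize rebuilds."""

    def __init__(self) -> None:
        self._rebuild_lock = threading.Lock()
        self.rebuild_count = 0

    def rebuild(self) -> None:
        # Only one thread at a time may run the body below.
        with self._rebuild_lock:
            # A real rebuild would scan, ATTACH, and recreate views here.
            self.rebuild_count += 1


sketch = OrchestratorSketch()
threads = [
    threading.Thread(target=lambda: [sketch.rebuild() for _ in range(100)])
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Serializing rebuilds this way would keep a scheduler-triggered rebuild from overlapping with one triggered by a webhook or the API.
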
### Connector Pattern
- **Keboola**: `connectors/keboola/extractor.py` uses DuckDB Keboola extension, fallback to `client.py`
- **BigQuery**: `connectors/bigquery/extractor.py` uses DuckDB BQ extension (remote-only, no download)
- **Jira**: `connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py` updates `_meta`
- `connectors/keboola/client.py`: legacy Keboola Storage API wrapper (kept as fallback)

### Config Loading
1. `config/loader.py` loads `instance.yaml`
2. `app/instance_config.py` exposes `get_data_source_type()`, `get_value()`
3. Table config lives in DuckDB `table_registry` (not markdown files)

### Files NOT to modify (stable infrastructure)
- `connectors/jira/file_lock.py` - Advisory file locking
- `connectors/jira/transform.py` - Core Jira transform logic
- `services/ws_gateway/` - WebSocket notification gateway

## Git Commits & Pull Requests

- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs