# AI Data Analyst

Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.

## First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

### Step 1: Gather Information

Ask the user for:

1. Company domain (e.g., "acme.com") - used for Google OAuth
2. Data source type: keboola / bigquery / csv
3. Instance name (e.g., "Acme Data Analyst")

### Step 2: Generate Configuration

1. Copy `config/instance.yaml.example` to `config/instance.yaml`
2. Fill in values from Step 1
3. If Keboola: ask for Storage API token, stack URL, project ID
4. Create `.env` from `config/.env.template`

### Step 3: Register Tables

1. Use the FastAPI admin API (`POST /api/admin/tables/{id}`) or webapp UI to register tables
2. Tables are stored in DuckDB `table_registry` with source_type, bucket, source_table, query_mode
3. For migration from old format: `python scripts/migrate_registry_to_duckdb.py`

### Step 4: Docker Deployment

```bash
docker compose up                              # Start app + scheduler
docker compose --profile full up               # Include telegram bot

# HTTPS mode: Caddy + corporate-CA certs at /data/state/certs
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d
```

See `docs/DEPLOYMENT.md` → **TLS** for cert provisioning + `scripts/grpn/agnes-tls-rotate.sh` (daily refetch from `TLS_FULLCHAIN_URL`, `SIGUSR1` reload on diff, no-op when unchanged). The infra repo's `startup.sh` installs this as a systemd timer automatically.

## Project Structure

```
├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator: ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── app/auth/               # Authentication (FastAPI-based providers)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)
```

## Architecture: extract.duckdb Contract

Every data source produces the same output:

```
/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)
```

### Remote table support (`_remote_attach`)

Extractors with remote/passthrough tables (query_mode='remote') include a `_remote_attach` table in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:

```sql
CREATE TABLE _remote_attach (
    alias VARCHAR,       -- DuckDB alias used in views, e.g. 'kbc'
    extension VARCHAR,   -- Extension name, e.g. 'keboola'
    url VARCHAR,         -- Connection URL
    token_env VARCHAR    -- Env-var name holding the auth token (NOT the token itself)
);
```

The orchestrator reads this table, installs/loads the extension, reads the token from the environment, and ATTACHes the external source. Views referencing `kbc."bucket"."table"` then resolve correctly. This mechanism is generic: any connector can use it.
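The re-ATTACH step described above can be sketched in Python as a pure planning function. This is an illustration, not the orchestrator's actual code: `RemoteAttach`, `plan_attach`, the sample URL, and the `KBC_TOKEN` variable name are hypothetical, and the `TOKEN` option in the generated `ATTACH` statement is an assumption, since each DuckDB extension defines its own attach options.

```python
import re
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RemoteAttach:
    """One row of the _remote_attach table (hypothetical helper type)."""
    alias: str
    extension: str
    url: str
    token_env: str


_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _quote(value: str) -> str:
    """Render a SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def plan_attach(row: RemoteAttach, env: Mapping[str, str]) -> list[str]:
    """Build the statements needed to re-ATTACH one remote source.

    The token is looked up by env-var name, so it never has to be
    stored in extract.duckdb itself.
    """
    for name in (row.alias, row.extension):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    token = env.get(row.token_env)
    if not token:
        raise KeyError(f"environment variable {row.token_env} is not set")
    return [
        f"INSTALL {row.extension};",
        f"LOAD {row.extension};",
        # Assumption: the extension accepts a TOKEN option on ATTACH.
        f"ATTACH {_quote(row.url)} AS {row.alias} "
        f"(TYPE {row.extension}, TOKEN {_quote(token)});",
    ]


# Example with a fake environment; real code would pass os.environ.
example_row = RemoteAttach("kbc", "keboola", "https://connection.example.com", "KBC_TOKEN")
statements = plan_attach(example_row, {"KBC_TOKEN": "s3cr'et"})
for statement in statements:
    print(statement)
```

Keeping the planning step free of database calls makes it easy to test; the resulting statements would then be executed on the DuckDB connection that holds the master views.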

The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into master `analytics.duckdb`, and creates views.

```
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │    Jira      │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
  extract.duckdb    extract.duckdb    extract.duckdb
  + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI
           (serve)    (da sync)
```

Three source types:

- **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
- **Real-time push** (Jira): Webhooks update parquets incrementally

## Configuration

- Instance-specific config: `config/instance.yaml` (see example).
- Environment variables: `.env` (never committed).
- Table definitions: DuckDB `table_registry` table in `system.duckdb`.

## Development

```bash
# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"

# Run FastAPI locally
uvicorn app.main:app --reload

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up
```

## Business Metrics

Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:

```bash
da metrics import docs/metrics/
```

### For AI agents analyzing data:

Before computing any business metric, look up the canonical definition:

1. `da metrics list`: find the relevant metric
2. `da metrics show revenue/mrr`: read the SQL and business rules
3. Use the SQL from the metric definition, adapt to the specific question

Never invent metric calculations; always use the canonical definitions.

## Hybrid Queries (BigQuery + Local)

For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:

```bash
da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
  --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"
```

The `--register-bq` flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple `--register-bq` flags can be used for multiple BQ sources.

For complex SQL, use stdin mode:

```bash
echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | da query --stdin
```

## Extensibility

### Data Sources (extract.duckdb contract)

New connector = `connectors//extractor.py` producing `extract.duckdb + data/`. Must create `_meta` table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode. Orchestrator ATTACHes it automatically.

### Authentication

Auth providers in `app/auth/` (FastAPI-based):

- **Google**: OAuth via Google
- **Email**: Email magic link (itsdangerous token)
- **Desktop**: JWT for API

## Key Implementation Details

### DuckDB Schema (src/db.py)

- Schema v7 with auto-migration from v1→v2→v3→v4→v5→v6→v7 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`)
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- `sync_state`, `sync_history`: track extraction progress
- `users`, `dataset_permissions`, `audit_log`: auth + RBAC
- System DB at `{DATA_DIR}/state/system.duckdb`
- Analytics DB at `{DATA_DIR}/analytics/server.duckdb`

### SyncOrchestrator (src/orchestrator.py)

- `rebuild()`: scans extracts dir, ATTACHes all, creates master views, updates sync_state
- `rebuild_source(name)`: single source (used after Jira webhooks)
- Thread-safe via `_rebuild_lock`

### Connector Pattern

- **Keboola**: `connectors/keboola/extractor.py` uses DuckDB Keboola extension, fallback to `client.py`
- **BigQuery**: `connectors/bigquery/extractor.py` uses DuckDB BQ extension (remote-only, no download)
- **Jira**: `connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py` updates `_meta`
- `connectors/keboola/client.py`: legacy Keboola Storage API wrapper (kept as fallback)

### Config Loading

1. `config/loader.py` loads `instance.yaml`
2. `app/instance_config.py` exposes `get_data_source_type()`, `get_value()`
3. Table config lives in DuckDB `table_registry` (not markdown files)

### Files NOT to modify (stable infrastructure)

- `connectors/jira/file_lock.py` - Advisory file locking
- `connectors/jira/transform.py` - Core Jira transform logic
- `services/ws_gateway/` - WebSocket notification gateway

## Vendor-agnostic OSS: no customer-specific content

This repo is the public OSS distribution. **Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies.** That includes:

- Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
- Cloud project IDs, internal hostnames, runbook paths from a particular install (`/opt/`, `.`, `prj--…`, internal SA emails).
- Cross-references to private repos (`/#NN`). Describe the integration in generic terms or link to public examples instead.

When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (`example.com`, ``, ``). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples, not hard-coded assumptions.

Customer-specific automation, hostnames, and identities live in private infra repos that *consume* this OSS. The OSS describes capabilities, defaults, and configuration knobs, not how a specific operator wired them up.

## Git Commits & Pull Requests

- Keep commit messages clean and concise
- Do not include AI attribution in commits or PRs
- Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (`grep -niE '||...'`). If anything matches, generalize or remove it.
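
The pre-PR scan in the last item can also be scripted. The sketch below is a minimal Python stand-in for the case-insensitive, line-numbered `grep` check; `find_flagged_lines` is a hypothetical helper, the tokens in `SAMPLE_TOKENS` are invented placeholders, and tokens are matched as literal strings rather than as regular expressions.

```python
import re
from typing import Iterable


def find_flagged_lines(text: str, tokens: Iterable[str]) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs containing any token, ignoring case."""
    escaped = [re.escape(token) for token in tokens if token]
    if not escaped:
        # An empty alternation would match every line, so bail out early.
        return []
    pattern = re.compile("|".join(escaped), re.IGNORECASE)
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical tokens for illustration only; substitute the private names
# that must never appear in this repository.
SAMPLE_TOKENS = ["internal-vm-01", "private-org"]

sample_diff = """\
+ host = "example.com"
+ # deployed on Internal-VM-01
+ owner = "private-org/infra"
"""

hits = find_flagged_lines(sample_diff, SAMPLE_TOKENS)
for number, line in hits:
    print(f"{number}: {line}")
```

In practice the input text would be the output of `git diff` plus the PR body, and any hit means the change should be generalized before it is pushed.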