agnes-the-ai-analyst/CLAUDE.md
ZdenekSrotyr 1381770057
fix(auth): uvicorn --proxy-headers + Google OAuth doc + vendor-agnostic OSS rule in CLAUDE.md (#39)
* fix(compose): pass --proxy-headers to uvicorn so OAuth callbacks resolve to https

When the app runs behind a reverse proxy (Caddy, nginx, Cloudflare Tunnel),
uvicorn's default policy of trusting X-Forwarded-* only from 127.0.0.1 means
the request the container sees still looks like http://localhost:8000/...,
even when the user is on https://. The OAuth provider then sends Google a
callback URL Google has never seen — Error 400: redirect_uri_mismatch.

--proxy-headers + --forwarded-allow-ips '*' tell uvicorn to honor those
headers from any source. The container only ever sees its own docker network
anyway; trusting it everywhere is safe in this deployment shape.

Adds docs/auth-google-oauth.md with the full operator gotcha list — env
vars that have to be set, instance.yaml fields that silently fall back to
defaults, and the DB workaround for ad-hoc role promotion when
SEED_ADMIN_EMAIL was missed on first boot.

* docs(claude): codify vendor-agnostic OSS rule for AI agents and humans

Adds a "Vendor-agnostic OSS" section to CLAUDE.md spelling out what cannot
land in this repo (specific deployments, internal hostnames/projects, cross-
references to private repos, customer-specific paths) and how to phrase
abstractions instead. Plus a pre-PR grep checklist in the existing "Git
Commits & Pull Requests" section.

This trips up agents and humans alike — the previous version of #39 had
private-deployment references in the body and a customer domain in a doc
example. Surfacing the rule once in the file every Claude/Cursor/Aider
session reads should prevent that on the next PR.

* docs(oauth): cover DOMAIN + SERVER_URL env vars introduced by PR #48

PR #48 (merged) added DOMAIN-gated Secure cookie in google.py and
documented SERVER_URL in .env.template, but this operator doc was
drafted before that merge and didn't reference either variable.
Adding both to the env table and extending the common-failure-modes
table with a sticky-cookie / redirect-URI-mismatch entry that
references SERVER_URL as the host-header-independent fix. Also
aligns the compose command snippet with the `='*'` syntax that
actually ships on main post-PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Vojtech Rysanek <vrysanek@groupon.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:07:33 +00:00

11 KiB

AI Data Analyst

Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.

First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

Step 1: Gather Information

Ask the user for:

  1. Company domain (e.g., "acme.com") - used for Google OAuth
  2. Data source type: keboola / bigquery / csv
  3. Instance name (e.g., "Acme Data Analyst")

Step 2: Generate Configuration

  1. Copy config/instance.yaml.example to config/instance.yaml
  2. Fill in values from Step 1
  3. If Keboola: ask for Storage API token, stack URL, project ID
  4. Create .env from config/.env.template

Step 3: Register Tables

  1. Use the FastAPI admin API (POST /api/admin/tables/{id}) or webapp UI to register tables
  2. Tables are stored in DuckDB table_registry with source_type, bucket, source_table, query_mode
  3. For migration from old format: python scripts/migrate_registry_to_duckdb.py

Step 4: Docker Deployment

docker compose up          # Start app + scheduler
docker compose --profile full up  # Include telegram bot

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── app/auth/               # Authentication (FastAPI-based providers)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Architecture: extract.duckdb Contract

Every data source produces the same output:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

Remote table support (_remote_attach)

Extractors with remote/passthrough tables (query_mode='remote') include a _remote_attach table in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:

CREATE TABLE _remote_attach (
    alias     VARCHAR,  -- DuckDB alias used in views, e.g. 'kbc'
    extension VARCHAR,  -- Extension name, e.g. 'keboola'
    url       VARCHAR,  -- Connection URL
    token_env VARCHAR   -- Env-var name holding the auth token (NOT the token itself)
);

The orchestrator reads this table, installs/loads the extension, reads the token from the environment, and ATTACHes the external source. Views referencing kbc."bucket"."table" then resolve correctly. This mechanism is generic — any connector can use it.

The SyncOrchestrator scans /data/extracts/*/extract.duckdb, ATTACHes each into master analytics.duckdb, and creates views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)

Three source types:

  • Batch pull (Keboola): DuckDB extension downloads to parquet, scheduled
  • Remote attach (BigQuery): DuckDB BQ extension, no download, queries go to BQ
  • Real-time push (Jira): Webhooks update parquets incrementally

Configuration

Instance-specific config: config/instance.yaml (see example). Environment variables: .env (never committed). Table definitions: DuckDB table_registry table in system.duckdb.

Development

# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"

# Run FastAPI locally
uvicorn app.main:app --reload

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up

Business Metrics

Standardized metric definitions live in DuckDB (metric_definitions table). Import starter pack:

da metrics import docs/metrics/

For AI agents analyzing data:

Before computing any business metric, look up the canonical definition:

  1. da metrics list — find the relevant metric
  2. da metrics show revenue/mrr — read the SQL and business rules
  3. Use the SQL from the metric definition, adapt to the specific question

Never invent metric calculations — always use the canonical definitions.

Hybrid Queries (BigQuery + Local)

For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:

da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
         --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"

The --register-bq flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple --register-bq flags can be used for multiple BQ sources.

For complex SQL, use stdin mode:

echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | da query --stdin

Extensibility

Data Sources (extract.duckdb contract)

New connector = connectors/<name>/extractor.py producing extract.duckdb + data/. Must create _meta table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode. Orchestrator ATTACHes it automatically.

Authentication

Auth providers in app/auth/ (FastAPI-based):

  • Google: OAuth via Google
  • Email: Email magic link (itsdangerous token)
  • Desktop: JWT for API

Key Implementation Details

DuckDB Schema (src/db.py)

  • Schema v7 with auto-migration from v1→v2→v3→v4→v5→v6→v7 (v5 adds users.active, v6 adds personal_access_tokens, v7 adds personal_access_tokens.last_used_ip)
  • table_registry: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
  • sync_state, sync_history: track extraction progress
  • users, dataset_permissions, audit_log: auth + RBAC
  • System DB at {DATA_DIR}/state/system.duckdb
  • Analytics DB at {DATA_DIR}/analytics/server.duckdb

SyncOrchestrator (src/orchestrator.py)

  • rebuild(): scans extracts dir, ATTACHes all, creates master views, updates sync_state
  • rebuild_source(name): single source (used after Jira webhooks)
  • Thread-safe via _rebuild_lock

Connector Pattern

  • Keboola: connectors/keboola/extractor.py uses DuckDB Keboola extension, fallback to client.py
  • BigQuery: connectors/bigquery/extractor.py uses DuckDB BQ extension (remote-only, no download)
  • Jira: connectors/jira/webhook.pyincremental_transform.pyextract_init.py updates _meta
  • connectors/keboola/client.py: legacy Keboola Storage API wrapper (kept as fallback)

Config Loading

  1. config/loader.py loads instance.yaml
  2. app/instance_config.py exposes get_data_source_type(), get_value()
  3. Table config lives in DuckDB table_registry (not markdown files)

Files NOT to modify (stable infrastructure)

  • connectors/jira/file_lock.py - Advisory file locking
  • connectors/jira/transform.py - Core Jira transform logic
  • services/ws_gateway/ - WebSocket notification gateway

Vendor-agnostic OSS — no customer-specific content

This repo is the public OSS distribution. Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies. That includes:

  • Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
  • Cloud project IDs, internal hostnames, runbook paths from a particular install (/opt/<deployment>, <host>.<internal-domain>, prj-<org>-…, internal SA emails).
  • Cross-references to private repos (<private-org>/<private-repo>#NN). Describe the integration in generic terms or link to public examples instead.

When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (example.com, <your-host>, <install-dir>). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples — not hard-coded assumptions.

Customer-specific automation, hostnames, and identities live in private infra repos that consume this OSS. The OSS describes capabilities, defaults, and configuration knobs — not how a specific operator wired them up.

Git Commits & Pull Requests

  • Keep commit messages clean and concise
  • Do not include AI attribution in commits or PRs
  • Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (grep -niE '<token1>|<token2>|...'). If anything matches, generalize or remove it.