Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.
Find a file
Petr Simecek 4799119c81
feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52)
* feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support

Three coordinated changes that together unblock Keboola's internal Agnes
deployment from the foot-gun where the dev VM tracks `:dev` (= last push
from anyone in the upstream repo).

1. .github/workflows/keboola-deploy.yml — new workflow

   Triggered ONLY on `keboola-deploy-*` git tag pushes (not on every branch
   push like release.yml). Builds an image and publishes two GHCR tags:

     ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-<git-tag-suffix>
     ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-latest

   The Keboola dev VM pins to `keboola-deploy-latest`; an operator deploys
   by `git tag keboola-deploy-foo && git push origin keboola-deploy-foo`.
   Audit trail lives in git tags (immutable, who-tagged-what-when), no
   PR-cycle needed for each deploy.

   Doesn't touch Vojta/Minas/David workflow — release.yml still builds
   `:dev-<slug>` for every branch push as before.

2. Caddyfile — parametrize TLS directive via $CADDY_TLS env var

   PR #51 hardcoded cert-file mode (`tls /certs/fullchain.pem ...`) for
   Groupon's corporate CA flow. That broke the Let's Encrypt path the
   module previously supported. Now:

     CADDY_TLS unset (default) → cert-file mode (Groupon corp PKI)
     CADDY_TLS="tls user@x.com"  → Let's Encrypt auto-issue
     CADDY_TLS="tls internal"     → Caddy-managed self-signed (lab/dev)

   Single Caddyfile, three regimes, no per-deployment fork. Validated with
   `caddy validate` in all three modes.

3. customer-instance module — dev_instances TLS + auto-set CADDY_TLS

   - variables.tf: dev_instances object schema gains optional tls_mode +
     domain (mirroring prod_instance). Defaults to "none" + "" so existing
     callers without those fields keep current behavior.
   - startup-script.sh.tpl: when tls_mode="caddy" and DOMAIN is set, write
     CADDY_TLS=tls <ACME_EMAIL> (or "tls internal" when ACME_EMAIL empty)
     into /opt/agnes/.env. Caddy then picks it up and the Caddyfile
     substitution flips the cert source.

   For an LE deploy: set tls_mode="caddy", domain="agnes-dev.example.com",
   ensure DNS A-record points at the VM, and acme_email is set on the
   module (or seed_admin_email is, since acme_email defaults to it).

After this lands, tag as infra-v1.6.0 so downstream infra repos can bump
their module ref without needing the upstream change tracking.

* feat(deploy): fetch optional Google OAuth credentials from Secret Manager

Mirrors the existing keboola-storage-token / agnes-<customer>-jwt-secret
pattern: VM SA reads google-oauth-client-{id,secret} secrets at boot
(if they exist + IAM is wired by caller via runtime_secrets) and writes
them into /opt/agnes/.env. Empty / missing / 403 → silent fallback
to "" so password and email auth keep working untouched.

Pairs with downstream change in agnes-infra-keboola which adds the two
secret names to runtime_secrets, granting the Keboola VM SA secretAccessor
on them. Operator pre-creates the SM containers via gcloud secrets create
google-oauth-client-{id,secret} (one-time, out of band) — values stay
in SM forever; rotation = `gcloud secrets versions add`.

This unblocks the Keboola agnes-dev deploy from PR #3 (infra) — without
GOOGLE_CLIENT_{ID,SECRET} in .env, app/auth/providers/google.is_available()
returns False and the Google sign-in button never even appears.
2026-04-25 23:19:00 +02:00
.github/workflows feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52) 2026-04-25 23:19:00 +02:00
app chore(deploy): trust proxy headers + document HTTPS env vars (#48) 2026-04-24 08:52:53 +02:00
cli release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43) 2026-04-22 21:18:18 +02:00
config feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
connectors fix: strip HTML from table and column descriptions in OpenMetadata enricher 2026-04-09 18:42:37 +02:00
dev_docs docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture 2026-04-09 18:44:25 +02:00
docs feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
infra feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52) 2026-04-25 23:19:00 +02:00
scripts feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
services fix: make bot.py FileHandler resilient to missing log directory 2026-04-13 13:28:59 +02:00
src User management + PAT + CLI distribution + HTML auth redirect (#9 #10 #11 #12) (#28) 2026-04-22 14:24:28 +02:00
tests release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43) 2026-04-22 21:18:18 +02:00
.dockerignore refactor: consolidate deps into pyproject.toml, remove requirements.txt 2026-04-09 13:17:59 +02:00
.gitignore infra: add bootstrap-gcp.sh for per-customer GCP setup 2026-04-21 16:18:35 +02:00
ARCHITECTURE.md Update docs for modular architecture (auth/, services/, scripts/) 2026-03-09 13:11:40 +01:00
Caddyfile feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52) 2026-04-25 23:19:00 +02:00
CHANGELOG.md feat: multi-instance deployment — all 14 must-have items from spec 2026-04-10 11:57:42 +02:00
CLAUDE.md feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
docker-compose.ci.yml feat: multi-instance deployment — all 14 must-have items from spec 2026-04-10 11:57:42 +02:00
docker-compose.host-mount.yml fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test 2026-04-21 16:54:18 +02:00
docker-compose.local-dev.yml feat(dev): LOCAL_DEV_MODE for one-command local dev + magic-link fixes (#32) 2026-04-22 14:47:33 +02:00
docker-compose.override.yml chore(deploy): trust proxy headers + document HTTPS env vars (#48) 2026-04-24 08:52:53 +02:00
docker-compose.prod.yml fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test 2026-04-21 16:54:18 +02:00
docker-compose.test.yml chore(deploy): trust proxy headers + document HTTPS env vars (#48) 2026-04-24 08:52:53 +02:00
docker-compose.tls.yml feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
docker-compose.yml feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
Dockerfile chore(deploy): trust proxy headers + document HTTPS env vars (#48) 2026-04-24 08:52:53 +02:00
LICENSE OSS cleanup: remove internal references, harden deployment, add config env interpolation 2026-03-09 07:59:57 +01:00
Makefile feat(dev): make local-dev targets for one-keystroke LOCAL_DEV_MODE startup (#33) 2026-04-22 14:57:10 +02:00
pyproject.toml release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43) 2026-04-22 21:18:18 +02:00
pytest.ini test: add shared test infrastructure (fixtures, factories, assertions, mocks) 2026-04-12 11:05:35 +02:00
README.md feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
uv.lock chore(deps): bump python-multipart from 0.0.24 to 0.0.26 2026-04-21 13:26:19 +00:00

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)

Supported Data Sources

Source Mode Description
Keboola Batch pull DuckDB Keboola extension downloads tables to Parquet on a schedule
BigQuery Remote attach DuckDB BQ extension; queries execute in BigQuery, no local download
Jira Real-time push Webhook receiver updates Parquet files incrementally

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/grpn/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File Purpose
config/instance.yaml Instance-specific settings: branding, data source type, auth provider, Google domain
.env Secrets and environment variables — never committed
system.duckdb table_registry table Table definitions managed via POST /api/admin/tables/{id} or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Contributing

  1. Fork the repository and create a feature branch.
  2. Run pytest tests/ -v to verify all tests pass before opening a pull request.
  3. Keep commits focused and messages concise.
  4. Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.