agnes-the-ai-analyst/CLAUDE.md
Petr Simecek c25fd41bf7
feat(auth): Google Workspace groups on /profile + tag-triggered Keboola deploy workflow (#56)
* feat(auth): display Google Workspace groups on /profile

- Request cloud-identity.groups.readonly scope in Google OAuth
- Fetch groups via Cloud Identity API after callback; tolerate 4xx
  (non-Workspace tenants) and network errors — never break login
- Store result in Starlette session as google_groups
- Replace /profile redirect with a real profile page rendering
  account details (email, name, role) and the group list; show a
  friendly empty state when no groups are available
- Tests: helper parsing + 403 + exception paths; profile page
  smoke test; updated the old redirect test

* test: remove stale /profile redirect tests

Cherry-pick of Zdeněk's 4f7e4cd ("display Google Workspace groups on
/profile") replaces the /profile redirect with a real profile page —
but only updated one of three tests that expected the old behaviour.

These two tests in test_admin_tokens_ui.py and test_pat.py were left
asserting `/profile → 302 /tokens`, which now returns
`/profile → 302 /login?next=%2Fprofile` for unauth users (the standard
auth guard) or `/profile → 200 HTML` for authenticated users.

Removed both rather than patched — coverage for the new behaviour
already exists in tests/test_auth_providers.py (added by the same
commit). The /tokens render assertions in the deleted test_pat.py case
are redundant with test_admin_tokens_ui.py's own /tokens UI tests.

* fix(auth): Google groups search query needs parent + labels predicates

Cloud Identity Groups Search API returns 400 INVALID_ARGUMENT when the
CEL query lacks the required `parent == 'customers/<id>'` predicate AND
a `'<label>' in labels` membership predicate. Zdeněk's original 4f7e4cd
query had only `member_key_id == '<email>'` — every fetch silently
returned [] and the /profile groups list was always empty.

Fix: build the query with all three required pieces:
  parent == 'customers/my_customer'   (alias = caller's own Workspace
                                       org; no need to look up customer ID)
  member_key_id == '<email>'           (filter to this user's memberships)
  'cloudidentity.googleapis.com/groups.discussion_forum' in labels
                                       (Workspace mailing-list groups —
                                       the common case; security-group
                                       coverage is a follow-up)

Also: log the full error body (not truncated to 200 chars) and the
query string so the next time Google rejects something we can diagnose
in one log line instead of a re-deploy.

Caught when first agnes-dev login completed normally (HTTP 302) but app
log showed `Google groups fetch returned 400 for petr@keboola.com:
{"error":{"code":400,"message":"Request contains an invalid argument."}}`
on the same VM (kids-ai-data-analysis / agnes-dev.keboola.com).

Reference: https://cloud.google.com/identity/docs/reference/rest/v1/groups/search

* feat(web): add Profile link to user dropdown menu

The /profile page (Zdeněk's 4f7e4cd cherry-pick) renders a real profile
view including Google Workspace groups, but had no entry point in the
UI — users could only reach it by typing the URL manually. Add a
"Profile" menu item between the user header (email + role) and
"My tokens" so the page is discoverable.

Side effect: cleaned up the leftover `or _path.startswith('/profile')`
condition on the "My tokens" active class, which dated from the old
/profile → /tokens redirect (removed in c789617). Now each menu item
owns its own active state.

* fix: profile-link tests + .env quoting for CADDY_TLS

Two issues caught by Keboola's first agnes-dev deploy + agnes-auto-upgrade
cron run:

1. tests/test_web_ui.py — two negative assertions ("href=/profile" NOT in
   body) date from when /profile was a redirect-only stub. Now /profile
   is a real page (groups display) AND has a dropdown menu link, so the
   negative assertions flip to positive. Same for ">Profile<" text in
   the non-admin nav test.

2. startup-script.sh.tpl — CADDY_TLS line must be QUOTED in .env, because
   agnes-auto-upgrade.sh sources .env via `set -a; . .env; set +a` and
   bash treats `KEY=value with spaces` as `KEY=value` followed by `with`
   and `spaces` exec attempts. Symptom: cron log spam
   `/opt/agnes/.env: line 14: petr@keboola.com: command not found`,
   the cron exits non-zero, and no auto-upgrade ever happens. Caddy
   itself reads the value fine because docker-compose env_file=.env
   parses key=value properly without shell-evaluating the rest.

   Fix: emit `CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`.
   Both the cron source and docker-compose env_file accept the quoted
   form; cron stops failing.

* fix(auth): use searchTransitiveGroups + security label for non-admin user

Three bugs in the original cherry-pick + my prior fix attempt, all caught
by a stdlib probe script (scripts/debug/probe_google_groups.py) run
locally with a Playground-issued OAuth token:

1. Wrong endpoint. `groups:search` is the admin "find groups in org"
   endpoint and 400s for non-admin users regardless of query. Switched
   to `groups/-/memberships:searchTransitiveGroups` which is the
   user-perspective "what groups am I in" endpoint.

2. Wrong label. Querying with `cloudidentity.googleapis.com/groups.discussion_forum`
   returns 403 "Insufficient permissions to retrieve memberships" even
   on the new endpoint — Workspace policy denies non-admin reads of
   discussion-forum groups. Switching to `groups.security` returns 200
   with the actual membership list. Empirically every Workspace group
   at Keboola carries BOTH labels, so the security filter sees the full
   set anyway. Confirmed with the probe script.

3. Wrong response shape. `searchTransitiveGroups` returns
   {"memberships": [...]}, not {"groups": [...]}. Parser updated
   accordingly.

Also adds scripts/debug/probe_google_groups.py — stdlib-only standalone
probe that hits 6 candidate endpoints with a user OAuth token. Saved a
deploy cycle (~10 min) per query iteration; future API-syntax debugging
should start there.

Verified end-to-end: petr@keboola.com login on agnes-dev returns 5
groups (LIC-1PASSWORD, ROLE_ATLASSIAN_*, etc.) via the probe; once
deployed, the same will populate session["google_groups"] and render
on /profile.

* test(auth): update Google groups parser fixture to match searchTransitiveGroups shape

Mock payload was `{"groups": [...]}` (the shape `groups:search` returns).
After switching to `groups/-/memberships:searchTransitiveGroups` in the
prior commit, the actual response is `{"memberships": [...]}` and the
parser iterates that key. Test now mirrors the real shape.

The per-item structure (groupKey.id + displayName) is unchanged, so the
expected output dict stays the same: [{"id": "...", "name": "..."}].

* docs(auth): add docs/auth-groups.md — Google Workspace groups runbook

Captures the non-obvious bits: the GCP-side setup checklist (Cloud
Identity API + scope on consent screen + Internal user type), the
`security` vs `discussion_forum` label trap (the latter 403s for
non-admins, the former 200s — one of those is a 4-iteration debug
session and shouldn't have to be repeated), where groups are stored
(session, not DB) and how to refresh (re-login), plus how to use the
probe script for future API-syntax issues.

Deliberately stops short of explaining "what is Cloud Identity" or
"what is OAuth scope" — those belong in Google's own docs, not ours.

* docs(claude): document release workflows + module versioning + recreate trick

New "Release & deploy workflows" section in CLAUDE.md covers what didn't
exist anywhere in the repo before:

- Distinction between release.yml (auto-build per push) vs the new
  keboola-deploy.yml (tag-triggered, explicit deploy only) — plus when
  to use which (per-developer convenience vs shared dev VM safety)
- Module versioning (infra-vX.Y.Z) and the bump-after-merge dance
- The lifecycle.ignore_changes [metadata_startup_script] gotcha and how
  to force a recreate via workflow_dispatch's recreate_targets input

All generic — no customer hostnames, project IDs, IPs. Customer-specific
deploy steps belong in the consuming infra repo's README.

Also: cross-reference docs/auth-groups.md from the Authentication
section so future Claude sessions find the Workspace-groups runbook
without grepping.

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-26 00:56:44 +02:00

14 KiB

AI Data Analyst

Open-source data distribution platform for AI analytical systems. Extracts data from sources into DuckDB, serves via FastAPI, and distributes parquets to analysts who use Claude Code for local analysis.

First-Time Setup

When a user opens this project for the first time, guide them through interactive setup:

Step 1: Gather Information

Ask the user for:

  1. Company domain (e.g., "acme.com") - used for Google OAuth
  2. Data source type: keboola / bigquery / csv
  3. Instance name (e.g., "Acme Data Analyst")

Step 2: Generate Configuration

  1. Copy config/instance.yaml.example to config/instance.yaml
  2. Fill in values from Step 1
  3. If Keboola: ask for Storage API token, stack URL, project ID
  4. Create .env from config/.env.template

Step 3: Register Tables

  1. Use the FastAPI admin API (POST /api/admin/tables/{id}) or webapp UI to register tables
  2. Tables are stored in DuckDB table_registry with source_type, bucket, source_table, query_mode
  3. For migration from old format: python scripts/migrate_registry_to_duckdb.py

Step 4: Docker Deployment

docker compose up          # Start app + scheduler
docker compose --profile full up  # Include telegram bot

# HTTPS mode — Caddy + corporate-CA certs at /data/state/certs
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

See docs/DEPLOYMENT.mdTLS for cert provisioning + scripts/grpn/agnes-tls-rotate.sh (daily refetch from TLS_FULLCHAIN_URL, SIGUSR1 reload on diff, no-op when unchanged). The infra repo's startup.sh installs this as a systemd timer automatically.

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`da sync`, `da query`, `da admin`)
├── app/auth/               # Authentication (FastAPI-based providers)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── server/                 # Legacy deployment infrastructure
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Architecture: extract.duckdb Contract

Every data source produces the same output:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

Remote table support (_remote_attach)

Extractors with remote/passthrough tables (query_mode='remote') include a _remote_attach table in extract.duckdb so the orchestrator can re-ATTACH the external DuckDB extension at query time:

CREATE TABLE _remote_attach (
    alias     VARCHAR,  -- DuckDB alias used in views, e.g. 'kbc'
    extension VARCHAR,  -- Extension name, e.g. 'keboola'
    url       VARCHAR,  -- Connection URL
    token_env VARCHAR   -- Env-var name holding the auth token (NOT the token itself)
);

The orchestrator reads this table, installs/loads the extension, reads the token from the environment, and ATTACHes the external source. Views referencing kbc."bucket"."table" then resolve correctly. This mechanism is generic — any connector can use it.

The SyncOrchestrator scans /data/extracts/*/extract.duckdb, ATTACHes each into master analytics.duckdb, and creates views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (da sync)

Three source types:

  • Batch pull (Keboola): DuckDB extension downloads to parquet, scheduled
  • Remote attach (BigQuery): DuckDB BQ extension, no download, queries go to BQ
  • Real-time push (Jira): Webhooks update parquets incrementally

Configuration

Instance-specific config: config/instance.yaml (see example). Environment variables: .env (never committed). Table definitions: DuckDB table_registry table in system.duckdb.

Development

# Setup
python3 -m venv .venv && source .venv/bin/activate
uv pip install ".[dev]"

# Run FastAPI locally
uvicorn app.main:app --reload

# Run tests
pytest tests/ -v

# Trigger sync manually
curl -X POST http://localhost:8000/api/sync/trigger

# Docker
docker compose up

Business Metrics

Standardized metric definitions live in DuckDB (metric_definitions table). Import starter pack:

da metrics import docs/metrics/

For AI agents analyzing data:

Before computing any business metric, look up the canonical definition:

  1. da metrics list — find the relevant metric
  2. da metrics show revenue/mrr — read the SQL and business rules
  3. Use the SQL from the metric definition, adapt to the specific question

Never invent metric calculations — always use the canonical definitions.

Hybrid Queries (BigQuery + Local)

For tables too large to sync locally, use hybrid queries that JOIN local data with on-demand BigQuery results:

da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
         --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web WHERE date > '2026-01-01' GROUP BY 1"

The --register-bq flag executes a BigQuery subquery, loads the result into memory, and makes it available as a DuckDB view for the final SQL. Multiple --register-bq flags can be used for multiple BQ sources.

For complex SQL, use stdin mode:

echo '{"register_bq": {"traffic": "SELECT ..."}, "sql": "SELECT ..."}' | da query --stdin

Extensibility

Data Sources (extract.duckdb contract)

New connector = connectors/<name>/extractor.py producing extract.duckdb + data/. Must create _meta table with columns: table_name, description, rows, size_bytes, extracted_at, query_mode. Orchestrator ATTACHes it automatically.

Authentication

Auth providers in app/auth/ (FastAPI-based):

  • Google: OAuth via Google (Workspace group memberships pulled at sign-in — see docs/auth-groups.md for the GCP setup checklist + the security label gotcha)
  • Email: Email magic link (itsdangerous token)
  • Desktop: JWT for API

Release & deploy workflows

Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.

release.yml — auto-build on every push

Runs on every push to every branch.

  • Push to main:stable, :stable-YYYY.MM.N (CalVer).
  • Push to non-main <prefix>/<branch>:dev, :dev-YYYY.MM.N, :dev-<branch-slug>, and (when prefix isn't a Git Flow convention) :dev-<prefix>-latest alias.

VMs that pin to a floating tag (:dev, :dev-<prefix>-latest) auto-upgrade within ~5 min via the cron in agnes-auto-upgrade.sh. Convenient for per-developer dev VMs; footgun for shared dev VMs (last pusher wins, regardless of who).

keboola-deploy.yml — tag-triggered, explicit deploy only

Runs only on git tags matching keboola-deploy-*. Publishes:

  • :keboola-deploy-<git-tag-suffix> — immutable, tied to the exact commit
  • :keboola-deploy-latest — floating alias the consumer pins to

Operator workflow:

git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM

Use this when the consumer (e.g. a customer dev VM) needs deploy-when-I-decide semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins image_tag = "keboola-deploy-latest" on the relevant VM.

Module versioning

The customer-instance Terraform module under infra/modules/customer-instance/ is published as infra-vMAJOR.MINOR.PATCH git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y".

After merging a module change to main:

git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z

Replacing a VM after a startup-script change

Module sets lifecycle { ignore_changes = [metadata_startup_script] } on google_compute_instance.vm so normal terraform apply doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address — typical workflow_dispatch input is recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'.

Key Implementation Details

DuckDB Schema (src/db.py)

  • Schema v7 with auto-migration from v1→v2→v3→v4→v5→v6→v7 (v5 adds users.active, v6 adds personal_access_tokens, v7 adds personal_access_tokens.last_used_ip)
  • table_registry: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
  • sync_state, sync_history: track extraction progress
  • users, dataset_permissions, audit_log: auth + RBAC
  • System DB at {DATA_DIR}/state/system.duckdb
  • Analytics DB at {DATA_DIR}/analytics/server.duckdb

SyncOrchestrator (src/orchestrator.py)

  • rebuild(): scans extracts dir, ATTACHes all, creates master views, updates sync_state
  • rebuild_source(name): single source (used after Jira webhooks)
  • Thread-safe via _rebuild_lock

Connector Pattern

  • Keboola: connectors/keboola/extractor.py uses DuckDB Keboola extension, fallback to client.py
  • BigQuery: connectors/bigquery/extractor.py uses DuckDB BQ extension (remote-only, no download)
  • Jira: connectors/jira/webhook.pyincremental_transform.pyextract_init.py updates _meta
  • connectors/keboola/client.py: legacy Keboola Storage API wrapper (kept as fallback)

Config Loading

  1. config/loader.py loads instance.yaml
  2. app/instance_config.py exposes get_data_source_type(), get_value()
  3. Table config lives in DuckDB table_registry (not markdown files)

Files NOT to modify (stable infrastructure)

  • connectors/jira/file_lock.py - Advisory file locking
  • connectors/jira/transform.py - Core Jira transform logic
  • services/ws_gateway/ - WebSocket notification gateway

Vendor-agnostic OSS — no customer-specific content

This repo is the public OSS distribution. Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies. That includes:

  • Specific deployments or brands (private VM names, internal product brands, organization names that aren't already public sponsors).
  • Cloud project IDs, internal hostnames, runbook paths from a particular install (/opt/<deployment>, <host>.<internal-domain>, prj-<org>-…, internal SA emails).
  • Cross-references to private repos (<private-org>/<private-repo>#NN). Describe the integration in generic terms or link to public examples instead.

When you motivate a change, frame it abstractly ("behind a TLS-terminating reverse proxy", "in containerized deploys") rather than naming a specific operator. When you show examples, use placeholders (example.com, <your-host>, <install-dir>). When config has reasonable defaults pulled from one deployment's habits, generalize them or surface them as documented examples — not hard-coded assumptions.

Customer-specific automation, hostnames, and identities live in private infra repos that consume this OSS. The OSS describes capabilities, defaults, and configuration knobs — not how a specific operator wired them up.

Git Commits & Pull Requests

  • Keep commit messages clean and concise
  • Do not include AI attribution in commits or PRs
  • Before opening a PR, scan the diff and the PR body for the customer-specific tokens listed above (grep -niE '<token1>|<token2>|...'). If anything matches, generalize or remove it.