Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.
Find a file
ZdenekSrotyr aa5921da67
release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223)
## Summary

- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.

Closes #155.
Closes #156.

## Test plan

- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.

## Notable design decisions

- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid scope for size+rows (verified live 2026-05-07; dataset-scoped doesn't exist). Region resolved from `instance.yaml.data_source.bigquery.location` → `bq.client().get_dataset(...)` → fall back to legacy `__TABLES__`.
- VIEW handling: TABLE_STORAGE returns no rows for views, fall through to `__TABLES__` (also empty) → `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. Null size signals analyst Claude to apply existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.

## Out of scope

- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).

## Release

Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/223" target="_blank">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
    <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->
2026-05-07 18:33:55 +02:00
.github fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140) 2026-04-30 09:42:27 +02:00
app release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
cli release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
config perf(bq): pool DuckDB BQ extension sessions to amortize INSTALL/LOAD/ATTACH cost 2026-05-06 13:06:25 +02:00
connectors release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
dev_docs chore(docs): replace stale da verbs and vendor-specific install paths 2026-05-04 21:22:19 +02:00
docs release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
infra infra(customer-instance): preserve operator AGNES_TAG / AGNES_TEMP_DIR (#214) 2026-05-07 11:36:36 +02:00
scripts Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190) 2026-05-07 12:12:14 +02:00
services release: 0.45.0 — easy-wins bundle (#84 #164 #177 #178 #203 #204) 2026-05-07 11:43:16 +02:00
src Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190) 2026-05-07 12:12:14 +02:00
tests release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
.dockerignore refactor: consolidate deps into pyproject.toml, remove requirements.txt 2026-04-09 13:17:59 +02:00
.gitignore chore(.gitignore): allowlist cli/lib/ from generic lib/ rule (Task 7 follow-up) 2026-05-04 17:54:00 +02:00
.pre-commit-config.yaml feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120) 2026-04-29 09:18:55 +02:00
ARCHITECTURE.md fix: address Devin Review findings — incomplete renames + estimate guard 2026-05-04 20:05:06 +02:00
Caddyfile fix: Devin Review on #188 — try_files fallback + auto-upgrade ordering 2026-05-05 17:24:42 +02:00
CHANGELOG.md release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
CLAUDE.md release: 0.46.3 — self-heal session pipeline + clearer diagnose (#220) 2026-05-07 17:41:22 +02:00
docker-compose.ci.yml feat: multi-instance deployment — all 14 must-have items from spec 2026-04-10 11:57:42 +02:00
docker-compose.dev.yml fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104) 2026-04-28 19:57:30 +02:00
docker-compose.flat-mount.yml fix: Devin Review on #194 round 2 — 3 BUG-class findings 2026-05-05 20:02:50 +02:00
docker-compose.host-mount.yml fix: Devin Review on #194 round 2 — 3 BUG-class findings 2026-05-05 20:02:50 +02:00
docker-compose.local-dev.yml release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70) 2026-04-26 16:48:55 +02:00
docker-compose.prod.yml fix(compose): drop corporate-memory + session-collector services (#176) 2026-05-04 23:59:44 +02:00
docker-compose.test.yml chore(deploy): trust proxy headers + document HTTPS env vars (#48) 2026-04-24 08:52:53 +02:00
docker-compose.tls.yml feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00
docker-compose.yml Keboola cutover: native parquet path + sync correctness + auto-discover protection (#190) 2026-05-07 12:12:14 +02:00
Dockerfile refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149) 2026-04-30 21:40:25 +02:00
LICENSE
Makefile fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104) 2026-04-28 19:57:30 +02:00
pyproject.toml release: 0.47.0 — source-agnostic catalog metadata + cache discipline (#223) 2026-05-07 18:33:55 +02:00
pytest.ini feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening 2026-04-28 14:25:04 +02:00
README.md fix: address Devin Review findings — incomplete renames + estimate guard 2026-05-04 20:05:06 +02:00
uv.lock chore(deps): bump python-multipart from 0.0.26 to 0.0.27 2026-05-07 09:09:45 +02:00

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (agnes pull)

Supported Data Sources

Mode Distribution Sources Use when
Batch pull (local) Parquet on disk, scheduled Keboola Source has a native bulk-export and the table fits on disk
Materialized SQL (materialized) Parquet on disk, scheduled query BigQuery, Keboola Source table is too large to mirror as-is; you want a curated subset / aggregate on disk
Remote attach (remote) View only, no download BigQuery Table is too large to materialize; latency cost of remote query is acceptable
Real-time push Incremental parquet Jira Source is event-driven and you need sub-minute freshness

The first three modes are what agnes pull distributes to analysts. The fourth is server-side only — analysts query Jira data through the same agnes pull-distributed parquets.

Admins manage per-source registrations through the /admin/tables UI (per-connector tabs for BigQuery / Keboola / Jira) or the agnes admin register-table CLI; per-row "Manage access" deep-links to /admin/access for granting tables to user groups via resource_grants(group, ResourceType.TABLE, table_id).

Analysts get a closed loop with Claude Code: agnes init writes <workspace>/.claude/settings.json with SessionStart (agnes pull --quiet) and SessionEnd (agnes push --quiet) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/ops/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger

Local sync & auto-update

Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. agnes pull is the distribution path:

agnes pull             # delta-pull: manifest → MD5 compare → download changed → rebuild views
agnes pull --quiet     # same, no progress output (for hooks/cron)
agnes push  # push session jsonl + CLAUDE.local.md back to the server

agnes init writes Claude Code lifecycle hooks into <workspace>/.claude/settings.json:

  • SessionStartagnes pull --quiet — fresh data on every session
  • SessionEndagnes push --quiet — uploads notes and session log

Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.

Admin: which tables auto-sync to whom

The auto-sync set per analyst is the intersection of:

  1. Tables with query_mode IN ('local', 'materialized') — these have parquets on disk and end up in the manifest
  2. Tables granted to one of the analyst's groups via resource_grants(group, ResourceType.TABLE, table_id) (see docs/RBAC.md)

To enroll a new table for auto-sync, register it (or update its query_mode) and grant it to the relevant groups in /admin/access. New analysts get the same set on their next agnes pull.

For BigQuery, register a query_mode='materialized' table with a SQL body:

agnes admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next agnes pull. Cost guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB) — operations exceeding the BQ dry-run estimate are skipped.

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File Purpose
config/instance.yaml Instance-specific settings: branding, data source type, auth provider, Google domain
.env Secrets and environment variables — never committed
system.duckdb table_registry table Table definitions managed via POST /api/admin/register-table (or PUT /api/admin/registry/{id} to update) or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Contributing

  1. Fork the repository and create a feature branch.
  2. Run pytest tests/ -v to verify all tests pass before opening a pull request.
  3. Keep commits focused and messages concise.
  4. Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.