AI-Cognitive-Leap/agnes-the-ai-analyst

Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Vojtech Rysanek 32c8ea601a	2026-05-06 11:24:14 +04:00

fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures

The DuckDB BigQuery extension defaults bq_query_timeout_ms to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. Agnes already has apply_bq_session_settings() that bumps it to 600 s (configurable via data_source.bigquery.query_timeout_ms), but two regressions let the 90 s default leak through to live queries:

1. apply_bq_session_settings() swallowed every Exception silently. If the BigQuery extension wasn't loaded on the connection yet, or the installed extension version didn't recognise the setting, the SET would fail and the function would return without surfacing the problem. Operators saw 90 s timeouts on 'agnes query --remote' with no log line explaining why.
2. The call sites in src/db.py:_reattach_remote_extensions and src/orchestrator.py:_remote_attach only invoked apply_bq_session_settings on the metadata-token branch (token_env empty, the BqAccess contract). The token-based and no-auth branches ran ATTACH against the BigQuery extension without ever applying the timeout setting, so any BQ source registered with an explicit token_env, or with no auth env at all, fell back to the 90 s default.

Fix:
- apply_bq_session_settings now logs WARNING on each failure path (instance_config import error, non-numeric value, SET execution failure, readback error). It also verifies the setting actually landed via SELECT current_setting('bq_query_timeout_ms') and logs WARNING when the readback disagrees with the requested value, which catches the silent-ignore case some extension versions exhibit.
- Both _reattach_remote_extensions (src/db.py) and _remote_attach (src/orchestrator.py) now call apply_bq_session_settings on every branch that ATTACHes a BigQuery alias, not only the metadata-token branch. Idempotent: calling it twice on the metadata-token path is a no-op SET.

Tests:
- Extended the _RecordingConn fixture to support .fetchone() so the readback assertion path works. Updated existing call-shape assertions to expect the SELECT current_setting readback alongside the SET. Added two new tests covering the WARNING surfaces for SET failure and readback mismatch; these are regression guards for the silent-fallback bug this PR addresses.
- Full BQ-touching suite (398 tests) passes.

.github	fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140)	2026-04-30 09:42:27 +02:00
app	Merge remote-tracking branch 'origin/main' into pr180-review	2026-05-06 07:27:25 +02:00
cli	Merge remote-tracking branch 'origin/main' into pr180-review	2026-05-06 07:27:25 +02:00
config	feat: clean CLI errors + init progress + skip-materialize + claude.md catalog pointer	2026-05-05 18:11:59 +02:00
connectors	fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures	2026-05-06 11:24:14 +04:00
dev_docs	chore(docs): replace stale `da` verbs and vendor-specific install paths	2026-05-04 21:22:19 +02:00
docs	fix: Devin Review on #194 round 2 — 3 BUG-class findings	2026-05-05 20:02:50 +02:00
infra	refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149)	2026-04-30 21:40:25 +02:00
scripts	feat: STATE_DIR env var + flat-mount overlay (parallel disks)	2026-05-05 19:28:07 +02:00
services	fix(corporate-memory): CLI catches fail-fast ValueError, exits 1 with clean message (Devin Review on #179)	2026-05-05 06:45:10 +02:00
src	fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures	2026-05-06 11:24:14 +04:00
tests	fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures	2026-05-06 11:24:14 +04:00
.dockerignore	refactor: consolidate deps into pyproject.toml, remove requirements.txt	2026-04-09 13:17:59 +02:00
.gitignore	chore(.gitignore): allowlist cli/lib/ from generic lib/ rule (Task 7 follow-up)	2026-05-04 17:54:00 +02:00
.pre-commit-config.yaml	feat(ci+tests): deploy safety audit - linting, rollback, smoke tests, 50+ new tests (#120)	2026-04-29 09:18:55 +02:00
ARCHITECTURE.md	fix: address Devin Review findings — incomplete renames + estimate guard	2026-05-04 20:05:06 +02:00
Caddyfile	fix: Devin Review on #188 — try_files fallback + auto-upgrade ordering	2026-05-05 17:24:42 +02:00
CHANGELOG.md	Merge remote-tracking branch 'origin/main' into pr180-review	2026-05-06 07:27:25 +02:00
CLAUDE.md	feat(store): /store + /my-ai-stack — community marketplace + per-user composition	2026-05-05 02:53:49 +02:00
docker-compose.ci.yml	feat: multi-instance deployment — all 14 must-have items from spec	2026-04-10 11:57:42 +02:00
docker-compose.dev.yml	fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)	2026-04-28 19:57:30 +02:00
docker-compose.flat-mount.yml	fix: Devin Review on #194 round 2 — 3 BUG-class findings	2026-05-05 20:02:50 +02:00
docker-compose.host-mount.yml	fix: Devin Review on #194 round 2 — 3 BUG-class findings	2026-05-05 20:02:50 +02:00
docker-compose.local-dev.yml	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)	2026-04-26 16:48:55 +02:00
docker-compose.prod.yml	fix(compose): drop corporate-memory + session-collector services (#176)	2026-05-04 23:59:44 +02:00
docker-compose.test.yml	chore(deploy): trust proxy headers + document HTTPS env vars (#48)	2026-04-24 08:52:53 +02:00
docker-compose.tls.yml	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
docker-compose.yml	feat(caddy): file_server for parquet downloads — bypass uvicorn	2026-05-05 16:41:33 +02:00
Dockerfile	refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149)	2026-04-30 21:40:25 +02:00
LICENSE
Makefile	fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)	2026-04-28 19:57:30 +02:00
pyproject.toml	Merge remote-tracking branch 'origin/main' into pr180-review	2026-05-06 07:27:25 +02:00
pytest.ini	feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening	2026-04-28 14:25:04 +02:00
README.md	fix: address Devin Review findings — incomplete renames + estimate guard	2026-05-04 20:05:06 +02:00
uv.lock	feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136)	2026-04-29 22:54:21 +02:00

README.md

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          FastAPI      CLI
          (serve)    (agnes pull)
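
The scan-and-attach step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual SyncOrchestrator code: the helper names (discover_extracts, attach_statements, master_view_statement), the use of the directory name as the ATTACH alias, and the READ_ONLY option are all assumptions made for the example. It only builds the SQL text, so it runs without DuckDB installed.

```python
from pathlib import Path


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(text: str) -> str:
    """Quote a SQL string literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def discover_extracts(extracts_root: Path) -> dict[str, Path]:
    """Map source name -> extract.duckdb path for every source directory.

    Directories without an extract.duckdb file are skipped.
    """
    found = sorted(extracts_root.glob("*/extract.duckdb"))
    return {path.parent.name: path for path in found}


def attach_statements(extracts: dict[str, Path]) -> list[str]:
    """Build one ATTACH statement per extract (alias = source name, assumed)."""
    return [
        f"ATTACH {quote_literal(path.as_posix())} AS {quote_ident(name)} (READ_ONLY);"
        for name, path in extracts.items()
    ]


def master_view_statement(alias: str, table: str) -> str:
    """Build a master view that forwards to a table in an attached extract."""
    return (
        f"CREATE OR REPLACE VIEW {quote_ident(table)} AS "
        f"SELECT * FROM {quote_ident(alias)}.{quote_ident(table)};"
    )


if __name__ == "__main__":
    for statement in attach_statements(discover_extracts(Path("/data/extracts"))):
        print(statement)
```

Generating the statements separately from executing them keeps the discovery logic easy to test; the real orchestrator may well combine both steps.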

Supported Data Sources

Mode	Distribution	Sources	Use when
Batch pull (`local`)	Parquet on disk, scheduled	Keboola	Source has a native bulk-export and the table fits on disk
Materialized SQL (`materialized`)	Parquet on disk, scheduled query	BigQuery, Keboola	Source table is too large to mirror as-is; you want a curated subset / aggregate on disk
Remote attach (`remote`)	View only, no download	BigQuery	Table is too large to materialize; latency cost of remote query is acceptable
Real-time push	Incremental parquet	Jira	Source is event-driven and you need sub-minute freshness

The first three modes are what agnes pull distributes to analysts. The fourth is server-side only: Jira webhooks write incremental parquets on the server, and analysts query that data through the same agnes pull-distributed parquets.

Admins manage per-source registrations through the /admin/tables UI (per-connector tabs for BigQuery / Keboola / Jira) or the agnes admin register-table CLI. Each row's "Manage access" link deep-links to /admin/access, where tables are granted to user groups via resource_grants(group, ResourceType.TABLE, table_id).

Analysts get a closed loop with Claude Code: agnes init writes <workspace>/.claude/settings.json with SessionStart (agnes pull --quiet) and SessionEnd (agnes push --quiet) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
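
To make the _meta contract concrete, here is a small sketch of one _meta row as a Python dataclass, plus a check that a connector's _meta table exposes all six columns. The column names come from the contract above; the Python types, the MetaRow and missing_meta_columns names, and the set of accepted query_mode values (taken from the modes named in the table of supported sources) are assumptions for illustration.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone

# Assumed from the modes named in this README; the real set may differ.
QUERY_MODES = {"local", "materialized", "remote"}


@dataclass(frozen=True)
class MetaRow:
    """One row of the _meta table (column types are assumptions)."""

    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str

    def __post_init__(self) -> None:
        if self.query_mode not in QUERY_MODES:
            raise ValueError(f"unknown query_mode: {self.query_mode!r}")
        if self.rows < 0 or self.size_bytes < 0:
            raise ValueError("rows and size_bytes must be non-negative")


# Column names in contract order, derived from the dataclass definition.
META_COLUMNS = tuple(field.name for field in fields(MetaRow))


def missing_meta_columns(actual_columns: list[str]) -> list[str]:
    """Return the contract columns that a connector's _meta table lacks."""
    present = set(actual_columns)
    return [name for name in META_COLUMNS if name not in present]


if __name__ == "__main__":
    row = MetaRow("orders", "Order facts", 1200, 48_000,
                  datetime.now(timezone.utc), "local")
    print(row)
    print(missing_meta_columns(["table_name", "rows"]))
```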

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/ops/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger

Local sync & auto-update

Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. agnes pull is the distribution path:

agnes pull             # delta-pull: manifest → MD5 compare → download changed → rebuild views
agnes pull --quiet     # same, no progress output (for hooks/cron)
agnes push             # push session jsonl + CLAUDE.local.md back to the server
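
The "manifest → MD5 compare → download changed" step can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the agnes implementation: the manifest is modelled as a plain mapping of file name to MD5 hex digest, and md5_of and files_to_download are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_download(manifest: dict[str, str], local_dir: Path) -> list[str]:
    """Return manifest entries that are missing locally or whose MD5 differs.

    `manifest` maps file name -> expected MD5 hex digest (assumed shape).
    """
    changed = []
    for name, remote_md5 in sorted(manifest.items()):
        local_path = local_dir / name
        if not local_path.is_file() or md5_of(local_path) != remote_md5:
            changed.append(name)
    return changed


if __name__ == "__main__":
    example_manifest = {"orders.parquet": "0" * 32}
    print(files_to_download(example_manifest, Path(".")))
```

Only the names returned by files_to_download would need to be fetched; unchanged files are left in place, which is what makes the pull a delta.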

agnes init writes Claude Code lifecycle hooks into <workspace>/.claude/settings.json:

SessionStart → agnes pull --quiet — fresh data on every session
SessionEnd → agnes push --quiet — uploads notes and session log

Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.

Admin: which tables auto-sync to whom

The auto-sync set per analyst is the intersection of:

Tables with query_mode IN ('local', 'materialized') — these have parquets on disk and end up in the manifest
Tables granted to one of the analyst's groups via resource_grants(group, ResourceType.TABLE, table_id) (see docs/RBAC.md)

To enroll a new table for auto-sync, register it (or update its query_mode) and grant it to the relevant groups in /admin/access. New analysts get the same set on their next agnes pull.
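
The intersection rule above maps directly onto set operations. In this sketch the registry is modelled as a mapping of table_id to query_mode and the grants as (group, table_id) pairs; both shapes, and the auto_sync_set name, are assumptions for illustration rather than the actual repository API.

```python
# Modes that leave parquets on disk and therefore appear in the manifest.
SYNCABLE_MODES = {"local", "materialized"}


def auto_sync_set(
    registry: dict[str, str],
    grants: set[tuple[str, str]],
    analyst_groups: set[str],
) -> set[str]:
    """Tables that are on disk AND granted to one of the analyst's groups.

    registry: table_id -> query_mode (assumed shape)
    grants:   (group, table_id) pairs for table resources (assumed shape)
    """
    on_disk = {table for table, mode in registry.items() if mode in SYNCABLE_MODES}
    granted = {table for group, table in grants if group in analyst_groups}
    return on_disk & granted


if __name__ == "__main__":
    registry = {"orders": "local", "orders_90d": "materialized", "events": "remote"}
    grants = {("analysts", "orders"), ("analysts", "events")}
    print(auto_sync_set(registry, grants, {"analysts"}))
```

In the example, events is granted but remote, so it is not part of the auto-sync set, while orders_90d is on disk but not granted to the analyst's group.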

For BigQuery, register a query_mode='materialized' table with a SQL body:

agnes admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due and writes the result as a parquet file, which the analyst picks up on the next agnes pull. Cost guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB). A materialization whose BQ dry-run estimate exceeds this limit is skipped.
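
The guardrail amounts to a single comparison between the dry-run estimate and the configured limit. A minimal sketch, assuming the estimate is available as a byte count and that an estimate exactly equal to the limit is still allowed (the boundary behaviour is not stated here); should_materialize is a hypothetical name.

```python
GIB = 1024 ** 3

# Mirrors the documented default of data_source.bigquery.max_bytes_per_materialize.
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB


def should_materialize(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> bool:
    """Return True when the dry-run estimate is within the cost limit.

    Treating an estimate equal to the limit as allowed is an assumption.
    """
    if estimated_bytes < 0:
        raise ValueError("estimated_bytes must be non-negative")
    return estimated_bytes <= max_bytes


if __name__ == "__main__":
    for estimate in (2 * GIB, 25 * GIB):
        verdict = "run" if should_materialize(estimate) else "skip"
        print(f"{estimate / GIB:.0f} GiB estimated -> {verdict}")
```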

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File	Purpose
`config/instance.yaml`	Instance-specific settings: branding, data source type, auth provider, Google domain
`.env`	Secrets and environment variables — never committed
`system.duckdb` `table_registry` table	Table definitions managed via `POST /api/admin/register-table` (or `PUT /api/admin/registry/{id}` to update) or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Hackathon TL;DR — condensed deploy + dev playbooks (for both humans and AI agents)
Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
Configuration Reference — instance.yaml, env vars, per-instance options
Architecture — orchestrator, extractors, DB layout
Quickstart — local development

Contributing

Fork the repository and create a feature branch.
Run pytest tests/ -v to verify all tests pass before opening a pull request.
Keep commits focused and messages concise.
Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.