AI-Cognitive-Leap/agnes-the-ai-analyst

Fork of keboola/agnes-the-ai-analyst (via manana2520 GitHub fork). Develop here, push to GitHub fork to open upstream PRs.

Agnes — AI Data Analyst

Agnes is an open-source data distribution platform for AI analytical systems. It extracts data from configured sources into DuckDB, serves it via a FastAPI backend, and distributes Parquet files to analysts who query them locally using Claude Code and DuckDB.

Each data source produces a self-describing extract.duckdb file. The SyncOrchestrator attaches all extract databases into a master analytics.duckdb, making every table available through a unified view layer without copying data unnecessarily.

Architecture: extract.duckdb Contract

Every connector produces the same output structure:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

The orchestrator scans /data/extracts/*/extract.duckdb, attaches each into analytics.duckdb, and creates master views.

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │   Jira       │
│  extractor   │  │  extractor   │  │  webhooks    │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
   extract.duckdb    extract.duckdb    extract.duckdb
   + data/*.parquet  (views → BQ)      + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
              SyncOrchestrator.rebuild()
              ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┴──────────┐
              ▼                     ▼
           FastAPI                 CLI
           (serve)            (agnes pull)
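The scan-and-attach step described above can be sketched as follows. This is a minimal illustration, not the actual SyncOrchestrator code: the function name `plan_attach_statements` and the choice of the source directory name as the attach alias are assumptions. It only builds the DuckDB `ATTACH` statements as strings, so it runs without a database.

```python
from pathlib import Path


def plan_attach_statements(extracts_root: Path) -> list[str]:
    """Return one ATTACH statement per extract.duckdb found under extracts_root."""
    statements = []
    # Sorted so the plan is deterministic regardless of filesystem order.
    for db_path in sorted(extracts_root.glob("*/extract.duckdb")):
        # Assumption: the attach alias is the source directory name.
        source_name = db_path.parent.name
        statements.append(
            f"ATTACH '{db_path.as_posix()}' AS \"{source_name}\" (READ_ONLY)"
        )
    return statements
```

Calling `plan_attach_statements(Path("/data/extracts"))` would yield one statement per source directory that contains an extract.duckdb. Directories without one are skipped.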

Supported Data Sources

Mode	Distribution	Sources	Use when
Batch pull (`local`)	Parquet on disk, scheduled	Keboola	Source has a native bulk-export and the table fits on disk
Materialized SQL (`materialized`)	Parquet on disk, scheduled query	BigQuery, Keboola	Source table is too large to mirror as-is; you want a curated subset / aggregate on disk
Remote attach (`remote`)	View only, no download	BigQuery	Table is too large to materialize; latency cost of remote query is acceptable
Real-time push	Incremental parquet	Jira	Source is event-driven and you need sub-minute freshness

The first three modes are what agnes pull distributes to analysts. The fourth differs only on the ingestion side, which runs on the server: analysts query Jira data through the same parquets that agnes pull distributes.

Admins manage per-source registrations through the /admin/tables UI (per-connector tabs for BigQuery / Keboola / Jira) or the agnes admin register-table CLI. Each table row has a "Manage access" link that deep-links to /admin/access, where tables are granted to user groups via resource_grants(group, ResourceType.TABLE, table_id).

Analysts get a closed loop with Claude Code: agnes init writes <workspace>/.claude/settings.json with SessionStart (agnes pull --quiet) and SessionEnd (agnes push --quiet) hooks so every Claude Code session starts with fresh RBAC-filtered parquets and ends with the session log uploaded back.

Adding a new source means creating connectors/<name>/extractor.py that produces extract.duckdb with a _meta table (table_name, description, rows, size_bytes, extracted_at, query_mode). The orchestrator attaches it automatically.
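As a rough sketch of the `_meta` contract, the snippet below models one row and generates the SQL a new extractor might run. The column names come from the list above. The SQL types, the `MetaRow` class, and the two helper functions are assumptions made for illustration, not the project's actual schema code.

```python
from dataclasses import astuple, dataclass, fields
from datetime import datetime

# Column names follow the README; the SQL types are assumptions.
META_COLUMN_TYPES = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}


@dataclass(frozen=True)
class MetaRow:
    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str  # e.g. "local", "materialized" or "remote"


def meta_ddl() -> str:
    """Build the CREATE TABLE statement, quoting every column name."""
    cols = ", ".join(f'"{name}" {sql_type}' for name, sql_type in META_COLUMN_TYPES.items())
    return f"CREATE TABLE IF NOT EXISTS _meta ({cols})"


def meta_insert(row: MetaRow) -> tuple[str, tuple]:
    """Return a parameterized INSERT and its values, in column order."""
    placeholders = ", ".join("?" for _ in fields(row))
    return f"INSERT INTO _meta VALUES ({placeholders})", astuple(row)
```

An extractor could pass both results to its DuckDB connection's `execute` call after writing its parquet files or views.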

Quick Start with Docker

# Clone the repository
git clone https://github.com/keboola/agnes-the-ai-analyst.git
cd agnes-the-ai-analyst

# Copy and edit configuration
cp config/instance.yaml.example config/instance.yaml
cp config/.env.template .env
# Edit both files for your environment

# Start the app and scheduler
docker compose up

# Start with all optional services (Telegram bot, etc.)
docker compose --profile full up

# Start with TLS (Caddy on :443 with corporate-CA certs from /data/state/certs)
docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.tls.yml \
    --profile tls up -d

Once running, the FastAPI app is available at http://localhost:8000 (or https://$DOMAIN in TLS mode). See docs/DEPLOYMENT.md for cert provisioning + auto-rotation via scripts/ops/agnes-tls-rotate.sh. Trigger a manual sync:

curl -X POST http://localhost:8000/api/sync/trigger
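The same trigger can be issued from Python using only the standard library. This is a sketch: the helper name `build_sync_trigger_request` is hypothetical, and it assumes the endpoint accepts an unauthenticated POST exactly as the curl call above does. A real deployment may require credentials.

```python
from urllib.request import Request


def build_sync_trigger_request(base_url: str = "http://localhost:8000") -> Request:
    """Build (but do not send) the POST request that triggers a manual sync."""
    return Request(f"{base_url.rstrip('/')}/api/sync/trigger", method="POST")
```

To send it, pass the result to `urllib.request.urlopen(...)` while the app is running.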

Local sync & auto-update

Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. agnes pull is the distribution path:

agnes pull             # delta-pull: manifest → MD5 compare → download changed → rebuild views
agnes pull --quiet     # same, no progress output (for hooks/cron)
agnes push             # push session jsonl + CLAUDE.local.md back to the server
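The "manifest → MD5 compare → download changed" part of the delta-pull can be sketched as below. The manifest shape (a mapping from relative file name to MD5 hex digest) and the function names are assumptions for illustration. The real agnes pull manifest may carry more fields.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets are not read into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_download(manifest: dict[str, str], local_dir: Path) -> list[str]:
    """Return manifest entries that are missing locally or whose MD5 differs.

    Assumed manifest shape: relative file name -> expected MD5 hex digest.
    """
    stale = []
    for name, expected_md5 in sorted(manifest.items()):
        local_path = local_dir / name
        if not local_path.is_file() or file_md5(local_path) != expected_md5:
            stale.append(name)
    return stale
```

Only the names returned by `files_to_download` would need to be fetched before the local views are rebuilt.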

agnes init writes Claude Code lifecycle hooks into <workspace>/.claude/settings.json:

SessionStart → agnes pull --quiet — fresh data on every session
SessionEnd → agnes push --quiet — uploads notes and session log

Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
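For orientation, the hook configuration might look roughly like the structure built below. This is a sketch that assumes Claude Code's documented hooks layout (an event name mapped to a list of entries, each holding command hooks). The exact file that agnes init writes may differ, and `build_hook_settings` is a hypothetical helper.

```python
import json


def build_hook_settings() -> dict:
    """Build a settings dict with the two lifecycle hooks named in the README."""

    def command_hook(command: str) -> list[dict]:
        # One entry with a single shell-command hook (assumed layout).
        return [{"hooks": [{"type": "command", "command": command}]}]

    return {
        "hooks": {
            "SessionStart": command_hook("agnes pull --quiet"),
            "SessionEnd": command_hook("agnes push --quiet"),
        }
    }


# Print the JSON that would go into <workspace>/.claude/settings.json.
print(json.dumps(build_hook_settings(), indent=2))
```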

Admin: which tables auto-sync to whom

The auto-sync set per analyst is the intersection of:

Tables with query_mode IN ('local', 'materialized') — these have parquets on disk and end up in the manifest
Tables granted to one of the analyst's groups via resource_grants(group, ResourceType.TABLE, table_id) (see docs/RBAC.md)

To enroll a new table for auto-sync, register it (or update its query_mode) and grant it to the relevant groups in /admin/access. New analysts get the same set on their next agnes pull.
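The intersection rule can be illustrated with plain sets. The sketch below is not the server's implementation: `RegisteredTable`, `auto_sync_set`, and the `(group, table_id)` grant tuples are simplified stand-ins for the table registry and resource_grants.

```python
from dataclasses import dataclass

# Modes that produce parquets on disk, per the list above.
SYNCED_QUERY_MODES = {"local", "materialized"}


@dataclass(frozen=True)
class RegisteredTable:
    table_id: str
    query_mode: str


def auto_sync_set(tables, grants, analyst_groups) -> set[str]:
    """Return table ids that are both on disk and granted to one of the analyst's groups.

    tables: iterable of RegisteredTable
    grants: iterable of (group, table_id) pairs (simplified stand-in for resource_grants)
    analyst_groups: set of group names the analyst belongs to
    """
    on_disk = {t.table_id for t in tables if t.query_mode in SYNCED_QUERY_MODES}
    granted = {table_id for group, table_id in grants if group in analyst_groups}
    return on_disk & granted
```

For example, a table registered as `remote` never appears in the result even when it is granted, and a `local` table granted only to another group is excluded as well.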

For BigQuery, register a query_mode='materialized' table with a SQL body:

agnes admin register-table orders_90d \
    --source-type bigquery \
    --query-mode materialized \
    --query @docs/queries/orders_90d.sql \
    --schedule "every 6h"

The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next agnes pull. Cost guardrail: data_source.bigquery.max_bytes_per_materialize (default 10 GiB); a query whose BQ dry-run estimate exceeds this limit is skipped.
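The guardrail check amounts to a single comparison, sketched below. The function name is hypothetical, and treating an estimate exactly equal to the limit as allowed is an assumption. The dry-run estimate itself would come from BigQuery.

```python
# 10 GiB, the default quoted above for max_bytes_per_materialize.
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * 1024**3


def should_materialize(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> bool:
    """Return True when the dry-run estimate is within the configured limit.

    Assumption: an estimate exactly equal to the limit is still allowed.
    """
    return estimated_bytes <= max_bytes
```

A scheduler tick would skip the materialization and keep the previous parquet whenever this returns False.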

Development Setup

# Create and activate virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install dependencies
uv pip install ".[dev]"

# Run FastAPI locally with hot reload
uvicorn app.main:app --reload

# Run the test suite
pytest tests/ -v

Project Structure

├── src/                    # Core engine
│   ├── db.py               # DuckDB schema (system.duckdb, analytics.duckdb)
│   ├── orchestrator.py     # SyncOrchestrator — ATTACHes extract.duckdb files
│   ├── repositories/       # DuckDB-backed CRUD (sync_state, table_registry, users, etc.)
│   ├── profiler.py         # Data profiling
│   └── catalog_export.py   # OpenMetadata catalog export
├── app/                    # FastAPI application
│   ├── main.py             # App setup, router registration
│   ├── api/                # REST API (sync, data, catalog, admin, auth)
│   ├── auth/               # Auth providers (Google OAuth, email magic link, desktop JWT)
│   └── web/                # HTML dashboard routes
├── connectors/             # Data source connectors (extract.duckdb contract)
│   ├── keboola/            # Keboola: extractor.py (DuckDB extension) + client.py (fallback)
│   ├── bigquery/           # BigQuery: extractor.py (remote-only via DuckDB BQ extension)
│   └── jira/               # Jira: webhook + incremental parquet → extract.duckdb
├── cli/                    # CLI tool (`agnes pull`, `agnes query`, `agnes admin`)
├── services/               # Standalone services (scheduler, telegram_bot, ws_gateway, etc.)
├── scripts/                # Utility + migration scripts
├── config/                 # Configuration templates (instance.yaml.example)
├── docs/                   # Documentation + metric YAML definitions
└── tests/                  # Test suite (633 tests)

Configuration

File	Purpose
`config/instance.yaml`	Instance-specific settings: branding, data source type, auth provider, Google domain
`.env`	Secrets and environment variables — never committed
`system.duckdb` `table_registry` table	Table definitions managed via `POST /api/admin/register-table` (or `PUT /api/admin/registry/{id}` to update) or the web UI

Copy the example to get started:

cp config/instance.yaml.example config/instance.yaml

See config/instance.yaml.example for all available options.

Documentation

Full index: docs/README.md — every doc, organized by audience (analyst / operator / developer).

Key entry points:

Quickstart — local development setup
Onboarding Guide — end-to-end Terraform deployment into a GCP project (recommended for production)
Deployment Guide — chooses between Terraform and Docker Compose; covers OSS self-host
Configuration Reference — instance.yaml, env vars, per-instance options
Architecture — orchestrator, extractors, DB layout

Contributing

Fork the repository and create a feature branch.
Run pytest tests/ -v to verify all tests pass before opening a pull request.
Keep commits focused and messages concise.
Open a pull request against main with a clear description of the change.

For bugs and feature requests, open a GitHub issue.

License

This project is licensed under the MIT License.