Architecture
System Overview
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   Keboola    │  │   BigQuery   │  │     Jira     │
│  extractor   │  │  extractor   │  │   webhooks   │
│ (DuckDB ext) │  │ (remote BQ)  │  │ (incremental)│
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       ▼                 ▼                 ▼
 extract.duckdb    extract.duckdb    extract.duckdb
+ data/*.parquet   (views → BQ)     + data/*.parquet
       │                 │                 │
       └─────────────────┼─────────────────┘
                         ▼
            SyncOrchestrator.rebuild()
     ATTACH → master views in analytics.duckdb
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
           FastAPI      CLI
           (serve)  (agnes pull)
Three source types:
- Batch pull (Keboola): DuckDB extension downloads to parquet, scheduled
- Remote attach (BigQuery): DuckDB BQ extension, no download, queries go to BQ
- Real-time push (Jira): Webhooks update parquets incrementally
Components
1. Core Engine (src/)
DuckDB-backed data orchestration and state management.
| File | Role |
|---|---|
| `src/db.py` | DuckDB schema (system.duckdb v14, analytics.duckdb), auto-migration v1→…→v14 |
| `src/orchestrator.py` | SyncOrchestrator: ATTACHes extract.duckdb files, rebuilds master views |
| `src/orchestrator_security.py` | Extension allowlist, token-env validation, SQL string escaping |
| `src/identifier_validation.py` | Shared regex validators for SQL identifiers (used by orchestrator + extractors) |
| `src/remote_query.py` | RemoteQueryEngine: hybrid queries joining local + BigQuery data |
| `src/repositories/` | DuckDB-backed CRUD (sync_state, table_registry, users, knowledge, etc.) |
| `src/profiler.py` | Data profiling for catalog UI |
| `src/catalog_export.py` | OpenMetadata catalog export |
| `src/scheduler.py` | Schedule parsing (every 15m, daily 03:00) and is_table_due() |
| `src/rbac.py` | Dataset-access helpers (can_access_table, get_accessible_tables) |
| `src/marketplace.py` | Marketplace git-clone/sync + plugin manifest parsing |
| `src/marketplace_filter.py` | RBAC-filtered plugin resolution for ZIP/git channels |
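
The schedule formats listed for `src/scheduler.py` (`every 15m`, `daily 03:00`) can be illustrated with a minimal sketch. Only the name `is_table_due` and the two formats come from the table above; the signature, the `parse_schedule` helper, and the supported units are assumptions made for illustration and may differ from the real module.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical unit table; the real scheduler may support other units.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_schedule(spec: str):
    """Parse 'every 15m' or 'daily 03:00' into a (kind, value) tuple."""
    spec = spec.strip().lower()
    match = re.fullmatch(r"every (\d+)([mhd])", spec)
    if match:
        amount, unit = int(match.group(1)), _UNITS[match.group(2)]
        return ("interval", timedelta(**{unit: amount}))
    match = re.fullmatch(r"daily (\d{2}):(\d{2})", spec)
    if match:
        hour, minute = int(match.group(1)), int(match.group(2))
        if hour > 23 or minute > 59:
            raise ValueError(f"invalid time of day in schedule: {spec!r}")
        return ("daily", (hour, minute))
    raise ValueError(f"unsupported schedule: {spec!r}")


def is_table_due(spec: str, last_sync: Optional[datetime], now: datetime) -> bool:
    """Return True when a table with this schedule should be synced at `now`."""
    if last_sync is None:
        return True  # never synced, so always due
    kind, value = parse_schedule(spec)
    if kind == "interval":
        return now - last_sync >= value
    # Daily: due if the most recent scheduled time is later than the last sync.
    hour, minute = value
    scheduled = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if scheduled > now:
        scheduled -= timedelta(days=1)
    return last_sync < scheduled
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to unit-test without patching the clock.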
2. FastAPI Application (app/)
Unified web server for UI + REST API.
| File/Dir | Role |
|---|---|
| `app/main.py` | FastAPI app setup, router registration, startup hooks |
| `app/api/` | REST API endpoints (sync, data, catalog, admin, auth, query, memory, etc.) |
| `app/auth/` | Authentication: router, dependencies, PAT resolver, group sync |
| `app/auth/providers/` | Auth providers: Google OAuth, email magic link, password |
| `app/web/` | HTML dashboard routes + Jinja2 templates |
| `app/resource_types.py` | ResourceType StrEnum + RESOURCE_TYPES registry for RBAC |
3. Connectors (connectors/)
Each connector produces an extract.duckdb following a standard contract.
| Directory | Source Type | Mechanism |
|---|---|---|
| `connectors/keboola/` | Batch pull | DuckDB Keboola extension → parquet files |
| `connectors/bigquery/` | Remote attach | DuckDB BQ extension → views to BigQuery |
| `connectors/jira/` | Real-time push | Webhooks → incremental parquet updates |
| `connectors/openmetadata/` | Catalog | httpx client to OpenMetadata API |
| `connectors/llm/` | LLM routing | OpenAI-compatible API client |
extract.duckdb Contract
Every connector outputs to /data/extracts/{source_name}/:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
The _meta table (required):
CREATE TABLE _meta (
table_name VARCHAR,
description VARCHAR,
rows INTEGER,
size_bytes INTEGER,
extracted_at TIMESTAMP,
query_mode VARCHAR -- 'local' or 'remote'
);
Extracts that contain remote tables (query_mode='remote') must also include a _remote_attach table:
CREATE TABLE _remote_attach (
alias VARCHAR, -- DuckDB alias used in views, e.g. 'kbc'
extension VARCHAR, -- Extension name, e.g. 'keboola'
url VARCHAR, -- Connection URL
token_env VARCHAR -- Env-var name holding the auth token (NOT the token itself)
);
The SyncOrchestrator scans /data/extracts/*/extract.duckdb, ATTACHes each into the master analytics.duckdb, and creates views. For remote tables, it reads _remote_attach, installs/loads the extension, reads the token from the environment, and ATTACHes the external source.
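
The discovery and ATTACH step described above can be sketched as follows. This is an illustration under stated assumptions, not the actual `SyncOrchestrator` code: the helper names, the identifier regex, and the use of `READ_ONLY` are all hypothetical, and the sketch only builds SQL strings for local extracts (it does not execute them or handle `_remote_attach`).

```python
import re
from pathlib import Path
from typing import List

# Assumed identifier rule; the real validators live in src/identifier_validation.py.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def discover_extracts(root: Path) -> List[Path]:
    """Return every {root}/{source_name}/extract.duckdb, sorted by path."""
    return sorted(root.glob("*/extract.duckdb"))


def escape_sql_string(value: str) -> str:
    """Escape a value for use inside a single-quoted SQL string literal."""
    return value.replace("'", "''")


def build_attach_sql(extract_path: Path) -> str:
    """Build an ATTACH statement whose alias is the source directory name."""
    alias = extract_path.parent.name
    if not _IDENT.fullmatch(alias):
        raise ValueError(f"unsafe source name: {alias!r}")
    path_literal = escape_sql_string(extract_path.as_posix())
    return f"ATTACH '{path_literal}' AS {alias} (READ_ONLY)"


def build_view_sql(alias: str, table_name: str) -> str:
    """Build the master-view statement for one table listed in _meta."""
    for ident in (alias, table_name):
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    return (
        f"CREATE OR REPLACE VIEW {table_name} AS "
        f"SELECT * FROM {alias}.{table_name}"
    )
```

Validating identifiers before interpolating them matters here because aliases and view names cannot be bound as query parameters; only the file path is escaped as a string literal.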
4. CLI (cli/)
Command-line tool `agnes` for sync, query, and admin operations.
| Command | Role |
|---|---|
| `agnes pull` | Trigger data sync |
| `agnes query` | Run SQL against analytics.duckdb |
| `agnes admin group *` | Manage user groups |
| `agnes admin grant *` | Manage resource grants |
| `agnes admin register-table` | Register tables in table_registry |
| `agnes admin break-glass <user>` | Emergency admin access recovery |
| `agnes auth token *` | Manage personal access tokens |
| `agnes admin metrics *` | Business metric definitions |
| `agnes skills *` | List/show bundled skills |
5. Authentication (app/auth/)
FastAPI-based auth with pluggable providers.
| File | Role |
|---|---|
| `app/auth/router.py` | Auth routes (login, callback, bootstrap, token) |
| `app/auth/providers/google.py` | Google OAuth + Workspace group sync |
| `app/auth/providers/email.py` | Email magic link (atomic compare-and-swap consumption) |
| `app/auth/providers/password.py` | Password login + reset (with audit logging) |
| `app/auth/pat_resolver.py` | Personal Access Token validation (hash, expiry, revocation, IP audit) |
| `app/auth/access.py` | Authorization: require_admin, require_resource_access |
| `app/auth/group_sync.py` | fetch_user_groups(): Cloud Identity API client |
| `app/auth/dependencies.py` | get_current_user FastAPI dependency |
| `app/auth/jwt.py` | Desktop JWT auth (API-only) |
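
The PAT checks listed for `app/auth/pat_resolver.py` (hash, expiry, revocation) can be sketched roughly as below. The `TokenRecord` shape, the function names, and the choice of SHA-256 are assumptions for illustration; the actual resolver may hash and store tokens differently, and the IP audit step is omitted here.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass
class TokenRecord:
    """Hypothetical stored row: only the hash is persisted, never the raw token."""
    token_hash: str
    expires_at: Optional[datetime]
    revoked: bool = False


def hash_token(raw: str) -> str:
    """Hash a raw token (SHA-256 is an assumption, not confirmed by the source)."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def issue_token(expires_at: Optional[datetime] = None) -> Tuple[str, TokenRecord]:
    """Create a random token; return the raw value once plus the record to store."""
    raw = secrets.token_urlsafe(32)
    return raw, TokenRecord(token_hash=hash_token(raw), expires_at=expires_at)


def is_token_valid(raw: str, record: TokenRecord, now: Optional[datetime] = None) -> bool:
    """Check hash match (constant-time), revocation, then expiry."""
    now = now or datetime.now(timezone.utc)
    if not hmac.compare_digest(hash_token(raw), record.token_hash):
        return False
    if record.revoked:
        return False
    if record.expires_at is not None and now >= record.expires_at:
        return False
    return True
```

Storing only the hash means a leaked database row cannot be replayed as a credential, and `hmac.compare_digest` avoids leaking match position through timing.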
6. Standalone Services (services/)
Self-contained services, each with its own __main__.py, run via Docker Compose profiles.
| Directory | Role |
|---|---|
| `services/scheduler/` | Cron-like job runner (data-refresh, health-check, marketplaces) |
| `services/telegram_bot/` | Telegram notification bot + dispatch (opt-in, --profile full) |
| `services/ws_gateway/` | WebSocket gateway for desktop app |
| `services/corporate_memory/` | AI knowledge aggregation from analyst sessions |
| `services/session_collector/` | Claude Code session metadata collector |
7. Configuration (config/)
| File | Role |
|---|---|
| `config/instance.yaml.example` | Template with all options |
| `config/loader.py` | YAML loader with ${ENV_VAR} interpolation + required-field validation |
| `config/.env.template` | Secret variable placeholders |
Table definitions are stored in the DuckDB table_registry table (in system.duckdb), not in config files.
Config Loading Chain
config/instance.yaml
| (loaded by config/loader.py)
| (${ENV_VAR} references resolved from .env / environment)
| (required fields validated: instance.name, auth.allowed_domain, server.host, server.hostname)
v
app/instance_config.py
| (get_value() for safe nested access)
v
FastAPI app + templates
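
The interpolation and safe-access steps in the chain above can be sketched as follows. This is a minimal illustration that operates on an already-parsed YAML dict; the function signatures are assumptions (the real `get_value()` may take a dotted path or different arguments), and leaving an unresolved `${VAR}` in place is a guess about behavior the source does not specify beyond "logged as warnings".

```python
import logging
import os
import re
from typing import Any, Mapping

logger = logging.getLogger(__name__)
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(node: Any, env: Mapping[str, str] = os.environ) -> Any:
    """Recursively replace ${VAR} references inside strings of a parsed config."""
    if isinstance(node, dict):
        return {key: interpolate(value, env) for key, value in node.items()}
    if isinstance(node, list):
        return [interpolate(value, env) for value in node]
    if isinstance(node, str):
        def _substitute(match):
            name = match.group(1)
            if name not in env:
                # Assumed behavior: warn and keep the placeholder unchanged.
                logger.warning("config references unset env var %s", name)
                return match.group(0)
            return env[name]
        return _ENV_REF.sub(_substitute, node)
    return node  # numbers, booleans, None pass through untouched


def get_value(config: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings; return `default` as soon as a key is missing."""
    current = config
    for key in keys:
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current
```

For example, `get_value(cfg, "server", "host", default="0.0.0.0")` returns the fallback instead of raising when the `server` section is absent.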
Data Flow
1. Admin registers tables via /api/admin/register-table or web UI
2. Table metadata stored in DuckDB table_registry (system.duckdb)
3. Scheduler triggers data-refresh (default every 15m)
4. POST /api/sync/trigger invokes each connector's extractor
5. Extractor produces extract.duckdb + parquet files (local) or remote views
6. SyncOrchestrator.rebuild() ATTACHes extract.duckdb files into analytics.duckdb
7. FastAPI serves data via /api/data/{table_id}/download and /api/query
8. Claude Code queries analytics.duckdb via SQL for analysis
Security Model
- Authentication: Google OAuth, email magic link, password, PAT, desktop JWT
- Authorization: Two-layer RBAC — Admin user-group (god mode) + resource-level grants
- Session cookies: Signed via Starlette SessionMiddleware (secret from `SESSION_SECRET`)
- Bootstrap: `SEED_ADMIN_EMAIL` env var seeds first admin at deploy time
- Identifier validation: Shared regex validators prevent SQL injection in table/connector names
- Orchestrator hardening: Extension allowlist, token-env validation, SQL string escaping
- SSRF protection: `_validate_url_not_private()` on admin configure endpoint
- Container: Runs as non-root user `agnes`; Docker resource limits enforced
- TLS: Caddy reverse proxy with security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy)
- Secrets: `${ENV_VAR}` in YAML, actual values in `.env` (gitignored); PATs stored as hashes
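
The two-layer authorization rule from the list above (admin group bypass, then resource-level grants) can be sketched as a small pure function. The group name `Admin`, the grant tuple shape, and both function names are assumptions for illustration; the real checks live in `app/auth/access.py` and `src/rbac.py` and may be structured differently.

```python
from typing import Iterable, List, Set, Tuple

ADMIN_GROUP = "Admin"  # assumed name of the admin user-group

# Hypothetical grant shape: (group, resource_type, resource_id)
Grant = Tuple[str, str, str]


def can_access(
    user_groups: Iterable[str],
    grants: Set[Grant],
    resource_type: str,
    resource_id: str,
) -> bool:
    """Layer 1: admin group sees everything. Layer 2: explicit per-resource grant."""
    groups = set(user_groups)
    if ADMIN_GROUP in groups:
        return True
    return any((group, resource_type, resource_id) in grants for group in groups)


def accessible_ids(
    user_groups: Iterable[str],
    grants: Set[Grant],
    resource_type: str,
    all_ids: Iterable[str],
) -> List[str]:
    """Filter a list of resource ids down to those the user may access."""
    groups = list(user_groups)
    return [rid for rid in all_ids if can_access(groups, grants, resource_type, rid)]
```

Access is denied by default: a user with no matching grant and no admin membership gets `False`.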
Key Patterns
- Connector pattern: `connectors/{name}/extractor.py` produces `extract.duckdb` following the `_meta` + `_remote_attach` contract. Orchestrator auto-discovers and ATTACHes.
- Auth provider pattern: `app/auth/providers/{name}.py` for Google, email, password. Router dispatches based on instance config.
- Repository pattern: `src/repositories/{domain}.py` provides DuckDB-backed CRUD with parameterized queries and `ALLOWED_FIELDS` allowlists.
- Resource type pattern: `app/resource_types.py` holds the `ResourceType` StrEnum + `ResourceTypeSpec` registry. Adding a new type = one enum member + one `list_blocks` delegate + one spec entry. No DB migration.
- Atomic token consumption: Compare-and-swap with `CONSUMED:` marker prevents race conditions on one-shot tokens (magic links, password resets).
- Config interpolation: `${ENV_VAR}` in YAML resolved at load time, missing vars logged as warnings.