# Architecture ## System Overview ``` ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Keboola │ │ BigQuery │ │ Jira │ │ extractor │ │ extractor │ │ webhooks │ │ (DuckDB ext) │ │ (remote BQ) │ │ (incremental)│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │ │ │ ▼ ▼ ▼ extract.duckdb extract.duckdb extract.duckdb + data/*.parquet (views → BQ) + data/*.parquet │ │ │ └─────────────────┼─────────────────┘ ▼ SyncOrchestrator.rebuild() ATTACH → master views in analytics.duckdb │ ┌──────────┼──────────┐ ▼ ▼ ▼ FastAPI CLI (serve) (da sync) ``` Three source types: - **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled - **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ - **Real-time push** (Jira): Webhooks update parquets incrementally ## Components ### 1. Core Engine (`src/`) DuckDB-backed data orchestration and state management. | File | Role | |------|------| | `src/db.py` | DuckDB schema (system.duckdb v14, analytics.duckdb), auto-migration v1→…→v14 | | `src/orchestrator.py` | SyncOrchestrator — ATTACHes extract.duckdb files, rebuilds master views | | `src/orchestrator_security.py` | Extension allowlist, token-env validation, SQL string escaping | | `src/identifier_validation.py` | Shared regex validators for SQL identifiers (used by orchestrator + extractors) | | `src/remote_query.py` | RemoteQueryEngine — hybrid queries joining local + BigQuery data | | `src/repositories/` | DuckDB-backed CRUD (sync_state, table_registry, users, knowledge, etc.) | | `src/profiler.py` | Data profiling for catalog UI | | `src/catalog_export.py` | OpenMetadata catalog export | | `src/scheduler.py` | Schedule parsing (`every 15m`, `daily 03:00`) and `is_table_due()` | | `src/rbac.py` | Dataset-access helpers (`can_access_table`, `get_accessible_tables`) | | `src/marketplace.py` | Marketplace git-clone/sync + plugin manifest parsing | | `src/marketplace_filter.py` | RBAC-filtered plugin resolution for ZIP/git channels | ### 2. FastAPI Application (`app/`) Unified web server for UI + REST API. | File/Dir | Role | |----------|------| | `app/main.py` | FastAPI app setup, router registration, startup hooks | | `app/api/` | REST API endpoints (sync, data, catalog, admin, auth, query, memory, etc.) | | `app/auth/` | Authentication — router, dependencies, PAT resolver, group sync | | `app/auth/providers/` | Auth providers: Google OAuth, email magic link, password | | `app/web/` | HTML dashboard routes + Jinja2 templates | | `app/resource_types.py` | `ResourceType` StrEnum + `RESOURCE_TYPES` registry for RBAC | ### 3. Connectors (`connectors/`) Each connector produces an `extract.duckdb` following a standard contract. | Directory | Source Type | Mechanism | |-----------|-------------|-----------| | `connectors/keboola/` | Batch pull | DuckDB Keboola extension → parquet files | | `connectors/bigquery/` | Remote attach | DuckDB BQ extension → views to BigQuery | | `connectors/jira/` | Real-time push | Webhooks → incremental parquet updates | | `connectors/openmetadata/` | Catalog | httpx client to OpenMetadata API | | `connectors/llm/` | LLM routing | OpenAI-compatible API client | #### extract.duckdb Contract Every connector outputs to `/data/extracts/{source_name}/`: ``` /data/extracts/{source_name}/ ├── extract.duckdb ← _meta table + views └── data/ ← parquet files (local sources only) ``` The `_meta` table (required): ```sql CREATE TABLE _meta ( table_name VARCHAR, description VARCHAR, rows INTEGER, size_bytes INTEGER, extracted_at TIMESTAMP, query_mode VARCHAR -- 'local' or 'remote' ); ``` Remote tables (`query_mode='remote'`) must also include `_remote_attach`: ```sql CREATE TABLE _remote_attach ( alias VARCHAR, -- DuckDB alias used in views, e.g. 'kbc' extension VARCHAR, -- Extension name, e.g. 'keboola' url VARCHAR, -- Connection URL token_env VARCHAR -- Env-var name holding the auth token (NOT the token itself) ); ``` The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into the master `analytics.duckdb`, and creates views. For remote tables, it reads `_remote_attach`, installs/loads the extension, reads the token from the environment, and ATTACHes the external source. ### 4. CLI (`cli/`) Command-line tool `da` for sync, query, and admin operations. | Command | Role | |---------|------| | `da sync` | Trigger data sync | | `agnes query` | Run SQL against analytics.duckdb | | `agnes admin group *` | Manage user groups | | `agnes admin grant *` | Manage resource grants | | `agnes admin register-table` | Register tables in table_registry | | `agnes admin break-glass ` | Emergency admin access recovery | | `da tokens *` | Manage personal access tokens | | `da metrics *` | Business metric definitions | | `agnes skills *` | List/show bundled skills | ### 5. Authentication (`app/auth/`) FastAPI-based auth with pluggable providers. | File | Role | |------|------| | `app/auth/router.py` | Auth routes (login, callback, bootstrap, token) | | `app/auth/providers/google.py` | Google OAuth + Workspace group sync | | `app/auth/providers/email.py` | Email magic link (atomic compare-and-swap consumption) | | `app/auth/providers/password.py` | Password login + reset (with audit logging) | | `app/auth/pat_resolver.py` | Personal Access Token validation (hash, expiry, revocation, IP audit) | | `app/auth/access.py` | Authorization: `require_admin`, `require_resource_access` | | `app/auth/group_sync.py` | `fetch_user_groups()` — Cloud Identity API client | | `app/auth/dependencies.py` | `get_current_user` FastAPI dependency | | `app/auth/jwt.py` | Desktop JWT auth (API-only) | ### 6. Standalone Services (`services/`) Self-contained services with own `__main__.py`, run via Docker Compose profiles. | Directory | Role | |-----------|------| | `services/scheduler/` | Cron-like job runner (data-refresh, health-check, marketplaces) | | `services/telegram_bot/` | Telegram notification bot + dispatch (opt-in, `--profile full`) | | `services/ws_gateway/` | WebSocket gateway for desktop app | | `services/corporate_memory/` | AI knowledge aggregation from analyst sessions | | `services/session_collector/` | Claude Code session metadata collector | ### 7. Configuration (`config/`) | File | Role | |------|------| | `config/instance.yaml.example` | Template with all options | | `config/loader.py` | YAML loader with `${ENV_VAR}` interpolation + required-field validation | | `config/.env.template` | Secret variable placeholders | Table definitions are stored in DuckDB `table_registry` table (not in config files). ## Config Loading Chain ``` config/instance.yaml | (loaded by config/loader.py) | (${ENV_VAR} references resolved from .env / environment) | (required fields validated: instance.name, auth.allowed_domain, server.host, server.hostname) v app/instance_config.py | (get_value() for safe nested access) v FastAPI app + templates ``` ## Data Flow ``` 1. Admin registers tables via /api/admin/register-table or web UI 2. Table metadata stored in DuckDB table_registry (system.duckdb) 3. Scheduler triggers data-refresh (default every 15m) 4. POST /api/sync/trigger invokes each connector's extractor 5. Extractor produces extract.duckdb + parquet files (local) or remote views 6. SyncOrchestrator.rebuild() ATTACHes extract.duckdb files into analytics.duckdb 7. FastAPI serves data via /api/data/{table_id}/download and /api/query 8. Claude Code queries analytics.duckdb via SQL for analysis ``` ## Security Model - **Authentication**: Google OAuth, email magic link, password, PAT, desktop JWT - **Authorization**: Two-layer RBAC — Admin user-group (god mode) + resource-level grants - **Session cookies**: Signed via Starlette SessionMiddleware (secret from `SESSION_SECRET`) - **Bootstrap**: `SEED_ADMIN_EMAIL` env var seeds first admin at deploy time - **Identifier validation**: Shared regex validators prevent SQL injection in table/connector names - **Orchestrator hardening**: Extension allowlist, token-env validation, SQL string escaping - **SSRF protection**: `_validate_url_not_private()` on admin configure endpoint - **Container**: Runs as non-root user `agnes`; Docker resource limits enforced - **TLS**: Caddy reverse proxy with security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy) - **Secrets**: `${ENV_VAR}` in YAML, actual values in `.env` (gitignored); PATs stored as hashes ## Key Patterns - **Connector pattern**: `connectors/{name}/extractor.py` produces `extract.duckdb` following the `_meta` + `_remote_attach` contract. Orchestrator auto-discovers and ATTACHes. - **Auth provider pattern**: `app/auth/providers/{name}.py` — Google, email, password. Router dispatches based on instance config. - **Repository pattern**: `src/repositories/{domain}.py` — DuckDB-backed CRUD with parameterized queries and `ALLOWED_FIELDS` allowlists. - **Resource type pattern**: `app/resource_types.py` — `ResourceType` StrEnum + `ResourceTypeSpec` registry. Adding a new type = one enum member + one `list_blocks` delegate + one spec entry. No DB migration. - **Atomic token consumption**: Compare-and-swap with `CONSUMED:` marker prevents race conditions on one-shot tokens (magic links, password resets). - **Config interpolation**: `${ENV_VAR}` in YAML resolved at load time, missing vars logged as warnings.