---
name: agnes-orchestrator
description: Rules for the SyncOrchestrator, the extract.duckdb ATTACH flow, query_mode semantics (local / remote / materialized), and when to call rebuild() vs rebuild_source(). Use when editing src/orchestrator.py, src/db.py, or anything that produces extract.duckdb in connectors/.
---

# Agnes orchestrator

Source of truth for orchestrator invariants. See `CLAUDE.md § Architecture` and `docs/architecture.md` for the canonical description.

## ATTACH flow

`SyncOrchestrator.rebuild()` scans `/data/extracts/*/extract.duckdb`, ATTACHes each into the master `analytics.duckdb`, creates views like `."".""`, and updates `sync_state`.

Per-source rebuild is `rebuild_source(name)`, used after Jira webhooks where only one source changed. Full `rebuild()` is the fallback when scope is unclear.

## Thread safety

All write paths take `self._rebuild_lock` (a `threading.Lock`). New write paths (anything that DETACHes / re-ATTACHes / updates `sync_state`) MUST hold the lock. Read paths must not hold it.

## query_mode

Every table has a `query_mode` in its `_meta` row:

- `local`: batch-pulled to parquet, queried locally. Parquets live under `/data/extracts//data/`. Synced via `agnes pull`.
- `remote`: queried against the upstream (e.g., BigQuery) at query time. No parquet on disk. Requires a `_remote_attach` row in `extract.duckdb`.
- `materialized`: admin-registered SQL run by the scheduler. Result lands as a parquet under `/data/extracts//data/`. Distributed like `local`.
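
The three `query_mode` values can be checked with a small helper. The following is a minimal sketch, not the code in `src/orchestrator.py`: `QueryMode`, `has_local_parquet`, and `check_table` are hypothetical names introduced here for illustration, and the only rules encoded are the ones stated above (parquet on disk for `local` and `materialized`, a `_remote_attach` row for `remote`).

```python
from enum import Enum
from typing import List


class QueryMode(str, Enum):
    """The three query_mode values a table's _meta row may carry."""
    LOCAL = "local"
    REMOTE = "remote"
    MATERIALIZED = "materialized"


def has_local_parquet(mode: QueryMode) -> bool:
    """local and materialized tables land as parquet on disk; remote tables do not."""
    return mode in (QueryMode.LOCAL, QueryMode.MATERIALIZED)


def check_table(mode_value: str, has_remote_attach_row: bool) -> List[str]:
    """Return the problems found for one table (hypothetical helper).

    Raises ValueError if mode_value is not one of the three known modes.
    """
    mode = QueryMode(mode_value)
    problems: List[str] = []
    if mode is QueryMode.REMOTE and not has_remote_attach_row:
        problems.append("remote table without a _remote_attach row")
    return problems
```

A check like `check_table("remote", has_remote_attach_row=False)` returns one problem, while any `local` or `materialized` table passes regardless of `_remote_attach`.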
## `_remote_attach` mechanism

For `query_mode='remote'` tables, the extractor writes a `_remote_attach` table in `extract.duckdb` with columns:

| column | meaning |
|---|---|
| `alias` | name used in the ATTACH statement |
| `extension` | DuckDB extension to install + load |
| `url` | upstream connection URL |
| `token_env` | env var holding the auth token (`''` if extension-specific auth, e.g., BigQuery's GCE metadata server) |

At query time the orchestrator installs/loads the extension, resolves the token, creates a session-scoped SECRET when required, and ATTACHes the source so views like `kbc."bucket"."table"` resolve.

## Master DB locations

- System DB: `${DATA_DIR}/state/system.duckdb` (sync_state, table_registry, users, RBAC).
- Analytics DB: `${DATA_DIR}/analytics/server.duckdb` (master views).

## Schema migrations

`src/db.py` auto-migrates from `v1 → vN` on startup. Per-version notes live in `CHANGELOG.md`. Adding a schema version means:

1. Bumping the version constant in `src/db.py`.
2. Adding the `vN-1 → vN` migration step.
3. Adding a CHANGELOG bullet that names the version.
4. Updating documentation that references the schema version (search for "schema v" in `docs/` + `CLAUDE.md`).

## Files NOT to modify

`connectors/jira/file_lock.py` and `services/ws_gateway/`: stable infrastructure. (`connectors/jira/transform.py` was previously off-limits but as of 0.54.19 is no longer; it remains sensitive: touch only with end-to-end understanding of the JSON-overlay / parquet-rewrite pipeline.)
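
## Example: resolving a `_remote_attach` row

The query-time sequence described under the `_remote_attach` mechanism (install/load the extension, resolve the token, create a SECRET when required, ATTACH) can be sketched in plain Python. This is an illustration under assumptions, not the orchestrator's code: `RemoteAttachRow`, `resolve_token`, and `plan_attach` are hypothetical names, the steps are returned as labels rather than executed against DuckDB, and "a SECRET is required exactly when a token is present" is an assumption made for the sketch.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional, Tuple


@dataclass(frozen=True)
class RemoteAttachRow:
    """One row of `_remote_attach`, with the four documented columns."""
    alias: str
    extension: str
    url: str
    token_env: str  # '' means the extension handles auth itself


def resolve_token(row: RemoteAttachRow,
                  env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the auth token, or None when token_env is ''."""
    if row.token_env == "":
        return None
    token = env.get(row.token_env)
    if not token:
        raise RuntimeError(
            f"env var {row.token_env!r} is not set for alias {row.alias!r}"
        )
    return token


def plan_attach(row: RemoteAttachRow,
                env: Mapping[str, str] = os.environ) -> List[Tuple[str, str]]:
    """List the steps to run for one row, in order (labels only, no SQL)."""
    token = resolve_token(row, env)
    steps = [("install_and_load", row.extension)]
    if token is not None:
        # Assumption: a session-scoped SECRET is needed only with a token.
        steps.append(("create_secret", row.alias))
    steps.append(("attach", row.alias))
    return steps
```

Passing `env` explicitly keeps the token lookup testable without touching the real process environment; a missing variable fails before any ATTACH step is planned.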