agnes-the-ai-analyst/.claude/skills/agnes-connectors.md
ZdenekSrotyr 650ea3c804
feat: Agnes specialist agents and skills under .claude/ (#328)
Four knowledge skills auto-load into the main agent's context when
their description matches the work; invokable explicitly via
Skill(<name>):

- agnes-orchestrator — extract.duckdb ATTACH flow, query_mode
  semantics, _remote_attach, rebuild lock
- agnes-rbac — require_admin vs require_resource_access,
  ResourceType registration
- agnes-connectors — _meta contract, three connector shapes,
  new-connector checklist
- agnes-release-process — CHANGELOG discipline, release-cut,
  version bump, post-merge auto-rollback

Three reviewer subagents fire in parallel at end of PR work; one
releaser subagent handles pre-merge release-cut + post-merge tag /
GitHub Release:

- agnes-reviewer-rules — CHANGELOG bullet, vendor-agnostic scan,
  AI attribution, commit hygiene (always fires)
- agnes-reviewer-rbac — endpoint gates, ResourceType registration
  (fires on app/api/, app/auth/ diffs)
- agnes-reviewer-architecture — extract.duckdb invariants, schema
  migrations, rebuild lock (fires on src/, connectors/ diffs)
- agnes-releaser — Phase 1 pre-merge release-cut commit; Phase 2
  post-merge tag + GitHub Release

.gitignore un-ignores .claude/agents/ and .claude/skills/ while
keeping the rest of .claude/ local-only. CLAUDE.md gets a new
'Specialized agents and skills' section pointing at the two
directories.

Source of truth for the rules these encode remains CLAUDE.md +
docs/RELEASING.md — skills explicitly defer to the master docs on
conflict.

Design rationale: docs/superpowers/specs/2026-05-15-agnes-agents-design.md
Implementation plan: docs/superpowers/plans/2026-05-15-agnes-agents.md
2026-05-15 20:39:11 +02:00


---
name: agnes-connectors
description: Rules for the extract.duckdb contract every data source must produce — the _meta table, the _remote_attach mechanism for remote-mode tables, parquet layout, and the pattern for adding a new connector. Use when adding a new data source or modifying an existing extractor in connectors/.
---
# Agnes connectors — the extract.duckdb contract
Every data source produces the same output:

    /data/extracts/{source_name}/
    ├── extract.duckdb   ← _meta table + views
    └── data/            ← parquet files (local sources only)

See `CLAUDE.md § Architecture: extract.duckdb Contract` and
`docs/architecture.md`.
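
A minimal sketch of a layout check for the directory structure above, assuming parquet files sit directly under `data/`. The function name `check_extract_layout` and its `has_local_tables` flag are hypothetical and not part of the Agnes codebase.

```python
from pathlib import Path


def check_extract_layout(source_dir: Path, has_local_tables: bool) -> list[str]:
    """Return a list of layout problems; an empty list means nothing was found.

    Hypothetical helper: only checks the files named in the contract above.
    """
    problems = []
    if not (source_dir / "extract.duckdb").is_file():
        problems.append("missing extract.duckdb")
    if has_local_tables:
        data_dir = source_dir / "data"
        # Local sources are expected to keep parquet files under data/.
        if not data_dir.is_dir() or not any(data_dir.glob("*.parquet")):
            problems.append("no parquet files under data/")
    return problems
```

A connector test could call this on the extractor's output directory and assert that the returned list is empty.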
## Required `_meta` table
Every `extract.duckdb` MUST contain a `_meta` table with these columns:
| column | type | meaning |
|---|---|---|
| `table_name` | VARCHAR | name used in views |
| `description` | VARCHAR | human-readable description |
| `rows` | BIGINT | row count at extraction time |
| `size_bytes` | BIGINT | parquet size for local mode, 0 for remote |
| `extracted_at` | TIMESTAMP | extraction time |
| `query_mode` | VARCHAR | one of `local`, `remote`, `materialized` |
If `_meta` is missing or malformed, `SyncOrchestrator.rebuild()` logs an error
and skips the source. Tests for new connectors MUST assert that `_meta` is
well-formed.
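
One way a connector test might check the schema, sketched under assumptions: "well-formed" is taken here to mean that all six columns exist with the listed types and that every `query_mode` is one of the three allowed values. Extra columns are tolerated, which is a guess. The helper names are hypothetical. The input is a list of `(column_name, column_type)` pairs, such as the first two fields of each row returned by DuckDB's `DESCRIBE _meta`.

```python
# Expected _meta schema, copied from the table above.
EXPECTED_META_COLUMNS = {
    "table_name": "VARCHAR",
    "description": "VARCHAR",
    "rows": "BIGINT",
    "size_bytes": "BIGINT",
    "extracted_at": "TIMESTAMP",
    "query_mode": "VARCHAR",
}
ALLOWED_QUERY_MODES = {"local", "remote", "materialized"}


def meta_schema_problems(described):
    """Compare (column_name, column_type) pairs against the expected schema."""
    actual = {row[0]: row[1] for row in described}
    problems = []
    for name, expected_type in EXPECTED_META_COLUMNS.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != expected_type:
            problems.append(f"{name}: expected {expected_type}, got {actual[name]}")
    return problems


def query_mode_problems(modes):
    """Flag query_mode values outside the three allowed ones."""
    unexpected = set(modes) - ALLOWED_QUERY_MODES
    return sorted(f"invalid query_mode: {m!r}" for m in unexpected)
```

In a fixture-based test, both functions would be expected to return an empty list for the extractor's `extract.duckdb`.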
## Four connector shapes
- **Batch pull** (Keboola, `query_mode='local'`): a DuckDB extension downloads
  data to parquet on a schedule. The extractor lives in
  `connectors/<name>/extractor.py`.
- **Remote attach** (BigQuery, `query_mode='remote'`): uses the DuckDB BigQuery
  extension with no download; queries hit the upstream at query time. Requires
  `_remote_attach`.
- **Materialized SQL** (`query_mode='materialized'`): the scheduler runs
  admin-registered SQL through DuckDB and writes the result to a parquet file
  under `/data/extracts/<source>/data/`. Distributed via the same manifest +
  `agnes pull` flow as `local`. BigQuery cost guardrail:
  `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; `0`
  disables the guardrail).
- **Real-time push** (Jira): webhooks update parquet files incrementally; the
  webhook handler triggers `rebuild_source('jira')`.
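
The cost guardrail semantics for materialized sources can be illustrated with a small sketch. It assumes a byte estimate is available before the query runs (how Agnes obtains it is not stated here) and that a run is refused only when the estimate is strictly greater than the limit. The function name is hypothetical.

```python
GIB = 1024 ** 3
DEFAULT_MAX_BYTES_PER_MATERIALIZE = 10 * GIB  # default stated in the list above


def exceeds_materialize_guardrail(
    estimated_bytes: int,
    max_bytes: int = DEFAULT_MAX_BYTES_PER_MATERIALIZE,
) -> bool:
    """Return True when a materialize run should be refused.

    A limit of 0 disables the guardrail. The strict '>' comparison is an
    assumption of this sketch, not confirmed behaviour.
    """
    if max_bytes == 0:
        return False
    return estimated_bytes > max_bytes
```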
## `_remote_attach` table (remote mode only)
For each remote-mode table in `_meta`, the extractor writes a row to
`_remote_attach` with the columns `alias`, `extension`, `url`, and `token_env`.
See the `agnes-orchestrator` skill for how the orchestrator consumes it.
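
A hedged sketch of how an extractor might build such a row. The class `RemoteAttachRow`, the rule that no field may be empty, and the reading of `token_env` as the name of an environment variable (rather than the token itself) are assumptions inferred from the column names, not confirmed by the master docs.

```python
from dataclasses import asdict, astuple, dataclass

# Parameterized insert; DuckDB's Python API accepts '?' placeholders,
# e.g. con.execute(INSERT_SQL, row.params()).
INSERT_SQL = (
    "INSERT INTO _remote_attach (alias, extension, url, token_env) "
    "VALUES (?, ?, ?, ?)"
)


@dataclass(frozen=True)
class RemoteAttachRow:
    alias: str
    extension: str
    url: str
    token_env: str  # assumed: the NAME of an env var, never the secret itself

    def params(self) -> tuple:
        """Return the values in column order, rejecting empty fields."""
        empty = [name for name, value in asdict(self).items() if not value]
        if empty:
            raise ValueError(f"empty _remote_attach fields: {', '.join(empty)}")
        return astuple(self)
```

Storing only the variable name would keep credentials out of `extract.duckdb`, if that reading of `token_env` is correct.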
## Adding a new connector — checklist
1. Create `connectors/<name>/extractor.py` that emits `extract.duckdb` (+
`data/*.parquet` if local) into `/data/extracts/<name>/`.
2. Populate `_meta` with one row per table.
3. If any table is `query_mode='remote'`, populate `_remote_attach`.
4. Register the connector type in the catalog (search for existing
`source_type` values to follow the pattern).
5. Add a fixture-based test that runs the extractor against a fixture
upstream and asserts `_meta` is complete.
6. CHANGELOG bullet under `Added` per `agnes-release-process`.
## Stable infrastructure — do NOT modify
`connectors/jira/file_lock.py`. (`connectors/jira/transform.py` was
previously off-limits; as of 0.54.19 it no longer is, but it remains
sensitive. Touch it only with an end-to-end understanding of the
JSON-overlay / parquet-rewrite pipeline.)