chore(docs): replace stale da verbs and vendor-specific install paths

Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.

Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.
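
A minimal sketch of the lookup order described above, assuming the first existing file wins. `candidate_env_files` and `first_existing` are illustrative names, not functions from the repository:

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def candidate_env_files(
    script_file: Path, environ: Optional[Mapping[str, str]] = None
) -> List[Path]:
    """Return .env candidates in priority order, with the override first."""
    env = os.environ if environ is None else environ
    candidates: List[Path] = []
    override = env.get("AGNES_ENV_FILE")
    if override:  # unset or empty string means "no override"
        candidates.append(Path(override))
    candidates.append(Path.cwd() / ".env")
    candidates.append(script_file.parent.parent / ".env")
    return candidates


def first_existing(paths: List[Path]) -> Optional[Path]:
    """Return the first candidate that exists on disk, or None."""
    return next((p for p in paths if p.exists()), None)
```

With this ordering a deployment would export `AGNES_ENV_FILE` before invoking a script, and the working-directory and package-relative `.env` files remain fallbacks.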

CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.
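
To illustrate why those tuples must keep the old verbs, here is a hedged sketch of marker-based hook classification. The marker values and the `classify_hook_command` helper are assumptions for illustration; the real logic in `cli/lib/hooks.py` may differ:

```python
# Illustrative values only; the real tuples live in cli/lib/hooks.py and
# app/api/claude_md.py and may contain different strings.
LEGACY_MARKERS = ("da sync", "da fetch")
CURRENT_MARKERS = ("agnes pull", "agnes push")


def classify_hook_command(command: str) -> str:
    """Label a hook command as 'legacy', 'current', or 'custom'."""
    if any(marker in command for marker in LEGACY_MARKERS):
        return "legacy"  # written by an older release, candidate for upgrade
    if any(marker in command for marker in CURRENT_MARKERS):
        return "current"  # already uses the new verbs
    return "custom"  # treated as a user override and left alone
```

If the legacy strings were swept along with the docs, a hook containing `da sync` would fall through to `custom` and never be upgraded.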
ZdenekSrotyr 2026-05-04 21:22:19 +02:00
parent 976d0c7160
commit 8233c3e3f9
24 changed files with 89 additions and 115 deletions

@@ -55,6 +55,7 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
 - `tests/test_reader_smoke_matrix.py` — load-bearing parametrized test: every reader CLI command runs on a freshly-bootstrapped zero-grants workspace without a Python traceback.
 - `tests/test_clean_install_integration.py` — end-to-end happy-path tests (minimal grants, zero grants, force preserves CLAUDE.local.md, readers in pre-init dir).
 - `docs/RELEASE_CHECKLIST.md` — manual clean-install protocol mandated for any PR touching the bootstrap path.
+- Audited and replaced stale `da` verbs left over from prior merges in admin UI text, audit-log messages, code comments, operator runbooks, analyst-facing skill docs, and test docstrings (welcome template renderer/API tests now assert exact emitted markers — `agnes init` for analyst flow, `agnes auth` for admin flow — with explicit absence checks on legacy verbs). Vendor-specific `/opt/data-analyst/` install paths in jira backfill/consistency scripts and operator docs replaced with `<install-dir>/` and an `AGNES_ENV_FILE` env-var override. Intentional stale-marker tuples (`_LEGACY_STRINGS` in `app/api/claude_md.py`, `_OUR_COMMAND_MARKERS` in `cli/lib/hooks.py`) and tests that seed legacy hook content (`tests/test_lib_hooks.py`, `tests/test_legacy_strings_scan.py`) are preserved by design.
 ## [0.33.0] — 2026-05-04

@@ -19,8 +19,8 @@ ssh user@your-server-ip
 ### 2. Clone the repository
 ```bash
-git clone https://github.com/keboola/agnes-the-ai-analyst.git /opt/data-analyst
-cd /opt/data-analyst
+git clone https://github.com/keboola/agnes-the-ai-analyst.git <install-dir>
+cd <install-dir>
 git checkout main
 ```

@@ -257,7 +257,7 @@ T+Xsec Analyst: Query with DuckDB - sees latest data
 ### Server Environment Variables
-In `/opt/data-analyst/.env`:
+In `<install-dir>/.env` (typically the directory you run `docker compose` from):
 ```bash
 # Jira webhook integration
@@ -364,7 +364,7 @@ Response:
 ```bash
 # Webapp logs (webhook processing)
-tail -f /opt/data-analyst/logs/webapp-error.log | grep -i jira
+docker compose logs app --tail 200 | grep -i jira
 # Recent webhook events
 ls -lt /data/src_data/raw/jira/webhook_events/ | head -20
@@ -535,67 +535,26 @@ WHERE first_response_elapsed_millis IS NOT NULL
 ## Analyst Sync Configuration
-Jira data is an **optional dataset** - not synced by default to save bandwidth.
-**Enable Jira sync:**
-```bash
-# Edit local config (created on first sync_data.sh run)
-nano ~/.config/data-analyst/sync.yaml
-# Change:
-datasets:
-jira: true # Enable parquet data (~50MB)
-jira_attachments: false # Keep false unless you need actual files
-```
-**Then sync:**
-```bash
-bash server/scripts/sync_data.sh
-```
+Whether an analyst sees Jira tables locally is decided server-side: an admin
+must register the Jira tables and grant the analyst's group access via
+`resource_grants(resource_type='table')`. Once granted, the manifest
+advertises the tables and `agnes pull` downloads the parquets to the
+analyst's workspace on the next session.
 DuckDB views for Jira tables are created automatically if data exists:
-- `jira_issues` - main issues table
-- `jira_comments` - issue comments
-- `jira_attachments` - attachment metadata (filenames, sizes, URLs)
-- `jira_changelog` - field change history
-- `jira_issuelinks` - links between issues (blocks, duplicates, relates to)
-- `jira_remote_links` - external links (Confluence, Slack, etc.)
+- `jira_issues` — main issues table
+- `jira_comments` — issue comments
+- `jira_attachments` — attachment metadata (filenames, sizes, URLs)
+- `jira_changelog` — field change history
+- `jira_issuelinks` — links between issues (blocks, duplicates, relates to)
+- `jira_remote_links` — external links (Confluence, Slack, etc.)
 ## Attachment Access
-Attachments (images, logs, PDFs) are stored separately from parquet data.
-### Option 1: Download per-ticket (recommended)
-Download attachments for a specific ticket to local temp folder:
-```bash
-# Download all attachments for one ticket
-rsync -avz data-analyst:server/jira_attachments/SUPPORT-1234/ /tmp/SUPPORT-1234/
-# View locally
-ls /tmp/SUPPORT-1234/
-open /tmp/SUPPORT-1234/screenshot.png # macOS
-```
-This is fast (only downloads files for one ticket) and keeps your local machine clean.
-### Option 2: Sync attachments locally (for heavy analysis)
-If you need frequent access to attachments, enable full sync:
-```yaml
-# ~/.config/data-analyst/sync.yaml
-datasets:
-jira: true
-jira_attachments: true # Syncs ~500MB+ of files
-```
-Then `sync_data.sh` will rsync attachments to `./server/jira_attachments/`.
-### Finding attachment path from parquet
-The `jira_attachments` table has a `local_path` column with the server path:
+Attachments (images, logs, PDFs) are stored on the server alongside parquet
+data and are **not** distributed via `agnes pull` (the manifest only
+advertises parquet tables). The `jira_attachments` table has a `local_path`
+column with the server-side filesystem path:
 ```sql
 SELECT
@@ -613,7 +572,11 @@ issue_key | filename | local_path
 SUPPORT-1234 | screenshot.png | /data/src_data/raw/jira/attachments/SUPPORT-1234/... | 45678
 ```
-To access locally (if synced): replace `/data/src_data/raw/jira/attachments/` with `./server/jira_attachments/`.
+To pull the actual file to a workstation, operators with SSH access to the
+host can `scp` / `rsync` from the path above. Public OSS does not ship a
+client-side attachment-fetch primitive — wire one up per deployment if
+attachment access is required (e.g. a thin admin endpoint that streams the
+file with the same RBAC gate as the parquet table).
 ## Future Improvements
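
The "thin admin endpoint" suggested in the new README text could start from a guard like the one below. This is a sketch under stated assumptions: `resolve_attachment` and the `user_can_read_table` callback are hypothetical, and only the attachments root path is taken from the README example.

```python
from pathlib import Path
from typing import Callable

# Root taken from the README example; adjust per deployment.
ATTACHMENTS_ROOT = Path("/data/src_data/raw/jira/attachments")


def resolve_attachment(
    local_path: str,
    user_can_read_table: Callable[[str], bool],
    root: Path = ATTACHMENTS_ROOT,
) -> Path:
    """Return a safe absolute path for an attachment, or raise."""
    # Reuse the table-level grant as the gate for the file itself.
    if not user_can_read_table("jira_attachments"):
        raise PermissionError("no grant on jira_attachments")
    candidate = Path(local_path).resolve()
    # Reject anything that resolves outside the attachments root.
    if root.resolve() not in candidate.parents:
        raise ValueError("path escapes the attachments root")
    return candidate
```

An endpoint would call this with the `local_path` value from the `jira_attachments` table and stream the returned file.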

@@ -6,7 +6,7 @@ Downloads all issues from Jira using JQL search with pagination.
 Reuses the webapp's JiraService for consistent data handling.
 Usage:
-# On server (uses /opt/data-analyst/.env):
+# On server (loads .env from <install-dir>/.env or the current directory):
 python -m connectors.jira.scripts.backfill
 # With custom settings:
@@ -58,12 +58,15 @@ class Config:
     @classmethod
     def from_env(cls) -> "Config":
         """Load configuration from environment variables."""
-        # Try to load .env file from common locations
+        # Try to load .env file from common locations.
+        # Customer-specific install paths (e.g. /opt/<deployment>/.env) can be
+        # injected via the AGNES_ENV_FILE env var without editing this list.
         env_paths = [
-            Path("/opt/data-analyst/.env"),
+            Path(os.environ["AGNES_ENV_FILE"]) if os.environ.get("AGNES_ENV_FILE") else None,
             Path.cwd() / ".env",
             Path(__file__).parent.parent / ".env",
         ]
+        env_paths = [p for p in env_paths if p is not None]
         for env_path in env_paths:
             if env_path.exists():
                 load_dotenv(env_path)

@@ -7,7 +7,7 @@ and embeds them into existing issue JSON files. This enables the
 Parquet transform to extract remote_links table data.
 Usage:
-# On server (uses /opt/data-analyst/.env):
+# On server (loads .env from <install-dir>/.env or the current directory):
 python -m connectors.jira.scripts.backfill_remote_links
 # With parallel workers:
@@ -44,11 +44,14 @@ logger = logging.getLogger(__name__)
 def load_config() -> dict:
     """Load configuration from environment variables."""
+    # Customer-specific install paths (e.g. /opt/<deployment>/.env) can be
+    # injected via the AGNES_ENV_FILE env var without editing this list.
     env_paths = [
-        Path("/opt/data-analyst/.env"),
+        Path(os.environ["AGNES_ENV_FILE"]) if os.environ.get("AGNES_ENV_FILE") else None,
         Path.cwd() / ".env",
         Path(__file__).parent.parent / ".env",
     ]
+    env_paths = [p for p in env_paths if p is not None]
     for env_path in env_paths:
         if env_path.exists():
             load_dotenv(env_path)

@@ -57,11 +57,14 @@ logger = logging.getLogger(__name__)
 def load_config() -> dict:
     """Load configuration from environment variables."""
+    # Customer-specific install paths (e.g. /opt/<deployment>/.env) can be
+    # injected via the AGNES_ENV_FILE env var without editing this list.
     env_paths = [
-        Path("/opt/data-analyst/.env"),
+        Path(os.environ["AGNES_ENV_FILE"]) if os.environ.get("AGNES_ENV_FILE") else None,
         Path.cwd() / ".env",
         Path(__file__).parent.parent / ".env",
     ]
+    env_paths = [p for p in env_paths if p is not None]
     for env_path in env_paths:
         if env_path.exists():
             load_dotenv(env_path)

@@ -72,12 +72,15 @@ class Config:
     @classmethod
     def from_env(cls) -> "Config":
         """Load configuration from environment variables."""
-        # Try to load .env file from common locations
+        # Try to load .env file from common locations.
+        # Customer-specific install paths (e.g. /opt/<deployment>/.env) can be
+        # injected via the AGNES_ENV_FILE env var without editing this list.
         env_paths = [
-            Path("/opt/data-analyst/.env"),
+            Path(os.environ["AGNES_ENV_FILE"]) if os.environ.get("AGNES_ENV_FILE") else None,
             Path.cwd() / ".env",
             Path(__file__).parent.parent / ".env",
         ]
+        env_paths = [p for p in env_paths if p is not None]
         for env_path in env_paths:
             if env_path.exists():
                 load_dotenv(env_path)
@@ -92,8 +95,11 @@ class Config:
         raw_dir = Path(os.environ.get("JIRA_DATA_DIR", "/data/src_data/raw/jira"))
         parquet_dir = Path(os.environ.get("JIRA_PARQUET_DIR", "/data/src_data/parquet/jira"))
-        repo_dir = Path(os.environ.get("REPO_DIR", "/opt/data-analyst/repo"))
-        venv_python = Path(os.environ.get("VENV_PYTHON", "/opt/data-analyst/.venv/bin/python"))
+        # REPO_DIR / VENV_PYTHON have no sensible OSS default — operators
+        # must export them when running this script outside an editable
+        # checkout.
+        repo_dir = Path(os.environ.get("REPO_DIR", str(Path(__file__).resolve().parents[3])))
+        venv_python = Path(os.environ.get("VENV_PYTHON", sys.executable))
         return cls(
             jira_domain=os.environ["JIRA_DOMAIN"],
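
A small worked example of what the new `REPO_DIR` / `VENV_PYTHON` fallbacks evaluate to, assuming the script sits under `connectors/jira/scripts/` as the `python -m connectors.jira.scripts.*` usage lines suggest. The `/srv/agnes` checkout path, the `check.py` filename, and the `resolve_dirs` helper are hypothetical:

```python
from pathlib import PurePosixPath
from typing import Mapping, Tuple


def resolve_dirs(
    environ: Mapping[str, str], script_file: PurePosixPath, interpreter: str
) -> Tuple[PurePosixPath, PurePosixPath]:
    """Mirror the fallback logic: env var first, derived default second."""
    # parents[0] = scripts/, [1] = jira/, [2] = connectors/, [3] = checkout root
    default_repo = script_file.parents[3]
    repo_dir = PurePosixPath(environ.get("REPO_DIR", str(default_repo)))
    venv_python = PurePosixPath(environ.get("VENV_PYTHON", interpreter))
    return repo_dir, venv_python


script = PurePosixPath("/srv/agnes/connectors/jira/scripts/check.py")
repo_dir, venv_python = resolve_dirs({}, script, "/usr/bin/python3")
# With nothing exported, repo_dir is the checkout root and venv_python is
# whatever interpreter is running the script.
```

Exporting `REPO_DIR` or `VENV_PYTHON` takes precedence over both derived values, which matters when the script runs from an installed package rather than a checkout.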

@@ -87,8 +87,6 @@ docker compose up -d
 # Trigger a full sync from the data source
 curl -X POST http://localhost:8000/api/sync/trigger
-# Or via CLI:
-docker compose exec app da sync
 ```
 DuckDB extract files and parquet will be repopulated from Keboola / BigQuery.
@@ -123,8 +121,8 @@ not regenerated — user accounts and table definitions are not recreated by syn
 4. **Clone repo and create .env**:
 ```bash
-git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
-cd /opt/data-analyst
+git clone git@github.com:keboola/agnes-the-ai-analyst.git <install-dir>
+cd <install-dir>
 cp config/.env.template .env
 # Fill in secrets from GitHub Secrets / 1Password
 ```

@@ -88,11 +88,8 @@ the database is unavailable.
 # Via API
 curl -X POST http://localhost:8000/api/sync/trigger
-# Via CLI inside the container
-docker compose exec app da sync
 # Sync a single table
-docker compose exec app da sync --table table_name
+curl -X POST "http://localhost:8000/api/sync/trigger?table=table_name"
 ```
 ### Check sync status
@@ -123,16 +120,16 @@ any destructive operation.
 ```bash
 # List registered tables
-docker compose exec app da admin tables list
+docker compose exec app agnes admin list-tables
 # Register a new table
-docker compose exec app da admin tables add
+docker compose exec app agnes admin register-table
 # User management
-docker compose exec app da admin users list
+docker compose exec app agnes admin list-users
 # Query data directly
-docker compose exec app da query "SELECT * FROM my_table LIMIT 10"
+docker compose exec app agnes query "SELECT * FROM my_table LIMIT 10"
 ```
 ## Application Deployment
@@ -143,7 +140,7 @@ Application is deployed via Docker image. The recommended workflow:
 2. CI builds and pushes a new image
 3. On the server, pull and restart:
 ```bash
-cd /opt/data-analyst
+cd <install-dir>
 docker compose pull
 docker compose up -d
 ```
@@ -154,7 +151,7 @@ To pin a specific image version, set the tag in `docker-compose.yml` before depl
 ```bash
 # Edit .env (never commit this file)
-nano /opt/data-analyst/.env
+nano <install-dir>/.env
 # Restart app to apply changes
 docker compose restart app
@@ -297,7 +294,7 @@ most lock issues.
 docker compose logs app | grep -i "sync\|error\|exception"
 # Verify data source credentials in .env
-docker compose exec app da admin tables list
+docker compose exec app agnes admin list-tables
 ```
 ### Out of disk space
### Out of disk space ### Out of disk space

@@ -31,8 +31,8 @@ agnes query "SELECT 1"
 AGNES_TOKEN: ${{ secrets.AGNES_TOKEN }}
 AGNES_SERVER: https://agnes.example.com
 run: |
-pip install data-analyst
-da sync --all
+uv tool install "$AGNES_SERVER/cli/wheel/agnes.whl"
+agnes pull
 ```
 ## Revoke

@@ -48,7 +48,6 @@
 7. Trigger a data sync:
 ```bash
 curl -X POST http://localhost:8000/api/sync/trigger
-# Or: da sync
 ```
 ## Docker Deployment

@@ -1,23 +1,23 @@
 # Agent Workspace Prompt
 The agent workspace prompt is the `CLAUDE.md` file written to each analyst's
-workspace by `da analyst setup`. It gives Claude Code context about the
+workspace by `agnes init`. It gives Claude Code context about the
 connected instance: available tables (RBAC-filtered), business metrics, installed
 plugins, and operational rules for the analyst.
 ## When is CLAUDE.md written?
-`da analyst setup` fetches `GET /api/welcome` and writes the rendered markdown
+`agnes init` fetches `GET /api/welcome` and writes the rendered markdown
 to `<workspace>/CLAUDE.md` on every run (including `--force` re-initialisation).
 To skip writing CLAUDE.md:
 ```bash
-da analyst setup --server-url https://agnes.example.com --no-claude-md
+agnes init --server-url https://agnes.example.com --no-claude-md
 ```
 **Analysts who ran setup while CLAUDE.md generation was temporarily absent** will
-have their file written on the next `da analyst setup` run. Any existing
+have their file written on the next `agnes init` run. Any existing
 `CLAUDE.md` is overwritten with the current server template.
 The companion `CLAUDE.local.md` (at `.claude/CLAUDE.local.md`) is **never**
@@ -110,5 +110,5 @@ PUT validation time, so the admin is notified immediately.
 Click **Reset to default** in the admin UI, or call
 `DELETE /api/admin/workspace-prompt-template`. The next analyst who runs
-`da analyst setup` will receive the rich default template from
+`agnes init` will receive the rich default template from
 `config/claude_md_template.txt`.

@@ -13,7 +13,7 @@ ai-data-analyst/
 │ ├── auth/ Auth providers (JWT, Google OAuth, email magic link, password)
 │ └── web/ HTML dashboard routes
 ├── services/ Standalone background services (scheduler, telegram_bot, ws_gateway, …)
-├── cli/ CLI tool (da sync, agnes query, agnes admin)
+├── cli/ CLI tool (agnes pull, agnes query, agnes admin)
 ├── scripts/ Utility and migration scripts
 ├── config/ Instance configuration templates
 ├── tests/ Test suite
@@ -185,7 +185,7 @@ POST /api/sync/trigger (admin)
 - Runs admin-registered SQL through the DuckDB BigQuery extension via `BqAccess.duckdb_session()` and writes the result to `/data/extracts/bigquery/data/<id>.parquet` atomically (`<id>.parquet.tmp` → `os.replace`).
 - Triggered by `_run_materialized_pass` in `app/api/sync.py` between custom-connectors and orchestrator rebuild on every `/api/sync/trigger`. Per-table `sync_schedule` honored via `is_table_due()`.
 - Cost guardrail: BQ dry-run via `app.api.v2_scan._bq_dry_run_bytes` (single source of truth for cost-estimate logic). `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; `0` disables). Fail-open when dry-run errors (DuckDB three-part syntax the native BQ client can't parse) — log warning + proceed.
-- Distribution: result parquet rides the same manifest + `da sync` flow as Keboola tables. Per-user RBAC unchanged (`resource_grants(group, ResourceType.TABLE, table_id)`).
+- Distribution: result parquet rides the same manifest + `agnes pull` flow as Keboola tables. Per-user RBAC unchanged (`resource_grants(group, ResourceType.TABLE, table_id)`).
 ### Jira — Real-Time Push
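
The atomic write mentioned in the context lines above (`<id>.parquet.tmp` → `os.replace`) follows a common pattern, sketched here with the standard library only. `write_atomically` is an illustrative name, and the `fsync` call is an added assumption rather than something the architecture doc states:

```python
import os
from pathlib import Path


def write_atomically(target: Path, payload: bytes) -> None:
    """Write payload to target so readers never see a partial file."""
    tmp = target.with_name(target.name + ".tmp")  # e.g. orders.parquet.tmp
    with open(tmp, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # assumption: flush to disk before the swap
    # os.replace swaps the file in one step when both paths share a filesystem.
    os.replace(tmp, target)
```

A reader that opens the target during a sync sees either the previous parquet or the new one, never a truncated file.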

@@ -1,5 +1,5 @@
 version: "2.0"
-description: "Business metrics starter pack. Import with: da metrics import docs/metrics/"
+description: "Business metrics starter pack. Import with: agnes admin metrics import docs/metrics/"
 categories:
 - name: revenue
 folder: revenue/

@@ -169,12 +169,13 @@ diff -r run1 run2 # no differences
 To use sample data on a deployed server (instead of connecting a data adapter):
 ```bash
-# On the server
-cd /opt/data-analyst/repo
+# On the server, from the install directory containing your repo checkout
+# and Python venv (paths vary per deployment):
+cd <install-dir>/repo
 # Generate Parquet files directly using project's ParquetManager
 # (snappy compression, proper column types, metadata embedding)
-/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
+<install-dir>/.venv/bin/python scripts/generate_sample_data.py \
 --size m --format parquet --output /data/src_data/parquet --seed 42
 # Set correct permissions

@@ -78,7 +78,7 @@ Corporate Memory solves this by making institutional knowledge:
 └──────────┬───────────┘
 ┌──────────▼───────────┐
-da sync
+agnes pull
 │ │
 │ .claude/rules/ │
 │ km_<id>.md │ ← one per mandatory item
@@ -238,7 +238,7 @@ The highest-ranked facts enter the agent's context first. Mandatory items bypass
 ### Claude Code integration
-`da sync` writes the bundle as files in `.claude/rules/`:
+`agnes pull` writes the bundle as files in `.claude/rules/`:
 ```
 .claude/rules/
@@ -368,7 +368,7 @@ agnes-the-ai-analyst/
 ├── src/repositories/knowledge.py ← DuckDB CRUD (no SQL in API layer)
 ├── src/db.py ← Schema: knowledge_items + 4 supporting tables
-└── cli/commands/sync.py ← da sync step 7: fetch bundle → write km_*.md
+└── cli/commands/pull.py ← agnes pull step 7: fetch bundle → write km_*.md
 ```
 ---
@@ -401,7 +401,7 @@ An analyst working on sensitive M&A data marks their items as personal. The note
 | | Corporate Memory | Static `CLAUDE.md` | Vector RAG | Fine-tuning |
 |---|---|---|---|---|
-| **Update latency** | Next `da sync` (~minutes) | Manual edit + redeploy | Near-realtime | Days to weeks |
+| **Update latency** | Next `agnes pull` (~minutes) | Manual edit + redeploy | Near-realtime | Days to weeks |
 | **Governance** | Approve / reject / audit | None | None | Training data curation |
 | **Confidence scoring** | Yes (source + decay) | No | Similarity score only | Baked into weights |
 | **Contradiction detection** | Yes (auto, per domain) | No | No | No (invisible) |
@@ -454,7 +454,7 @@ Scans `/data/user_sessions/*.jsonl`, extracts knowledge from unprocessed session
 Corporate Memory is wired into Agnes' sync pipeline automatically:
 ```
-da sync
+agnes pull
 step 1-6: download tables, rebuild DuckDB views
 step 7: fetch /api/memory/bundle → write .claude/rules/km_*.md
 ```

@@ -1,6 +1,6 @@
 """DELETE /api/admin/registry/{id} for materialized rows must remove the
 materialized parquet file too; otherwise sync_state still has the row,
-the manifest still serves it, and `da sync` keeps trying to download
+the manifest still serves it, and `agnes pull` keeps trying to download
 data for a table that no longer has a registry entry. The orchestrator's
 rebuild path additionally skips parquets that lack a matching
 table_registry row, so a transient race (or operator-deleted parquet)
@@ -184,7 +184,7 @@ def test_delete_remote_bq_row_does_not_touch_data_dir(
 def test_delete_clears_sync_state_for_materialized_row(seeded_app, keboola_instance):
     """DELETE must also clear the sync_state row so the manifest stops
-    advertising the dropped table to `da sync`."""
+    advertising the dropped table to `agnes pull`."""
     c = seeded_app["client"]
     token = seeded_app["admin_token"]

@@ -285,7 +285,7 @@ class TestCatalogMetrics:
     def test_catalog_metrics_help(self):
         result = runner.invoke(app, ["catalog", "--help"])
         assert result.exit_code == 0
-        # `agnes catalog --metrics` replaces the old `da metrics list/show`.
+        # `agnes catalog --metrics` lists business-metric definitions.
         assert "metrics" in result.output.lower()
     def test_admin_metrics_help(self):

@@ -1,4 +1,4 @@
-"""Tests for `agnes admin metrics {import,export,validate}` (lifted from `da metrics`)."""
+"""Tests for `agnes admin metrics {import,export,validate}`."""
 from typer.testing import CliRunner

@@ -1,4 +1,4 @@
-"""Tests for `agnes catalog --metrics` (folded from `da metrics list/show`)."""
+"""Tests for `agnes catalog --metrics`."""
 from typer.testing import CliRunner

@@ -1,5 +1,5 @@
 """End-to-end: register a Keboola materialized row -> trigger sync ->
-parquet appears -> manifest serves it -> CLI da sync would download it.
+parquet appears -> manifest serves it -> CLI agnes pull would download it.
 Skipped unless KBC_TEST_URL + KBC_TEST_TOKEN + KBC_TEST_BUCKET +
 KBC_TEST_TABLE are present.
@@ -68,4 +68,4 @@ def test_register_trigger_manifest_path(seeded_app, monkeypatch, tmp_path):
     assert smoke is not None
     assert smoke["source_type"] == "keboola"
     assert smoke["query_mode"] == "local" # materialized parquets surface as local
-    assert smoke["md5"] # has a hash for da sync delta detection
+    assert smoke["md5"] # has a hash for agnes pull delta detection
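
The last assertion above refers to hash-based delta detection. A hedged sketch of how a client could use the manifest's `md5` field follows; `file_md5`, `tables_to_download`, and the manifest shape (table id mapped to hex digest) are assumptions, not the actual `agnes pull` implementation:

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tables_to_download(manifest: Dict[str, str], local_dir: Path) -> List[str]:
    """Return table ids whose local parquet is missing or differs by hash."""
    stale = []
    for table_id, remote_md5 in sorted(manifest.items()):
        local = local_dir / f"{table_id}.parquet"
        if not local.exists() or file_md5(local) != remote_md5:
            stale.append(table_id)
    return stale
```

Under this scheme an empty hash in the manifest never matches a local file, so the table would be downloaded again on every run.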

@@ -23,7 +23,7 @@ def _auth(token: str) -> dict:
 def test_query_materialized_id_not_in_views_returns_helpful_message(seeded_app):
     """An admin querying a materialized id that isn't yet materialized in
     the local analytics.duckdb gets a 400 whose detail names the
-    query_mode and points at `da sync` / direct-BQ-query."""
+    query_mode and points at `agnes pull` / direct-BQ-query."""
     from src.db import get_system_db
     sys_conn = get_system_db()
     try:

@@ -21,7 +21,7 @@ def test_template_has_session_end_upload():
     ends = cfg.get("hooks", {}).get("SessionEnd", [])
     cmds = [h["command"] for entry in ends for h in entry.get("hooks", [])]
     assert any("agnes push" in c for c in cmds), (
-        f"Expected `da sync --upload-only` in SessionEnd, got {cmds}"
+        f"Expected `agnes push` in SessionEnd, got {cmds}"
     )

@@ -156,7 +156,7 @@ def test_materialized_pass_collects_errors_per_row(system_db, stub_bq, tmp_path)
 def test_materialized_pass_records_parquet_hash(system_db, stub_bq, tmp_path):
     """sync_state.hash must be the MD5 of the parquet file — otherwise the
-    manifest reports an empty hash and every da sync re-downloads."""
+    manifest reports an empty hash and every agnes pull re-downloads."""
     repo = TableRegistryRepository(system_db)
     repo.register(
         id="hashed", name="hashed",