# Porting Internal Features to OSS — Design Spec

**Date:** 2026-04-10
**Status:** Approved
**Approach:** Metric-First (A) — metriky → bootstrap → metadata writer

## Context

Comparison of `keboola/internal_ai_data_analyst` (private, Jan 2026) with the OSS version revealed three feature gaps worth porting. Many features initially thought missing (session collector, corporate memory, Jira SLA polling, CI/CD, telegram bot) already exist in OSS.

**Primary user:** Local Claude Code agent analyzing data. Web UI is secondary.

**What's being ported:**
1. Business metrics layer (20+ YAML metrics → DuckDB-backed framework + starter pack)
2. Analyst bootstrap flow (onboarding for analysts connecting to a remote instance)
3. Metadata writer (column descriptions + basetype push back to Keboola Storage API)

**What's NOT being ported:**
- macOS desktop app (narrow use-case, WebSocket gateway covers most needs)
- Linux user management (replaced by DuckDB RBAC)
- rsync distribution (replaced by FastAPI API)
- systemd services (replaced by Docker Compose)

---

## 1. Business Metrics in DuckDB

### 1.1 DuckDB Schema — `metric_definitions` table

New table in `system.duckdb`, added as part of schema migration v3→v4:

```sql
CREATE TABLE metric_definitions (
    id              VARCHAR PRIMARY KEY,     -- 'revenue/mrr'
    name            VARCHAR NOT NULL,        -- 'mrr'
    display_name    VARCHAR NOT NULL,        -- 'Monthly Recurring Revenue'
    category        VARCHAR NOT NULL,        -- 'revenue'
    description     TEXT,
    type            VARCHAR DEFAULT 'sum',   -- sum, count, ratio, comparison
    unit            VARCHAR,                 -- 'USD', 'percentage', 'count'
    grain           VARCHAR DEFAULT 'monthly', -- monthly, weekly, daily
    table_name      VARCHAR,                 -- primary table
    tables          VARCHAR[],               -- for JOIN metrics
    expression      VARCHAR,                 -- 'SUM(total_amount)'
    time_column     VARCHAR,                 -- 'order_date'
    dimensions      VARCHAR[],              -- ['channel', 'region']
    filters         VARCHAR[],              -- descriptive WHERE conditions
    synonyms        VARCHAR[],              -- for NL matching
    notes           VARCHAR[],              -- business rules
    sql             TEXT NOT NULL,           -- canonical SQL query
    sql_variants    JSON,                   -- {"by_channel": "SELECT ...", "by_region": "..."}
    validation      JSON,                   -- {"method": "...", "result": "..."}
    source          VARCHAR DEFAULT 'manual', -- 'yaml_import', 'manual', 'api'
    created_at      TIMESTAMP DEFAULT now(),
    updated_at      TIMESTAMP DEFAULT now()
);
```

### 1.2 Schema Versioning

Added to `src/db.py` as `SCHEMA_VERSION = 4` migration (v3→v4). Migration creates both `metric_definitions` and `column_metadata` tables. Existing data untouched.

### 1.3 Repository (`src/repositories/metrics.py`)

Follows existing pattern from `table_registry.py`:

- `list(category=None)` → all metrics, optionally filtered
- `get(metric_id)` → single metric or None
- `create(**kwargs)` → insert metric
- `update(metric_id, **kwargs)` → update fields
- `delete(metric_id)` → remove metric
- `find_by_table(table_name)` → metrics referencing a table
- `find_by_synonym(term)` → NL matching for Claude Code agent
- `import_from_yaml(yaml_path)` → parse YAML, upsert into DuckDB, return count
- `export_to_yaml(output_dir)` → DuckDB → YAML files, return count

### 1.4 YAML as Seed/Import Format

YAML files in `docs/metrics/` serve as:
- **Starter pack** — 10-15 generic SaaS metrics shipped with the project
- **Import source** — `da metrics import docs/metrics/` loads into DuckDB
- **Export target** — `da metrics export` dumps DuckDB → YAML (sharing, backup, version control)
- **Migration** — on first run after upgrade: detect YAML without DuckDB records → auto-import

Format remains compatible with the internal repo (same fields as `total_revenue.yml`).

### 1.5 Migration Script (`scripts/migrate_metrics_to_duckdb.py`)

1. Scans `docs/metrics/*/*.yml` via glob
2. Parses YAML, maps fields to DuckDB columns
3. `sql_by_*` variants → `sql_variants` JSON
4. INSERT OR REPLACE into `metric_definitions`
5. Idempotent — safe to run repeatedly

Auto-runs during schema migration v3→v4 if YAML files exist.

### 1.6 Metrics Index (`docs/metrics/metrics.yml`)

Master index for the YAML starter pack. After `da metrics import`, DuckDB becomes the source of truth. The YAML index is only used during import to define categories and discover files — it is NOT read at runtime.

```yaml
version: "2.0"
categories:
  - name: revenue
    folder: revenue/
    metrics: [total_revenue, mrr, arr, churn_rate]
  - name: product_usage
    folder: product_usage/
    metrics: [active_users, feature_adoption]
  - name: sales
    folder: sales/
    metrics: [new_customers, upsell_expansion, pipeline_value]
  - name: operations
    folder: operations/
    metrics: [support_resolution_time, infrastructure_cost]
```

### 1.7 Starter Pack Metrics (10-15 generic)

Ported and generalized from internal repo, adapted for generic SaaS data:

| Category | Metric | Internal source |
|---|---|---|
| **Revenue** | `total_revenue` (exists), `mrr`, `arr`, `churn_rate` | mrr.yml, new_arr.yml |
| **Product Usage** | `active_users`, `feature_adoption`, `usage_vs_limit` | usage_value.yml, usage_vs_limit.yml |
| **Sales** | `new_customers`, `upsell_expansion`, `pipeline_value` | upsell_expansion.yml, closed_won.yml |
| **Operations** | `support_resolution_time`, `infrastructure_cost` | resolution_time.yml, infra_cost.yml |

SQL queries are **generic templates** referencing typical tables (`orders`, `subscriptions`, `users`, `tickets`). Users adapt to their schema.

### 1.8 CLI Command `da metrics`

```
da metrics list [--category revenue]     # list from DuckDB
da metrics show revenue/mrr              # detail
da metrics import docs/metrics/          # YAML → DuckDB
da metrics export [--dir ./export/]      # DuckDB → YAML
da metrics validate                      # verify consistency (tables exist?)
da metrics add                           # interactive wizard
```

### 1.9 API Endpoints

```
GET  /api/metrics                        → list categories and metrics
GET  /api/metrics/{category}/{name}      → metric detail
POST /api/admin/metrics                  → create/update metric
DELETE /api/admin/metrics/{id}           → delete metric
POST /api/admin/metrics/import           → YAML upload → DuckDB
```

### 1.10 Profiler Integration

`src/profiler.py` already has `load_metrics()` logic. Wire new `src/repositories/metrics.py` into profiler so `profiles.json` includes metrics assigned to tables. Read from DuckDB instead of scanning YAML.

### 1.11 CLAUDE.md Instructions

Add section to CLAUDE.md:
> Before computing any business metric: `da metrics show {category}/{name}`, read the SQL and business rules, use the canonical SQL from the metric definition.

---

## 2. Analyst Bootstrap Flow

### 2.1 Two Bootstrap Modes

**Server-side** (already exists in `da setup`):
- `da setup init` → `bootstrap` → `test-connection` → `first-sync` → `verify`
- Sets up instance (instance.yaml, .env, Docker)
- No changes needed.

**Analyst-side** (new — equivalent of internal `bootstrap.yaml`):
- Analyst connects local Claude Code to a remote Agnes instance
- Downloads data, initializes DuckDB, sets up CLAUDE.md
- Uses API instead of SSH/rsync

### 2.2 Flow: `da analyst setup`

New command (subcommand of `da setup` or standalone `da analyst`):

```
Step 1: detect_existing_project
  → looks for ./CLAUDE.md with Agnes identifier
  → if found: "Project already set up. Want to resync? (da sync)"
  → if not: continue

Step 2: connect_to_instance
  → asks for instance URL (https://data.acme.com)
  → asks for credentials (email/password or OAuth token)
  → GET /api/health → verify availability
  → POST /auth/token → obtain JWT
  → store token in .env or ~/.agnes/credentials

Step 3: create_workspace
  → creates directory structure:
    ./data/parquet/          ← downloaded data
    ./data/duckdb/           ← local analytics.duckdb
    ./data/metadata/         ← profiles, schema
    ./user/artifacts/        ← analyst work output
    ./user/sessions/         ← Claude Code session logs

Step 4: download_schema_and_metrics
  → GET /api/data/tables → list of available tables
  → GET /api/metrics → all metrics
  → saves as local JSON/YAML cache

Step 5: download_data
  → for each table the user has access to:
    GET /api/data/table/{id}/download → parquet
  → Rich progress bar

Step 6: initialize_duckdb
  → creates local analytics.duckdb
  → CREATE VIEW for each downloaded parquet
  → verify: SELECT count(*) from a few tables

Step 7: generate_claude_md
  → generates CLAUDE.md from template (see 2.3)
  → creates empty CLAUDE.local.md
  → writes .claude/settings.json

Step 8: verify
  → runs test query
  → prints: "Setup complete. X tables, Y metrics, Z rows."
```

### 2.3 CLAUDE.md Template (`config/claude_md_template.txt`)

Generated template for analysts, adapted from internal repo:

```markdown
# {instance_name} — AI Data Analyst

## Rules
- Before computing any business metric: `da metrics show {category}/{name}`
- For current schema: read `data/metadata/schema.json`
- Do not use DESCRIBE/SHOW COLUMNS — read metadata files
- Save work output to `user/artifacts/`

## Metrics Workflow
1. `da metrics list` → identify relevant metric
2. `da metrics show revenue/mrr` → read SQL and rules
3. Use the SQL from the metric, adapt to the question

## Data Sync
- `da sync` → download current data from server
- Data refreshes every {sync_interval}

## Directory Structure
- `data/` — read-only (downloaded from server)
- `user/` — your workspace
- `CLAUDE.local.md` — your personal notes (never overwritten)
```

Placeholders `{instance_name}`, `{sync_interval}` substituted at generation time from instance config.

### 2.4 Returning-Session Detection

On every `da` CLI invocation:
- Check data age (`data/metadata/last_sync.json`)
- If >24h: suggest `da sync`
- If CLAUDE.md missing: suggest `da analyst setup`

### 2.5 Sync Command Extensions

Extensions to existing `da sync`:
```
da sync                    # download updated data from server
da sync --docs-only        # just metadata and metrics (fast)
da sync --upload-local     # upload CLAUDE.local.md to server (corporate memory)
```

---

## 3. Metadata Writer

### 3.1 DuckDB Schema — `column_metadata` table

New table in `system.duckdb` (part of v3→v4 migration alongside `metric_definitions`):

```sql
CREATE TABLE column_metadata (
    table_id        VARCHAR NOT NULL,        -- FK → table_registry.id
    column_name     VARCHAR NOT NULL,
    basetype        VARCHAR,                 -- STRING, INTEGER, NUMERIC, FLOAT, BOOLEAN, DATE, TIMESTAMP
    description     VARCHAR,
    confidence      VARCHAR DEFAULT 'manual', -- high, medium, low, manual
    source          VARCHAR DEFAULT 'manual', -- 'manual', 'ai_enrichment', 'keboola_import'
    updated_at      TIMESTAMP DEFAULT now(),
    PRIMARY KEY (table_id, column_name)
);
```

### 3.2 Workflow (3 phases)

**Phase 1: Discover** — profiler or AI agent analyzes columns
```
da admin metadata discover [--table orders]
  → for each column without description:
    sample 500 rows → heuristics for basetype
    if Claude Code agent: generate descriptions
  → saves as "proposal" JSON (same format as internal repo)
```

**Phase 2: Review** — user reviews proposals
```
da admin metadata review proposals/sales_metadata_20260410.json
  → prints table: column | basetype | description | confidence
  → user can edit or confirm
```

**Phase 3: Apply** — write to DuckDB + optional push to Keboola
```
da admin metadata apply proposals/sales_metadata_20260410.json
  → INSERT/UPDATE into column_metadata in DuckDB
  → --push-to-source: if source_type=keboola, POST to Keboola Storage API
  → --dry-run: just show what would change
```

### 3.3 Push to Keboola Storage API

Ported from `apply_metadata.py`:
- Provider: `"ai-metadata-enrichment"`
- Keys: `KBC.datatype.basetype`, `KBC.description`
- Endpoint: `POST {stack_url}/v2/storage/tables/{table_id}/metadata`
- Token and stack_url from `config/instance.yaml` / env vars (not hardcoded JSON)

Only works for tables with `source_type = 'keboola'` in `table_registry`. For BigQuery/CSV/Jira, metadata is stored locally in DuckDB only.

### 3.4 API Endpoints

```
GET  /api/admin/metadata/{table_id}           → column metadata for table
POST /api/admin/metadata/{table_id}           → save metadata (JSON body)
POST /api/admin/metadata/{table_id}/push      → push to source system
```

### 3.5 Integration

- **Profiler**: `src/profiler.py` enriches `profiles.json` with `column_metadata` from DuckDB
- **Catalog API**: `GET /api/catalog` returns metadata alongside profiles
- **Claude Code agent**: reads metadata via `da admin metadata show {table}` or from `profiles.json`

---

## Implementation Summary

### New Files

| Component | Files |
|---|---|
| **Metrics** | `src/repositories/metrics.py`, `src/metrics.py`, `cli/commands/metrics.py`, `app/api/metrics.py`, `scripts/migrate_metrics_to_duckdb.py`, 10-15 YAML in `docs/metrics/` |
| **Bootstrap** | `cli/commands/analyst.py`, `config/claude_md_template.txt` |
| **Metadata** | `src/repositories/column_metadata.py`, `app/api/metadata.py` (metadata commands added as subcommands of `da admin`) |

### Modified Files

| File | Changes |
|---|---|
| `src/db.py` | SCHEMA_VERSION=4, `metric_definitions` + `column_metadata` tables, v3→v4 migration |
| `src/profiler.py` | Read metrics + column_metadata from DuckDB instead of YAML scan |
| `app/main.py` | Register metrics + metadata routers |
| `cli/main.py` | Register `metrics` + `analyst` commands |
| `cli/commands/sync.py` | `--docs-only`, `--upload-local` flags |
| `CLAUDE.md` | Metrics workflow instructions |

### Schema Migration v3→v4

Single migration creating both tables. Auto-imports existing YAML metrics if found. Idempotent.

### Implementation Order

1. Schema v4 + metrics (framework + starter pack + CLI + API)
2. Bootstrap flow (analyst setup + CLAUDE.md template)
3. Metadata writer (discover + apply + Keboola push)

### Test Coverage

Each component gets its own test file following existing patterns:
- `tests/test_metrics.py` — repository CRUD, YAML import/export, API endpoints
- `tests/test_analyst_bootstrap.py` — setup flow (mocked API calls)
- `tests/test_column_metadata.py` — repository CRUD, proposal format, Keboola push (mocked)