agnes-the-ai-analyst/docs/superpowers/specs/2026-04-10-porting-internal-features-design.md
ZdenekSrotyr 1ce632bc0b docs: add design spec for porting internal features to OSS
Covers business metrics in DuckDB, analyst bootstrap flow,
and metadata writer — based on comparison with internal repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:49:34 +02:00

14 KiB

Porting Internal Features to OSS — Design Spec

Date: 2026-04-10 Status: Approved Approach: Metric-First (A) — metriky → bootstrap → metadata writer

Context

Comparison of keboola/internal_ai_data_analyst (private, Jan 2026) with the OSS version revealed three feature gaps worth porting. Many features initially thought missing (session collector, corporate memory, Jira SLA polling, CI/CD, telegram bot) already exist in OSS.

Primary user: Local Claude Code agent analyzing data. Web UI is secondary.

What's being ported:

  1. Business metrics layer (20+ YAML metrics → DuckDB-backed framework + starter pack)
  2. Analyst bootstrap flow (onboarding for analysts connecting to a remote instance)
  3. Metadata writer (column descriptions + basetype push back to Keboola Storage API)

What's NOT being ported:

  • macOS desktop app (narrow use-case, WebSocket gateway covers most needs)
  • Linux user management (replaced by DuckDB RBAC)
  • rsync distribution (replaced by FastAPI API)
  • systemd services (replaced by Docker Compose)

1. Business Metrics in DuckDB

1.1 DuckDB Schema — metric_definitions table

New table in system.duckdb, added as part of schema migration v3→v4:

CREATE TABLE metric_definitions (
    id              VARCHAR PRIMARY KEY,     -- 'revenue/mrr'
    name            VARCHAR NOT NULL,        -- 'mrr'
    display_name    VARCHAR NOT NULL,        -- 'Monthly Recurring Revenue'
    category        VARCHAR NOT NULL,        -- 'revenue'
    description     TEXT,
    type            VARCHAR DEFAULT 'sum',   -- sum, count, ratio, comparison
    unit            VARCHAR,                 -- 'USD', 'percentage', 'count'
    grain           VARCHAR DEFAULT 'monthly', -- monthly, weekly, daily
    table_name      VARCHAR,                 -- primary table
    tables          VARCHAR[],               -- for JOIN metrics
    expression      VARCHAR,                 -- 'SUM(total_amount)'
    time_column     VARCHAR,                 -- 'order_date'
    dimensions      VARCHAR[],              -- ['channel', 'region']
    filters         VARCHAR[],              -- descriptive WHERE conditions
    synonyms        VARCHAR[],              -- for NL matching
    notes           VARCHAR[],              -- business rules
    sql             TEXT NOT NULL,           -- canonical SQL query
    sql_variants    JSON,                   -- {"by_channel": "SELECT ...", "by_region": "..."}
    validation      JSON,                   -- {"method": "...", "result": "..."}
    source          VARCHAR DEFAULT 'manual', -- 'yaml_import', 'manual', 'api'
    created_at      TIMESTAMP DEFAULT now(),
    updated_at      TIMESTAMP DEFAULT now()
);

1.2 Schema Versioning

Added to src/db.py as SCHEMA_VERSION = 4 migration (v3→v4). Migration creates both metric_definitions and column_metadata tables. Existing data untouched.

1.3 Repository (src/repositories/metrics.py)

Follows existing pattern from table_registry.py:

  • list(category=None) → all metrics, optionally filtered
  • get(metric_id) → single metric or None
  • create(**kwargs) → insert metric
  • update(metric_id, **kwargs) → update fields
  • delete(metric_id) → remove metric
  • find_by_table(table_name) → metrics referencing a table
  • find_by_synonym(term) → NL matching for Claude Code agent
  • import_from_yaml(yaml_path) → parse YAML, upsert into DuckDB, return count
  • export_to_yaml(output_dir) → DuckDB → YAML files, return count

1.4 YAML as Seed/Import Format

YAML files in docs/metrics/ serve as:

  • Starter pack — 10-15 generic SaaS metrics shipped with the project
  • Import sourceda metrics import docs/metrics/ loads into DuckDB
  • Export targetda metrics export dumps DuckDB → YAML (sharing, backup, version control)
  • Migration — on first run after upgrade: detect YAML without DuckDB records → auto-import

Format remains compatible with the internal repo (same fields as total_revenue.yml).

1.5 Migration Script (scripts/migrate_metrics_to_duckdb.py)

  1. Scans docs/metrics/*/*.yml via glob
  2. Parses YAML, maps fields to DuckDB columns
  3. sql_by_* variants → sql_variants JSON
  4. INSERT OR REPLACE into metric_definitions
  5. Idempotent — safe to run repeatedly

Auto-runs during schema migration v3→v4 if YAML files exist.

1.6 Metrics Index (docs/metrics/metrics.yml)

Master index for the YAML starter pack. After da metrics import, DuckDB becomes the source of truth. The YAML index is only used during import to define categories and discover files — it is NOT read at runtime.

version: "2.0"
categories:
  - name: revenue
    folder: revenue/
    metrics: [total_revenue, mrr, arr, churn_rate]
  - name: product_usage
    folder: product_usage/
    metrics: [active_users, feature_adoption]
  - name: sales
    folder: sales/
    metrics: [new_customers, upsell_expansion, pipeline_value]
  - name: operations
    folder: operations/
    metrics: [support_resolution_time, infrastructure_cost]

1.7 Starter Pack Metrics (10-15 generic)

Ported and generalized from internal repo, adapted for generic SaaS data:

Category Metric Internal source
Revenue total_revenue (exists), mrr, arr, churn_rate mrr.yml, new_arr.yml
Product Usage active_users, feature_adoption, usage_vs_limit usage_value.yml, usage_vs_limit.yml
Sales new_customers, upsell_expansion, pipeline_value upsell_expansion.yml, closed_won.yml
Operations support_resolution_time, infrastructure_cost resolution_time.yml, infra_cost.yml

SQL queries are generic templates referencing typical tables (orders, subscriptions, users, tickets). Users adapt to their schema.

1.8 CLI Command da metrics

da metrics list [--category revenue]     # list from DuckDB
da metrics show revenue/mrr              # detail
da metrics import docs/metrics/          # YAML → DuckDB
da metrics export [--dir ./export/]      # DuckDB → YAML
da metrics validate                      # verify consistency (tables exist?)
da metrics add                           # interactive wizard

1.9 API Endpoints

GET  /api/metrics                        → list categories and metrics
GET  /api/metrics/{category}/{name}      → metric detail
POST /api/admin/metrics                  → create/update metric
DELETE /api/admin/metrics/{id}           → delete metric
POST /api/admin/metrics/import           → YAML upload → DuckDB

1.10 Profiler Integration

src/profiler.py already has load_metrics() logic. Wire new src/repositories/metrics.py into profiler so profiles.json includes metrics assigned to tables. Read from DuckDB instead of scanning YAML.

1.11 CLAUDE.md Instructions

Add section to CLAUDE.md:

Before computing any business metric: da metrics show {category}/{name}, read the SQL and business rules, use the canonical SQL from the metric definition.


2. Analyst Bootstrap Flow

2.1 Two Bootstrap Modes

Server-side (already exists in da setup):

  • da setup initbootstraptest-connectionfirst-syncverify
  • Sets up instance (instance.yaml, .env, Docker)
  • No changes needed.

Analyst-side (new — equivalent of internal bootstrap.yaml):

  • Analyst connects local Claude Code to a remote Agnes instance
  • Downloads data, initializes DuckDB, sets up CLAUDE.md
  • Uses API instead of SSH/rsync

2.2 Flow: da analyst setup

New command (subcommand of da setup or standalone da analyst):

Step 1: detect_existing_project
  → looks for ./CLAUDE.md with Agnes identifier
  → if found: "Project already set up. Want to resync? (da sync)"
  → if not: continue

Step 2: connect_to_instance
  → asks for instance URL (https://data.acme.com)
  → asks for credentials (email/password or OAuth token)
  → GET /api/health → verify availability
  → POST /auth/token → obtain JWT
  → store token in .env or ~/.agnes/credentials

Step 3: create_workspace
  → creates directory structure:
    ./data/parquet/          ← downloaded data
    ./data/duckdb/           ← local analytics.duckdb
    ./data/metadata/         ← profiles, schema
    ./user/artifacts/        ← analyst work output
    ./user/sessions/         ← Claude Code session logs

Step 4: download_schema_and_metrics
  → GET /api/data/tables → list of available tables
  → GET /api/metrics → all metrics
  → saves as local JSON/YAML cache

Step 5: download_data
  → for each table the user has access to:
    GET /api/data/table/{id}/download → parquet
  → Rich progress bar

Step 6: initialize_duckdb
  → creates local analytics.duckdb
  → CREATE VIEW for each downloaded parquet
  → verify: SELECT count(*) from a few tables

Step 7: generate_claude_md
  → generates CLAUDE.md from template (see 2.3)
  → creates empty CLAUDE.local.md
  → writes .claude/settings.json

Step 8: verify
  → runs test query
  → prints: "Setup complete. X tables, Y metrics, Z rows."

2.3 CLAUDE.md Template (config/claude_md_template.txt)

Generated template for analysts, adapted from internal repo:

# {instance_name} — AI Data Analyst

## Rules
- Before computing any business metric: `da metrics show {category}/{name}`
- For current schema: read `data/metadata/schema.json`
- Do not use DESCRIBE/SHOW COLUMNS — read metadata files
- Save work output to `user/artifacts/`

## Metrics Workflow
1. `da metrics list` → identify relevant metric
2. `da metrics show revenue/mrr` → read SQL and rules
3. Use the SQL from the metric, adapt to the question

## Data Sync
- `da sync` → download current data from server
- Data refreshes every {sync_interval}

## Directory Structure
- `data/` — read-only (downloaded from server)
- `user/` — your workspace
- `CLAUDE.local.md` — your personal notes (never overwritten)

Placeholders {instance_name}, {sync_interval} substituted at generation time from instance config.

2.4 Returning-Session Detection

On every da CLI invocation:

  • Check data age (data/metadata/last_sync.json)
  • If >24h: suggest da sync
  • If CLAUDE.md missing: suggest da analyst setup

2.5 Sync Command Extensions

Extensions to existing da sync:

da sync                    # download updated data from server
da sync --docs-only        # just metadata and metrics (fast)
da sync --upload-local     # upload CLAUDE.local.md to server (corporate memory)

3. Metadata Writer

3.1 DuckDB Schema — column_metadata table

New table in system.duckdb (part of v3→v4 migration alongside metric_definitions):

CREATE TABLE column_metadata (
    table_id        VARCHAR NOT NULL,        -- FK → table_registry.id
    column_name     VARCHAR NOT NULL,
    basetype        VARCHAR,                 -- STRING, INTEGER, NUMERIC, FLOAT, BOOLEAN, DATE, TIMESTAMP
    description     VARCHAR,
    confidence      VARCHAR DEFAULT 'manual', -- high, medium, low, manual
    source          VARCHAR DEFAULT 'manual', -- 'manual', 'ai_enrichment', 'keboola_import'
    updated_at      TIMESTAMP DEFAULT now(),
    PRIMARY KEY (table_id, column_name)
);

3.2 Workflow (3 phases)

Phase 1: Discover — profiler or AI agent analyzes columns

da admin metadata discover [--table orders]
  → for each column without description:
    sample 500 rows → heuristics for basetype
    if Claude Code agent: generate descriptions
  → saves as "proposal" JSON (same format as internal repo)

Phase 2: Review — user reviews proposals

da admin metadata review proposals/sales_metadata_20260410.json
  → prints table: column | basetype | description | confidence
  → user can edit or confirm

Phase 3: Apply — write to DuckDB + optional push to Keboola

da admin metadata apply proposals/sales_metadata_20260410.json
  → INSERT/UPDATE into column_metadata in DuckDB
  → --push-to-source: if source_type=keboola, POST to Keboola Storage API
  → --dry-run: just show what would change

3.3 Push to Keboola Storage API

Ported from apply_metadata.py:

  • Provider: "ai-metadata-enrichment"
  • Keys: KBC.datatype.basetype, KBC.description
  • Endpoint: POST {stack_url}/v2/storage/tables/{table_id}/metadata
  • Token and stack_url from config/instance.yaml / env vars (not hardcoded JSON)

Only works for tables with source_type = 'keboola' in table_registry. For BigQuery/CSV/Jira, metadata is stored locally in DuckDB only.

3.4 API Endpoints

GET  /api/admin/metadata/{table_id}           → column metadata for table
POST /api/admin/metadata/{table_id}           → save metadata (JSON body)
POST /api/admin/metadata/{table_id}/push      → push to source system

3.5 Integration

  • Profiler: src/profiler.py enriches profiles.json with column_metadata from DuckDB
  • Catalog API: GET /api/catalog returns metadata alongside profiles
  • Claude Code agent: reads metadata via da admin metadata show {table} or from profiles.json

Implementation Summary

New Files

Component Files
Metrics src/repositories/metrics.py, src/metrics.py, cli/commands/metrics.py, app/api/metrics.py, scripts/migrate_metrics_to_duckdb.py, 10-15 YAML in docs/metrics/
Bootstrap cli/commands/analyst.py, config/claude_md_template.txt
Metadata src/repositories/column_metadata.py, app/api/metadata.py (metadata commands added as subcommands of da admin)

Modified Files

File Changes
src/db.py SCHEMA_VERSION=4, metric_definitions + column_metadata tables, v3→v4 migration
src/profiler.py Read metrics + column_metadata from DuckDB instead of YAML scan
app/main.py Register metrics + metadata routers
cli/main.py Register metrics + analyst commands
cli/commands/sync.py --docs-only, --upload-local flags
CLAUDE.md Metrics workflow instructions

Schema Migration v3→v4

Single migration creating both tables. Auto-imports existing YAML metrics if found. Idempotent.

Implementation Order

  1. Schema v4 + metrics (framework + starter pack + CLI + API)
  2. Bootstrap flow (analyst setup + CLAUDE.md template)
  3. Metadata writer (discover + apply + Keboola push)

Test Coverage

Each component gets its own test file following existing patterns:

  • tests/test_metrics.py — repository CRUD, YAML import/export, API endpoints
  • tests/test_analyst_bootstrap.py — setup flow (mocked API calls)
  • tests/test_column_metadata.py — repository CRUD, proposal format, Keboola push (mocked)