docs: add design spec for porting internal features to OSS

Covers business metrics in DuckDB, the analyst bootstrap flow, and the metadata writer, based on a comparison with the internal repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:49:34 +02:00 · 1ce632bc0b
commit 1ce632bc0b
parent 5836bcde4c
1 changed files with 382 additions and 0 deletions
--- a/docs/superpowers/specs/2026-04-10-porting-internal-features-design.md
+++ b/docs/superpowers/specs/2026-04-10-porting-internal-features-design.md
@@ -0,0 +1,382 @@
+# Porting Internal Features to OSS — Design Spec
+
+**Date:** 2026-04-10
+**Status:** Approved
+**Approach:** Metric-First (A): metrics → bootstrap → metadata writer
+
+## Context
+
+Comparison of `keboola/internal_ai_data_analyst` (private, Jan 2026) with the OSS version revealed three feature gaps worth porting. Many features initially thought to be missing (session collector, corporate memory, Jira SLA polling, CI/CD, Telegram bot) already exist in OSS.
+
+**Primary user:** Local Claude Code agent analyzing data. Web UI is secondary.
+
+**What's being ported:**
+1. Business metrics layer (20+ YAML metrics → DuckDB-backed framework + starter pack)
+2. Analyst bootstrap flow (onboarding for analysts connecting to a remote instance)
+3. Metadata writer (column descriptions + basetype push back to Keboola Storage API)
+
+**What's NOT being ported:**
+- macOS desktop app (narrow use-case, WebSocket gateway covers most needs)
+- Linux user management (replaced by DuckDB RBAC)
+- rsync distribution (replaced by FastAPI API)
+- systemd services (replaced by Docker Compose)
+
+---
+
+## 1. Business Metrics in DuckDB
+
+### 1.1 DuckDB Schema — `metric_definitions` table
+
+New table in `system.duckdb`, added as part of schema migration v3→v4:
+
+```sql
+CREATE TABLE metric_definitions (
+    id              VARCHAR PRIMARY KEY,     -- 'revenue/mrr'
+    name            VARCHAR NOT NULL,        -- 'mrr'
+    display_name    VARCHAR NOT NULL,        -- 'Monthly Recurring Revenue'
+    category        VARCHAR NOT NULL,        -- 'revenue'
+    description     TEXT,
+    type            VARCHAR DEFAULT 'sum',   -- sum, count, ratio, comparison
+    unit            VARCHAR,                 -- 'USD', 'percentage', 'count'
+    grain           VARCHAR DEFAULT 'monthly', -- monthly, weekly, daily
+    table_name      VARCHAR,                 -- primary table
+    tables          VARCHAR[],               -- for JOIN metrics
+    expression      VARCHAR,                 -- 'SUM(total_amount)'
+    time_column     VARCHAR,                 -- 'order_date'
+    dimensions      VARCHAR[],               -- ['channel', 'region']
+    filters         VARCHAR[],               -- descriptive WHERE conditions
+    synonyms        VARCHAR[],               -- for NL matching
+    notes           VARCHAR[],               -- business rules
+    sql             TEXT NOT NULL,           -- canonical SQL query
+    sql_variants    JSON,                    -- {"by_channel": "SELECT ...", "by_region": "..."}
+    validation      JSON,                    -- {"method": "...", "result": "..."}
+    source          VARCHAR DEFAULT 'manual', -- 'yaml_import', 'manual', 'api'
+    created_at      TIMESTAMP DEFAULT now(),
+    updated_at      TIMESTAMP DEFAULT now()
+);
+```
+
+### 1.2 Schema Versioning
+
+Implemented in `src/db.py` by bumping `SCHEMA_VERSION` to 4 and adding a v3→v4 migration. The migration creates both the `metric_definitions` and `column_metadata` tables; existing data is left untouched.
+
+### 1.3 Repository (`src/repositories/metrics.py`)
+
+Follows existing pattern from `table_registry.py`:
+
+- `list(category=None)` → all metrics, optionally filtered
+- `get(metric_id)` → single metric or None
+- `create(**kwargs)` → insert metric
+- `update(metric_id, **kwargs)` → update fields
+- `delete(metric_id)` → remove metric
+- `find_by_table(table_name)` → metrics referencing a table
+- `find_by_synonym(term)` → NL matching for Claude Code agent
+- `import_from_yaml(yaml_path)` → parse YAML, upsert into DuckDB, return count
+- `export_to_yaml(output_dir)` → DuckDB → YAML files, return count
+
+### 1.4 YAML as Seed/Import Format
+
+YAML files in `docs/metrics/` serve as:
+- **Starter pack** — 10-15 generic SaaS metrics shipped with the project
+- **Import source** — `da metrics import docs/metrics/` loads into DuckDB
+- **Export target** — `da metrics export` dumps DuckDB → YAML (sharing, backup, version control)
+- **Migration** — on first run after upgrade: detect YAML without DuckDB records → auto-import
+
+Format remains compatible with the internal repo (same fields as `total_revenue.yml`).
+
+### 1.5 Migration Script (`scripts/migrate_metrics_to_duckdb.py`)
+
+1. Scans `docs/metrics/*/*.yml` via glob
+2. Parses YAML, maps fields to DuckDB columns
+3. `sql_by_*` variants → `sql_variants` JSON
+4. INSERT OR REPLACE into `metric_definitions`
+5. Idempotent — safe to run repeatedly
+
+Auto-runs during schema migration v3→v4 if YAML files exist.
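Steps 2 and 3 above can be sketched as a pure mapping function. This is a minimal sketch, not the script's actual code: it assumes the YAML file has already been parsed into a dict, that variant keys carry the `sql_by_` prefix, and that `id` is derived as `category/name` (as in the `'revenue/mrr'` example). The name `yaml_to_row` is hypothetical.

```python
import json

# Columns copied as-is from the parsed YAML when present (see section 1.1).
PLAIN_COLUMNS = (
    "name", "display_name", "category", "description", "type", "unit",
    "grain", "table_name", "tables", "expression", "time_column",
    "dimensions", "filters", "synonyms", "notes", "sql",
)


def yaml_to_row(doc: dict) -> dict:
    """Map one parsed metric document to a metric_definitions row (hypothetical helper)."""
    row = {col: doc[col] for col in PLAIN_COLUMNS if col in doc}
    # Assumption: the primary key is "<category>/<name>".
    row["id"] = f"{doc['category']}/{doc['name']}"
    # Fold sql_by_channel, sql_by_region, ... into one JSON object
    # keyed by_channel, by_region, ...
    variants = {
        key[len("sql_"):]: value
        for key, value in doc.items()
        if key.startswith("sql_by_")
    }
    row["sql_variants"] = json.dumps(variants, sort_keys=True) if variants else None
    if "validation" in doc:
        row["validation"] = json.dumps(doc["validation"], sort_keys=True)
    row["source"] = "yaml_import"
    return row
```

Keeping the mapping free of I/O makes the idempotency claim easy to test: the same document always yields the same row, so a repeated `INSERT OR REPLACE` leaves the table unchanged.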
+
+### 1.6 Metrics Index (`docs/metrics/metrics.yml`)
+
+Master index for the YAML starter pack. After `da metrics import`, DuckDB becomes the source of truth. The YAML index is only used during import to define categories and discover files — it is NOT read at runtime.
+
+```yaml
+version: "2.0"
+categories:
+  - name: revenue
+    folder: revenue/
+    metrics: [total_revenue, mrr, arr, churn_rate]
+  - name: product_usage
+    folder: product_usage/
+    metrics: [active_users, feature_adoption]
+  - name: sales
+    folder: sales/
+    metrics: [new_customers, upsell_expansion, pipeline_value]
+  - name: operations
+    folder: operations/
+    metrics: [support_resolution_time, infrastructure_cost]
+```
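File discovery from this index could look like the following. This is a sketch under two assumptions: the index has already been parsed into a dict, and each metric lives at `<folder>/<metric>.yml` under `docs/metrics/`, which matches the `docs/metrics/*/*.yml` glob used by the migration script. The function name `metric_files` is hypothetical.

```python
from pathlib import PurePosixPath


def metric_files(index: dict, root: str = "docs/metrics") -> list[str]:
    """List the YAML paths named by a parsed metrics.yml index."""
    paths = []
    for category in index.get("categories", []):
        # PurePosixPath drops the trailing slash of "revenue/".
        folder = PurePosixPath(root) / category["folder"]
        for metric in category.get("metrics", []):
            paths.append(str(folder / f"{metric}.yml"))
    return paths
```

Comparing this list against the files actually present on disk is one way an importer could report metrics that are indexed but missing.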
+
+### 1.7 Starter Pack Metrics (10-15 generic)
+
+Ported and generalized from internal repo, adapted for generic SaaS data:
+
+| Category | Metric | Internal source |
+|---|---|---|
+| **Revenue** | `total_revenue` (exists), `mrr`, `arr`, `churn_rate` | mrr.yml, new_arr.yml |
+| **Product Usage** | `active_users`, `feature_adoption`, `usage_vs_limit` | usage_value.yml, usage_vs_limit.yml |
+| **Sales** | `new_customers`, `upsell_expansion`, `pipeline_value` | upsell_expansion.yml, closed_won.yml |
+| **Operations** | `support_resolution_time`, `infrastructure_cost` | resolution_time.yml, infra_cost.yml |
+
+SQL queries are **generic templates** referencing typical tables (`orders`, `subscriptions`, `users`, `tickets`). Users adapt to their schema.
+
+### 1.8 CLI Command `da metrics`
+
+```
+da metrics list [--category revenue]     # list from DuckDB
+da metrics show revenue/mrr              # detail
+da metrics import docs/metrics/          # YAML → DuckDB
+da metrics export [--dir ./export/]      # DuckDB → YAML
+da metrics validate                      # verify consistency (tables exist?)
+da metrics add                           # interactive wizard
+```
+
+### 1.9 API Endpoints
+
+```
+GET  /api/metrics                        → list categories and metrics
+GET  /api/metrics/{category}/{name}      → metric detail
+POST /api/admin/metrics                  → create/update metric
+DELETE /api/admin/metrics/{id}           → delete metric
+POST /api/admin/metrics/import           → YAML upload → DuckDB
+```
+
+### 1.10 Profiler Integration
+
+`src/profiler.py` already has `load_metrics()` logic. Wire the new `src/repositories/metrics.py` into the profiler so that `profiles.json` includes the metrics assigned to each table, reading them from DuckDB instead of scanning YAML.
+
+### 1.11 CLAUDE.md Instructions
+
+Add section to CLAUDE.md:
+> Before computing any business metric: `da metrics show {category}/{name}`, read the SQL and business rules, use the canonical SQL from the metric definition.
+
+---
+
+## 2. Analyst Bootstrap Flow
+
+### 2.1 Two Bootstrap Modes
+
+**Server-side** (already exists in `da setup`):
+- `da setup init` → `bootstrap` → `test-connection` → `first-sync` → `verify`
+- Sets up instance (instance.yaml, .env, Docker)
+- No changes needed.
+
+**Analyst-side** (new — equivalent of internal `bootstrap.yaml`):
+- Analyst connects local Claude Code to a remote Agnes instance
+- Downloads data, initializes DuckDB, sets up CLAUDE.md
+- Uses API instead of SSH/rsync
+
+### 2.2 Flow: `da analyst setup`
+
+New command (subcommand of `da setup` or standalone `da analyst`):
+
+```
+Step 1: detect_existing_project
+  → looks for ./CLAUDE.md with Agnes identifier
+  → if found: "Project already set up. Want to resync? (da sync)"
+  → if not: continue
+
+Step 2: connect_to_instance
+  → asks for instance URL (https://data.acme.com)
+  → asks for credentials (email/password or OAuth token)
+  → GET /api/health → verify availability
+  → POST /auth/token → obtain JWT
+  → store token in .env or ~/.agnes/credentials
+
+Step 3: create_workspace
+  → creates directory structure:
+    ./data/parquet/          ← downloaded data
+    ./data/duckdb/           ← local analytics.duckdb
+    ./data/metadata/         ← profiles, schema
+    ./user/artifacts/        ← analyst work output
+    ./user/sessions/         ← Claude Code session logs
+
+Step 4: download_schema_and_metrics
+  → GET /api/data/tables → list of available tables
+  → GET /api/metrics → all metrics
+  → saves as local JSON/YAML cache
+
+Step 5: download_data
+  → for each table the user has access to:
+    GET /api/data/table/{id}/download → parquet
+  → Rich progress bar
+
+Step 6: initialize_duckdb
+  → creates local analytics.duckdb
+  → CREATE VIEW for each downloaded parquet
+  → verify: SELECT count(*) from a few tables
+
+Step 7: generate_claude_md
+  → generates CLAUDE.md from template (see 2.3)
+  → creates empty CLAUDE.local.md
+  → writes .claude/settings.json
+
+Step 8: verify
+  → runs test query
+  → prints: "Setup complete. X tables, Y metrics, Z rows."
+```
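Step 6 can be illustrated by generating the view DDL as plain strings. This is a sketch, assuming one parquet file per table named `<table>.parquet` and that views are defined over DuckDB's `read_parquet` table function; the helper name `view_statements` is hypothetical, and executing the statements against `analytics.duckdb` is left to the caller.

```python
from pathlib import PurePosixPath


def view_statements(parquet_paths: list[str]) -> list[str]:
    """Build one CREATE VIEW statement per downloaded parquet file."""
    statements = []
    for raw in sorted(parquet_paths):
        path = PurePosixPath(raw)
        view = path.stem.replace('"', '""')      # escape identifier quotes
        source = str(path).replace("'", "''")    # escape string-literal quotes
        statements.append(
            f'CREATE OR REPLACE VIEW "{view}" AS '
            f"SELECT * FROM read_parquet('{source}')"
        )
    return statements
```

Using `CREATE OR REPLACE` keeps the step safe to re-run after a later `da sync`.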
+
+### 2.3 CLAUDE.md Template (`config/claude_md_template.txt`)
+
+Generated template for analysts, adapted from internal repo:
+
+```markdown
+# {instance_name} — AI Data Analyst
+
+## Rules
+- Before computing any business metric: `da metrics show {category}/{name}`
+- For current schema: read `data/metadata/schema.json`
+- Do not use DESCRIBE/SHOW COLUMNS — read metadata files
+- Save work output to `user/artifacts/`
+
+## Metrics Workflow
+1. `da metrics list` → identify relevant metric
+2. `da metrics show revenue/mrr` → read SQL and rules
+3. Use the SQL from the metric, adapt to the question
+
+## Data Sync
+- `da sync` → download current data from server
+- Data refreshes every {sync_interval}
+
+## Directory Structure
+- `data/` — read-only (downloaded from server)
+- `user/` — your workspace
+- `CLAUDE.local.md` — your personal notes (never overwritten)
+```
+
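+The placeholders `{instance_name}` and `{sync_interval}` are substituted at generation time from the instance config. `{category}/{name}` in the Rules section is literal text for the agent and must be left as is.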
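One detail worth sketching: the template also contains `{category}/{name}`, which should survive generation unchanged, so a plain `str.format` call would fail with a `KeyError`. The following is a minimal sketch of a substitution that only touches known keys; `render_template` is a hypothetical name, not an existing function in the repo.

```python
import re


def render_template(template: str, values: dict[str, str]) -> str:
    """Replace only the {key} placeholders whose key appears in `values`."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        # Unknown placeholders such as {category} are returned untouched.
        return values.get(key, match.group(0))

    return re.sub(r"\{(\w+)\}", substitute, template)
```

For example, rendering with `{"instance_name": "Acme", "sync_interval": "1 hour"}` fills in the title and the sync line while keeping `da metrics show {category}/{name}` verbatim.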
+
+### 2.4 Returning-Session Detection
+
+On every `da` CLI invocation:
+- Check data age (`data/metadata/last_sync.json`)
+- If >24h: suggest `da sync`
+- If CLAUDE.md missing: suggest `da analyst setup`
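The age check could be as small as the function below. It is a sketch under assumptions: the spec does not define the contents of `last_sync.json`, so the timestamp is taken here as an ISO 8601 string already read from that file, naive timestamps are treated as UTC, and `sync_hint` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(hours=24)


def sync_hint(last_sync_iso: Optional[str], now: datetime) -> Optional[str]:
    """Return a suggestion for the user, or None if the data is fresh."""
    if last_sync_iso is None:
        return "No sync recorded. Run `da sync`."
    last_sync = datetime.fromisoformat(last_sync_iso)
    if last_sync.tzinfo is None:
        # Assumption: timestamps without an offset are UTC.
        last_sync = last_sync.replace(tzinfo=timezone.utc)
    if now - last_sync > MAX_AGE:
        return "Data is older than 24h. Run `da sync`."
    return None
```

Passing `now` in as a timezone-aware argument, rather than calling `datetime.now()` inside, keeps the check deterministic in tests.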
+
+### 2.5 Sync Command Extensions
+
+Extensions to existing `da sync`:
+```
+da sync                    # download updated data from server
+da sync --docs-only        # just metadata and metrics (fast)
+da sync --upload-local     # upload CLAUDE.local.md to server (corporate memory)
+```
+
+---
+
+## 3. Metadata Writer
+
+### 3.1 DuckDB Schema — `column_metadata` table
+
+New table in `system.duckdb` (part of v3→v4 migration alongside `metric_definitions`):
+
+```sql
+CREATE TABLE column_metadata (
+    table_id        VARCHAR NOT NULL,        -- FK → table_registry.id
+    column_name     VARCHAR NOT NULL,
+    basetype        VARCHAR,                 -- STRING, INTEGER, NUMERIC, FLOAT, BOOLEAN, DATE, TIMESTAMP
+    description     VARCHAR,
+    confidence      VARCHAR DEFAULT 'manual', -- high, medium, low, manual
+    source          VARCHAR DEFAULT 'manual', -- 'manual', 'ai_enrichment', 'keboola_import'
+    updated_at      TIMESTAMP DEFAULT now(),
+    PRIMARY KEY (table_id, column_name)
+);
+```
+
+### 3.2 Workflow (3 phases)
+
+**Phase 1: Discover** — profiler or AI agent analyzes columns
+```
+da admin metadata discover [--table orders]
+  → for each column without description:
+    sample 500 rows → heuristics for basetype
+    if Claude Code agent: generate descriptions
+  → saves as "proposal" JSON (same format as internal repo)
+```
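The basetype heuristics are not spelled out in the spec. One plausible shape, shown purely as an illustration, is to try progressively looser parsers over the sampled values. The sketch assumes samples arrive as strings, and it cannot tell `NUMERIC` from `FLOAT`, so every decimal value is reported as `FLOAT`; `guess_basetype` is a hypothetical name.

```python
from datetime import date, datetime


def _all_parse(values, parse) -> bool:
    """True if `parse` accepts every value without raising ValueError."""
    try:
        for value in values:
            parse(value)
    except ValueError:
        return False
    return True


def guess_basetype(samples: list[str]) -> str:
    """Guess a basetype from sampled string values; falls back to STRING."""
    values = [s.strip() for s in samples if s is not None and s.strip()]
    if not values:
        return "STRING"
    if all(v.lower() in ("true", "false") for v in values):
        return "BOOLEAN"
    if _all_parse(values, int):
        return "INTEGER"
    if _all_parse(values, float):
        return "FLOAT"
    if _all_parse(values, date.fromisoformat):
        return "DATE"
    if _all_parse(values, datetime.fromisoformat):
        return "TIMESTAMP"
    return "STRING"
```

A guess like this would naturally map to a `confidence` below `high` in the proposal, since a 500-row sample can miss outliers.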
+
+**Phase 2: Review** — user reviews proposals
+```
+da admin metadata review proposals/sales_metadata_20260410.json
+  → prints table: column | basetype | description | confidence
+  → user can edit or confirm
+```
+
+**Phase 3: Apply** — write to DuckDB + optional push to Keboola
+```
+da admin metadata apply proposals/sales_metadata_20260410.json
+  → INSERT/UPDATE into column_metadata in DuckDB
+  → --push-to-source: if source_type=keboola, POST to Keboola Storage API
+  → --dry-run: just show what would change
+```
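The `--dry-run` behavior amounts to a diff between stored and proposed metadata. The sketch below assumes both sides have been flattened to a mapping of column name to `{"basetype": ..., "description": ...}`; the real proposal JSON format comes from the internal repo and is not reproduced here, and `plan_changes` is a hypothetical name.

```python
def plan_changes(current: dict, proposed: dict) -> list[tuple]:
    """Return (column, field, old, new) for every field that would change."""
    changes = []
    for column, fields in sorted(proposed.items()):
        existing = current.get(column, {})
        for field in ("basetype", "description"):
            new = fields.get(field)
            old = existing.get(field)
            # A missing proposed value never overwrites stored metadata.
            if new is not None and new != old:
                changes.append((column, field, old, new))
    return changes
```

An empty result means `apply` would be a no-op, which is also a convenient idempotency check.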
+
+### 3.3 Push to Keboola Storage API
+
+Ported from `apply_metadata.py`:
+- Provider: `"ai-metadata-enrichment"`
+- Keys: `KBC.datatype.basetype`, `KBC.description`
+- Endpoint: `POST {stack_url}/v2/storage/tables/{table_id}/metadata`
+- Token and stack_url from `config/instance.yaml` / env vars (not hardcoded JSON)
+
+Only works for tables with `source_type = 'keboola'` in `table_registry`. For BigQuery/CSV/Jira, metadata is stored locally in DuckDB only.
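The mapping from `column_metadata` fields to the two metadata keys can be illustrated as follows. This is a sketch only: `column_entries` and `KEY_BY_FIELD` are hypothetical names, the input rows are assumed to be dicts shaped like `column_metadata` records, and how the entries (together with the provider name) are serialized into the Storage API request body is deliberately left out because the spec does not state it.

```python
# Field-to-key mapping taken from the keys listed in this section.
KEY_BY_FIELD = {
    "basetype": "KBC.datatype.basetype",
    "description": "KBC.description",
}


def column_entries(rows: list[dict]) -> dict[str, list[dict]]:
    """Group column_metadata rows into key/value entries per column."""
    entries = {}
    for row in rows:
        pairs = [
            {"key": key, "value": row[field]}
            for field, key in KEY_BY_FIELD.items()
            if row.get(field)  # skip empty or missing values
        ]
        if pairs:
            entries[row["column_name"]] = pairs
    return entries
```

Columns with neither a basetype nor a description produce no entry, so nothing is pushed for them.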
+
+### 3.4 API Endpoints
+
+```
+GET  /api/admin/metadata/{table_id}           → column metadata for table
+POST /api/admin/metadata/{table_id}           → save metadata (JSON body)
+POST /api/admin/metadata/{table_id}/push      → push to source system
+```
+
+### 3.5 Integration
+
+- **Profiler**: `src/profiler.py` enriches `profiles.json` with `column_metadata` from DuckDB
+- **Catalog API**: `GET /api/catalog` returns metadata alongside profiles
+- **Claude Code agent**: reads metadata via `da admin metadata show {table}` or from `profiles.json`
+
+---
+
+## Implementation Summary
+
+### New Files
+
+| Component | Files |
+|---|---|
+| **Metrics** | `src/repositories/metrics.py`, `src/metrics.py`, `cli/commands/metrics.py`, `app/api/metrics.py`, `scripts/migrate_metrics_to_duckdb.py`, 10-15 YAML in `docs/metrics/` |
+| **Bootstrap** | `cli/commands/analyst.py`, `config/claude_md_template.txt` |
+| **Metadata** | `src/repositories/column_metadata.py`, `app/api/metadata.py` (metadata commands added as subcommands of `da admin`) |
+
+### Modified Files
+
+| File | Changes |
+|---|---|
+| `src/db.py` | SCHEMA_VERSION=4, `metric_definitions` + `column_metadata` tables, v3→v4 migration |
+| `src/profiler.py` | Read metrics + column_metadata from DuckDB instead of YAML scan |
+| `app/main.py` | Register metrics + metadata routers |
+| `cli/main.py` | Register `metrics` + `analyst` commands |
+| `cli/commands/sync.py` | `--docs-only`, `--upload-local` flags |
+| `CLAUDE.md` | Metrics workflow instructions |
+
+### Schema Migration v3→v4
+
+Single migration creating both tables. Auto-imports existing YAML metrics if found. Idempotent.
+
+### Implementation Order
+
+1. Schema v4 + metrics (framework + starter pack + CLI + API)
+2. Bootstrap flow (analyst setup + CLAUDE.md template)
+3. Metadata writer (discover + apply + Keboola push)
+
+### Test Coverage
+
+Each component gets its own test file following existing patterns:
+- `tests/test_metrics.py` — repository CRUD, YAML import/export, API endpoints
+- `tests/test_analyst_bootstrap.py` — setup flow (mocked API calls)
+- `tests/test_column_metadata.py` — repository CRUD, proposal format, Keboola push (mocked)