Document Data Catalog and Profiler pipeline in auto-install guide

- Add architecture diagram showing data flow from instance config
  through profiler to webapp
- Explain folder_mapping dual purpose (catalog categories + file paths)
- Add Step 6c for running the profiler
- Document foreign_keys for relationship diagrams
- Explain profiles.json fallback for catalog header stats
- Expand checklist with profiler verification steps
Petr 2026-03-10 22:14:45 +01:00
parent 28543d98b1
commit f685dc357f


@ -344,7 +344,52 @@ After server is set up, analysts self-onboard via the webapp:
## Step 6: Sample Data (Try Without a Data Adapter)
Before connecting a real data source, you can load sample data to verify the full pipeline
(Parquet files, Data Catalog, analyst rsync, Claude Code analysis).
(Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).
### How the Data Catalog & Profiler Pipeline Works
```
Instance repo                           Server filesystem         Webapp
─────────────                           ────────────────          ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                        │
   foreign_keys)                                  │
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                            │
   labels, icons, order)                          │
                                        /data/src_data/parquet/*.parquet
                                        ┌─────────┴──────────┐
                                        ▼                    ▼
                           python -m src.profiler    _load_catalog_data()
                                        │                    │
                                        ▼                    ▼
                           /data/src_data/metadata/    /catalog page
                             profiles.json           (categories + tables)
                             ┌──────────┴──────────┐
                             ▼                     ▼
                   /api/catalog/profile/   _load_data_stats()
                   (per-table stats,       (header: "9 tables,
                    columns, alerts,        ~217K rows total")
                    relationships)
```
Key files and their roles:
| File | Location | Purpose |
|------|----------|---------|
| `data_description.md` | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys |
| `instance.yaml` | Instance repo | Catalog category labels, icons, display order |
| `*.parquet` | `/data/src_data/parquet/` | Actual data files (flat or in subfolders) |
| `profiles.json` | `/data/src_data/metadata/` | Profiler output: statistics, alerts, relationships per table |
| `sync_state.json` | `/data/src_data/metadata/` | Sync process stats (optional; profiler provides fallback) |
**Folder mapping** serves a dual purpose: it maps table IDs to catalog categories for the UI
and to filesystem paths for the profiler. The profiler auto-detects whether the layout is flat
(all Parquet files in one directory) or uses subfolders (Keboola-style `parquet/<folder>/<table>.parquet`).
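The layout detection described above could look roughly like the sketch below. It is not the profiler's actual code: `detect_layout` and `resolve_table_path` are hypothetical names, and the rule "any `*.parquet` directly under the root means flat" is an assumption made for illustration.

```python
from pathlib import Path


def detect_layout(parquet_root: Path) -> str:
    """Return 'flat', 'subfolder', or 'empty' for a parquet directory.

    Assumption: a root with at least one *.parquet file directly inside it
    is treated as flat; otherwise one level of subfolders is checked.
    """
    if any(parquet_root.glob("*.parquet")):
        return "flat"
    if any(parquet_root.glob("*/*.parquet")):
        return "subfolder"
    return "empty"


def resolve_table_path(parquet_root: Path, folder: str, table: str) -> Path:
    """Build the expected file path for a table under the detected layout."""
    if detect_layout(parquet_root) == "flat":
        # Flat layout: <root>/<table>.parquet
        return parquet_root / f"{table}.parquet"
    # Subfolder layout: <root>/<folder>/<table>.parquet
    return parquet_root / folder / f"{table}.parquet"
```

With a flat sample-data directory, `resolve_table_path(Path("/data/src_data/parquet"), "sales", "orders")` would point at `/data/src_data/parquet/orders.parquet`.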
### 6a: Generate Parquet Files
@ -371,8 +416,13 @@ See `docs/sample-data.md` for the full data model and built-in analytical patter
The Data Catalog reads from two files in the **instance repo**:
1. **`config/data_description.md`** - table definitions with YAML block (tables, folder_mapping)
2. **`config/instance.yaml`** - catalog categories (label, icon, order)
1. **`config/data_description.md`** - YAML block with `folder_mapping`, `tables` (id, name, description, primary_key, sync_strategy, foreign_keys)
2. **`config/instance.yaml`** - `catalog.categories` with label, icon per category + `catalog.order`
The `folder_mapping` maps bucket prefixes from table IDs to category names. Example:
table ID `sample.sales.orders` → bucket `sample.sales` → folder `sales` → category "Sales & Orders".
Tables with `foreign_keys` will show interactive relationship diagrams in the profiler modal.
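As a rough illustration of the lookup chain above (table ID → bucket → folder → category label), here is a minimal sketch. The function name `category_label` and the in-memory dictionaries are assumptions; the real webapp reads these values from `data_description.md` and `instance.yaml`.

```python
def category_label(table_id: str,
                   folder_mapping: dict[str, str],
                   categories: dict[str, dict]) -> str:
    """Resolve a table ID to its catalog category label."""
    # "sample.sales.orders" -> bucket "sample.sales", table "orders"
    bucket, _, _table = table_id.rpartition(".")
    # bucket "sample.sales" -> folder / category key "sales"
    folder = folder_mapping[bucket]
    # category key "sales" -> display label "Sales & Orders"
    return categories[folder]["label"]


# Illustrative values mirroring the example in the text.
folder_mapping = {"sample.sales": "sales"}
categories = {"sales": {"label": "Sales & Orders"}}

print(category_label("sample.sales.orders", folder_mapping, categories))
# Sales & Orders
```

A table whose bucket is missing from `folder_mapping` raises `KeyError` in this sketch; how the real catalog handles unmapped buckets is not covered here.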
Add `data_description.md` to the instance repo with the sample tables:
@ -380,7 +430,18 @@ Add `data_description.md` to the instance repo with the sample tables:
cd /opt/data-analyst/instance
# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with folder_mapping + tables list
# Must contain a ```yaml block with:
# folder_mapping: { "bucket.prefix": "category_key", ... }
# tables: list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
# foreign_keys:
# - column: "customer_id"
# references: "customers.customer_id"
# description: "Ordering customer"
# Add catalog categories to instance.yaml:
cat >> config/instance.yaml << 'YAML'
@ -421,6 +482,35 @@ ln -sf /opt/data-analyst/instance/config/data_description.md \
systemctl restart webapp
```
### 6c: Run Data Profiler
The profiler reads the Parquet files and `data_description.md`, then generates `profiles.json`
with per-table statistics, column analysis, data quality alerts, and relationship maps.
```bash
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler
```
Output: `/data/src_data/metadata/profiles.json` (auto-created, readable by webapp).
The profiler provides:
- **Overview**: row count, column count, file size, date coverage, missing cell %
- **Columns**: type distribution, top values, histograms for numeric columns
- **Insights**: data quality alerts (high missing %, imbalanced categories, high cardinality)
- **Relationships**: FK diagram built from `foreign_keys` in `data_description.md`
- **Sample**: first 5 rows of the table
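To make the Relationships item concrete, the sketch below turns `foreign_keys` entries into diagram edges. It assumes the table definitions have already been parsed into Python dictionaries and that `references` always has the form `<table>.<column>`; `relationship_edges` is a hypothetical helper, not the profiler's API.

```python
def relationship_edges(tables: list[dict]) -> list[tuple[str, str, str, str]]:
    """Return (source_table, source_column, target_table, target_column) edges."""
    edges = []
    for table in tables:
        # foreign_keys is optional, so tables without it contribute no edges.
        for fk in table.get("foreign_keys", []):
            # "customers.customer_id" -> ("customers", "customer_id")
            target_table, _, target_column = fk["references"].rpartition(".")
            edges.append((table["name"], fk["column"], target_table, target_column))
    return edges


# Illustrative table definitions (only the fields this sketch needs).
tables = [
    {
        "name": "orders",
        "foreign_keys": [
            {
                "column": "customer_id",
                "references": "customers.customer_id",
                "description": "Ordering customer",
            }
        ],
    },
    {"name": "customers"},
]

print(relationship_edges(tables))
# [('orders', 'customer_id', 'customers', 'customer_id')]
```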
Without `sync_state.json` (no data adapter running), the profiler computes file sizes
directly from parquet files, and the catalog header derives table/row counts from `profiles.json`.
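The header fallback can be pictured with the sketch below. The layout of `profiles.json` is not specified in this guide, so the top-level `tables` key and the per-table `row_count` field are assumptions, as is the function name `header_from_profiles`.

```python
import json
from pathlib import Path


def header_from_profiles(profiles_path: Path) -> str:
    """Build catalog header text from profiler output alone."""
    profiles = json.loads(profiles_path.read_text(encoding="utf-8"))
    tables = profiles.get("tables", {})  # assumed top-level key
    # Assumed per-table field; tables without it count as zero rows.
    total_rows = sum(t.get("row_count", 0) for t in tables.values())
    return f"{len(tables)} tables, ~{total_rows // 1000}K rows total"
```

Because the file is read on every call, re-running the profiler changes the header text without any restart, which matches the behavior described for the webapp.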
To re-run after data changes:
```bash
cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request
```
### Step 6 Checklist
| # | Check | Expected |
@ -428,8 +518,12 @@ systemctl restart webapp
| 6.1 | Parquet files | `ls /data/src_data/parquet/*.parquet` shows 9 files |
| 6.2 | Permissions | Files owned by root:data-ops, group-readable |
| 6.3 | Data Catalog | `/catalog` page shows 6 categories with 9 tables |
| 6.4 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.5 | DuckDB loads | `SELECT count(*) FROM read_parquet('orders.parquet')` returns rows |
| 6.4 | Catalog header | "9 tables, ~217K rows total" (from profiles.json) |
| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights |
| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links |
| 6.7 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) |
| 6.8 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.9 | DuckDB loads | `SELECT count(*) FROM read_parquet('orders.parquet')` returns rows |
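
Some of these checks can be scripted. The sketch below covers check 6.1 and the precondition for the profiler-related checks (a non-empty `profiles.json`); the function names are hypothetical and the remaining checks still need the webapp or an analyst machine.

```python
from pathlib import Path


def check_parquet_files(parquet_dir: Path, expected: int = 9) -> bool:
    """Check 6.1: the directory holds the expected number of .parquet files."""
    if not parquet_dir.is_dir():
        return False
    return len(list(parquet_dir.glob("*.parquet"))) == expected


def check_profiles_exist(metadata_dir: Path) -> bool:
    """Precondition for the profiler checks: profiles.json exists and is non-empty."""
    profiles = metadata_dir / "profiles.json"
    return profiles.is_file() and profiles.stat().st_size > 0
```

On the server these would be called with `Path("/data/src_data/parquet")` and `Path("/data/src_data/metadata")` respectively.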
## Step 7: Real Data Source (Production)