Petr f685dc357f Document Data Catalog and Profiler pipeline in auto-install guide

- Add architecture diagram showing data flow from instance config
  through profiler to webapp
- Explain folder_mapping dual purpose (catalog categories + file paths)
- Add Step 6c for running the profiler
- Document foreign_keys for relationship diagrams
- Explain profiles.json fallback for catalog header stats
- Expand checklist with profiler verification steps

2026-03-10 22:14:45 +01:00

19 KiB


Automated Installation Guide

Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.

Two repos are involved:

OSS repo (public/private): application code (padak/tmp_oss)
Instance repo (private): your config, secrets template, data schema (padak/tmp_oss_cfg)

Architecture on Server

/opt/data-analyst/
├── repo/              # OSS repo clone
│   ├── config/
│   │   └── instance.yaml -> ../../instance/config/instance.yaml  (symlink)
│   ├── webapp/
│   ├── server/
│   └── ...
├── instance/          # Private instance repo clone
│   ├── config/
│   │   ├── instance.yaml          # Branding, auth domains, data source
│   │   └── data_description.md    # Data schema (when configured)
│   ├── docs/setup/                # Custom CLAUDE.md template, etc.
│   ├── .env.example               # Secrets template
│   └── README.md
├── .env               # Secrets (not in git, from .env.example)
├── .venv/             # Python virtual environment
└── logs/              # Application logs

Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them.

Prerequisites

DigitalOcean API token with ssh_key scope (or any Ubuntu 24.04 VM)
Two GitHub repos: one for OSS code, one for private instance config
SSH key on your local machine for server access

Known Issues

python3-venv must be installed before running server/setup.sh (Ubuntu 24.04 does not ship it by default); see Step 1b
webapp-setup.sh generates an SSL nginx config; for IP-only deployments, replace it with the HTTP-only config from Step 2a
DigitalOcean cloud-init cannot override password expiry; pass SSH keys through the ssh_keys API field instead (Step 1a)

Step 0: Create Repos

# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main

# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup

Step 1: Provision VM

1a: Create Droplet (DigitalOcean)

# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
    "https://api.digitalocean.com/v2/account/keys"

# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{
      "name":"data-analyst-1",
      "size":"s-1vcpu-2gb",
      "region":"ams3",
      "image":"ubuntu-24-04-x64",
      "ssh_keys":["KEY_ID_OR_FINGERPRINT"]
    }' \
    "https://api.digitalocean.com/v2/droplets"

1b: Install Prerequisites

ssh root@DROPLET_IP

# Auto-updates run on first boot and can hold the apt lock;
# if apt reports a lock error, wait a minute and run the command again
apt update && apt install -y python3.12-venv python3-pip
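If the first-boot apt lock is a recurring problem in scripted installs, a small retry wrapper is one way to handle it. This is a minimal sketch, not something shipped in the repo; the function name `retry` and its arguments are illustrative.

```bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Hypothetical helper for illustration only.
retry() {
    local max_attempts="$1"
    local delay="$2"
    shift 2
    local attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}
```

Example use on the server: `retry 30 10 apt-get update`, followed by `retry 30 10 apt-get install -y python3.12-venv python3-pip`.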

1c: Generate Deploy Keys

Two separate keys - one per repo, for security isolation:

# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"

# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"

Add each public key as a deploy key on its respective GitHub repo:

deploy_key.pub -> OSS repo Settings > Deploy Keys
instance_key.pub -> Instance repo Settings > Deploy Keys

Configure SSH to use the right key per repo:

cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
  HostName github.com
  IdentityFile /root/.ssh/deploy_key
  # accept-new records the host key on first connect and rejects later changes
  StrictHostKeyChecking accept-new

# Instance config repo (private)
Host github-cfg
  HostName github.com
  IdentityFile /root/.ssh/instance_key
  StrictHostKeyChecking accept-new
EOF
chmod 600 /root/.ssh/config

1d: Clone OSS Repo & Run Setup

git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh

Step 1 Checklist

#	Check	Expected
1.1	Groups	data-ops, dataread, data-private exist
1.2	Deploy user	uid deploy, groups: deploy, data-ops
1.3	Directories	/opt/data-analyst/{repo,.venv,logs}
1.4	Python venv	Flask loads in .venv
1.5	Scripts	add-analyst, list-analysts in /usr/local/bin
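Checks 1.1 and 1.3 can be scripted. The helpers below are a sketch with illustrative names (`missing_groups`, `missing_dirs`); they are not part of the repo and only report what is absent.

```bash
# missing_groups GROUP... : print each group that does not exist (check 1.1).
# Hypothetical helper for illustration only.
missing_groups() {
    local g
    for g in "$@"; do
        getent group "$g" >/dev/null 2>&1 || echo "$g"
    done
}

# missing_dirs BASE NAME... : print each BASE/NAME that is not a directory (check 1.3).
missing_dirs() {
    local base="$1"
    shift
    local d
    for d in "$@"; do
        [ -d "$base/$d" ] || echo "$base/$d"
    done
}
```

On a correctly provisioned server, `missing_groups data-ops dataread data-private` and `missing_dirs /opt/data-analyst repo .venv logs` should both print nothing.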

Step 2: Webapp Setup

2a: Run webapp-setup.sh

export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh

For IP-only (no SSL), replace nginx config:

cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
    listen 80;
    server_name _;
    location / {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
    location /static/ {
        alias /opt/data-analyst/repo/webapp/static/;
        expires 1d;
    }
    location /health {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        access_log off;
    }
}
NGINX
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx

2b: Create .env

SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')

cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF

chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
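A quick way to confirm that the file contains every expected variable is to look for a `KEY=` line per key. The helper below is a sketch with an illustrative name (`missing_env_keys`); it checks presence only, not the values.

```bash
# missing_env_keys FILE KEY... : print each KEY that has no "KEY=" line in FILE.
# Hypothetical helper for illustration only.
missing_env_keys() {
    local file="$1"
    shift
    local key
    for key in "$@"; do
        grep -q "^${key}=" "$file" || echo "$key"
    done
}
```

For the file above, `missing_env_keys /opt/data-analyst/.env WEBAPP_SECRET_KEY SERVER_HOST SERVER_HOSTNAME DATA_SOURCE DATA_DIR` should print nothing, and `stat -c '%a %U:%G' /opt/data-analyst/.env` should report `640 root:data-ops`.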

2c: Create Data Directories & Start

mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data

mkdir -p /run/webapp
chown www-data:www-data /run/webapp

systemctl daemon-reload
systemctl start webapp
systemctl enable webapp

Step 2 Checklist

#	Check	Expected
2.1	Nginx	active, port 80
2.2	Webapp	active (gunicorn)
2.3	Health	`curl http://IP/health` returns JSON
2.4	Login page	HTTP 200 at /login

Step 3: Instance Configuration (Private Repo)

3a: Clone Instance Repo

git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance

3b: Initialize Instance Config (if empty repo)

If this is a fresh instance repo, create the initial config:

cd /opt/data-analyst/instance
mkdir -p config docs/setup

cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"

server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"

auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"

data_source:
  type: "local"

catalog:
  categories: {}
YAML

# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV

cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI

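# Git needs a commit identity; on a fresh server set one for this repo first
git config user.name "YOUR_NAME"
git config user.email "YOUR_EMAIL"

git add -A && git commit -m "Initial instance config" && git push origin main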

3c: Symlink Config into OSS Repo

# Remove any existing instance.yaml (from manual setup) and symlink
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml

# Symlink data_description.md (for the Data Catalog); the link stays dangling until the file is added in Step 6
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

Step 3 Checklist

#	Check	Expected
3.1	Instance repo	/opt/data-analyst/instance/ exists
3.2	Symlink	repo/config/instance.yaml -> /opt/data-analyst/instance/config/instance.yaml (absolute target, as created in Step 3c)
3.3	Webapp loads	Instance name shown on login page
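Check 3.2 can be automated by comparing where the link resolves with the expected file. This is a sketch using GNU `readlink -f`; the function name `link_points_to` is illustrative and not part of the repo.

```bash
# link_points_to LINK EXPECTED : succeed if LINK is a symlink that resolves
# to the same path as EXPECTED. Works for absolute and relative link targets.
# Hypothetical helper for illustration only.
link_points_to() {
    [ -L "$1" ] || return 1
    [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}
```

Example: `link_points_to /opt/data-analyst/repo/config/instance.yaml /opt/data-analyst/instance/config/instance.yaml && echo OK`. A regular file copied into place fails the check, which catches the case where a manual setup left a real instance.yaml behind.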

Step 4: Authentication

Email magic link works without any external service.

Login page shows "Sign in with Email"
User enters email with allowed domain
Without SMTP: magic link shown in browser (dev mode)
With SMTP: link sent via email
Click link -> logged in -> dashboard

Optional: add Google OAuth by setting real GOOGLE_CLIENT_ID/GOOGLE_CLIENT_SECRET.

Step 4 Checklist

#	Check	Expected
4.1	Email auth	"Sign in with Email" on login page
4.2	Magic link	Generated for valid domain email
4.3	Domain check	Rejects wrong domains
4.4	Login flow	Magic link -> dashboard with session

Step 5: Onboarding Flow (End-User)

After the server is set up, analysts self-onboard via the webapp:

Visit http://YOUR_SERVER/login and sign in with email
Dashboard shows "Get Started" with 4 steps:
- Create project folder (mkdir -p data-analyst && cd data-analyst)
- Generate SSH key (ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N '')
- Copy public key (cat ~/.ssh/data_analyst_server.pub)
- Paste key into form, click "Create Account"
After account creation, dashboard shows "Set up your local environment"
User runs claude in their project folder, pastes setup instructions
Claude Code configures SSH, rsyncs data, sets up Python + DuckDB

Step 6: Sample Data (Try Without a Data Adapter)

Before connecting a real data source, you can load sample data to verify the full pipeline (Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).

How the Data Catalog & Profiler Pipeline Works

Instance repo                        Server filesystem            Webapp
─────────────                        ────────────────            ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                │
   foreign_keys)                          │
                                          ▼
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                    │
   labels, icons, order)                  │
                                          ▼
                          /data/src_data/parquet/*.parquet
                                          │
                                ┌─────────┴──────────┐
                                ▼                    ▼
                     python -m src.profiler    _load_catalog_data()
                                │                    │
                                ▼                    ▼
                  /data/src_data/metadata/     /catalog page
                     profiles.json            (categories + tables)
                                │
                     ┌──────────┴──────────┐
                     ▼                     ▼
              /api/catalog/profile/   _load_data_stats()
               (per-table stats,      (header: "9 tables,
                columns, alerts,       ~217K rows total")
                relationships)

Key files and their roles:

File	Location	Purpose
`data_description.md`	Instance repo	Table definitions, folder_mapping (bucket→category), foreign_keys
`instance.yaml`	Instance repo	Catalog category labels, icons, display order
`*.parquet`	`/data/src_data/parquet/`	Actual data files (flat or in subfolders)
`profiles.json`	`/data/src_data/metadata/`	Profiler output: statistics, alerts, relationships per table
`sync_state.json`	`/data/src_data/metadata/`	Sync process stats (optional; profiler provides fallback)

The folder_mapping serves a dual purpose: it maps table IDs to catalog categories for the UI, and it maps them to filesystem paths for the profiler. The profiler auto-detects flat layouts (all parquet files in one directory) versus subfolder layouts (Keboola-style parquet/<folder>/<table>.parquet).
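To see which layout a given parquet directory has, the distinction can be sketched in shell. This only illustrates the flat versus subfolder idea; the profiler's actual detection logic may differ, and `detect_layout` is an illustrative name.

```bash
# detect_layout PARQUET_DIR : print "flat", "subfolders", or "empty".
# Hypothetical helper; not the profiler's own implementation.
detect_layout() {
    local dir="$1"
    if [ -n "$(find "$dir" -maxdepth 1 -type f -name '*.parquet' | head -n 1)" ]; then
        echo "flat"          # parquet files directly in PARQUET_DIR
    elif [ -n "$(find "$dir" -mindepth 2 -maxdepth 2 -type f -name '*.parquet' | head -n 1)" ]; then
        echo "subfolders"    # parquet/<folder>/<table>.parquet
    else
        echo "empty"
    fi
}
```

The sample data generated in Step 6a with `--output /data/src_data/parquet` should report `flat` for `detect_layout /data/src_data/parquet`.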

6a: Generate Parquet Files

cd /opt/data-analyst/repo

# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker

# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
    --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet

Available sizes: xs (50 customers, ~1 MB), s (500, ~15 MB), m (5K, ~150 MB), l (50K, ~1.5 GB). See docs/sample-data.md for the full data model and built-in analytical patterns.

6b: Configure Data Catalog

The Data Catalog reads from two files in the instance repo:

config/data_description.md - YAML block with folder_mapping, tables (id, name, description, primary_key, sync_strategy, foreign_keys)
config/instance.yaml - catalog.categories with label, icon per category + catalog.order

The folder_mapping maps bucket prefixes from table IDs to category names. Example: table ID sample.sales.orders → bucket sample.sales → folder sales → category "Sales & Orders".
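The lookup chain in that example can be sketched as follows. It assumes the bucket is everything before the last dot of the table ID; the `bucket=folder` pairs stand in for folder_mapping entries, and both function names are illustrative. The webapp reads the real mapping from data_description.md.

```bash
# table_bucket ID : everything before the last dot,
# e.g. "sample.sales.orders" gives "sample.sales" (assumed convention).
table_bucket() {
    echo "${1%.*}"
}

# lookup_folder BUCKET PAIR... : each PAIR is "bucket=folder", mirroring
# folder_mapping. Prints the folder for BUCKET; fails if BUCKET is unmapped.
# Hypothetical helper for illustration only.
lookup_folder() {
    local bucket="$1"
    shift
    local pair
    for pair in "$@"; do
        if [ "${pair%%=*}" = "$bucket" ]; then
            echo "${pair#*=}"
            return 0
        fi
    done
    return 1
}
```

For example, `lookup_folder "$(table_bucket sample.sales.orders)" sample.customers=customers sample.sales=sales` prints `sales`, which instance.yaml then labels "Sales & Orders". A table whose bucket has no folder_mapping entry makes the lookup fail.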

Tables with foreign_keys will show interactive relationship diagrams in the profiler modal.

Add data_description.md to the instance repo with the sample tables:

cd /opt/data-analyst/instance

# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with:
#   folder_mapping:  { "bucket.prefix": "category_key", ... }
#   tables:          list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
#   foreign_keys:
#     - column: "customer_id"
#       references: "customers.customer_id"
#       description: "Ordering customer"

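# Add catalog categories to instance.yaml.
# Step 3b already wrote an empty `catalog:` block at the end of the file; remove it
# first so the file does not end up with two top-level `catalog:` keys.
# (Assumes `catalog:` is the last block, as in the Step 3b template.)
sed -i '/^catalog:/,$d' config/instance.yaml
cat >> config/instance.yaml << 'YAML'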

catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML

git add -A && git commit -m "Add sample data catalog" && git push origin main

Then symlink and restart:

# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
       /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

6c: Run Data Profiler

The profiler reads parquet files + data_description.md and generates profiles.json with per-table statistics, column analysis, data quality alerts, and relationship maps.

cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler

Output: /data/src_data/metadata/profiles.json (auto-created, readable by webapp).

The profiler provides:

Overview: row count, column count, file size, date coverage, missing cell %
Columns: type distribution, top values, histograms for numeric columns
Insights: data quality alerts (high missing %, imbalanced categories, high cardinality)
Relationships: FK diagram built from foreign_keys in data_description.md
Sample: first 5 rows of the table

Without sync_state.json (no data adapter running), the profiler computes file sizes directly from parquet files, and the catalog header derives table/row counts from profiles.json.

To re-run after data changes:

cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request
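To decide whether a re-run is needed, file modification times can be compared. This is a sketch with an illustrative function name (`profiles_stale`); the repo may provide its own mechanism.

```bash
# profiles_stale PARQUET_DIR PROFILES_JSON : succeed (exit 0) if PROFILES_JSON
# is missing or any .parquet file under PARQUET_DIR is newer than it.
# Hypothetical helper for illustration only.
profiles_stale() {
    [ -f "$2" ] || return 0
    [ -n "$(find "$1" -type f -name '*.parquet' -newer "$2" | head -n 1)" ]
}
```

Example: `profiles_stale /data/src_data/parquet /data/src_data/metadata/profiles.json && echo "re-run the profiler"`. The same test could gate the profiler command in a cron job.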

Step 6 Checklist

#	Check	Expected
6.1	Parquet files	`ls /data/src_data/parquet/*.parquet` shows 9 files
6.2	Permissions	Files owned by root:data-ops, group-readable
6.3	Data Catalog	`/catalog` page shows 6 categories with 9 tables
6.4	Catalog header	"9 tables, ~217K rows total" (from profiles.json)
6.5	Profile modal	Click "Profile" on any table → statistics, columns, insights
6.6	Relationships	Orders profile → shows customers, order_items, payments links
6.7	File sizes	Profile overview shows non-zero file size (e.g., 0.69 MB)
6.8	Analyst sync	Analyst can rsync parquet files to local machine
6.9	DuckDB loads	`SELECT count(*) FROM read_parquet('orders.parquet')` returns rows

Step 7: Real Data Source (Production)

When ready, replace sample data with a real data source adapter in instance/config/instance.yaml:

data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"

Add the token to .env and create config/data_description.md with table schemas.

Other planned adapters: BigQuery, CSV import.

Deployment Workflow (Ongoing)

Update OSS code

cd /opt/data-analyst/repo && git pull
bash server/deploy.sh   # restarts services, syncs scripts/docs

Update instance config

cd /opt/data-analyst/instance && git pull
systemctl restart webapp  # picks up new instance.yaml via symlink

Both at once

cd /opt/data-analyst/instance && git pull
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh   # run from the OSS repo, where server/deploy.sh lives

Server Layout Summary

/opt/data-analyst/
├── repo/           -> git@github-oss:ORG/OSS_REPO.git
├── instance/       -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env            # Secrets (not in git)
├── .venv/          # Python
└── logs/           # App logs

/root/.ssh/
├── deploy_key      # For OSS repo (github-oss alias)
├── instance_key    # For instance repo (github-cfg alias)
└── config          # Maps aliases to keys

Symlinks:
  repo/config/instance.yaml -> instance/config/instance.yaml
  repo/docs/data_description.md -> instance/config/data_description.md (optional)

Quick Verification

# Health check
curl -s http://YOUR_IP/health | python3 -m json.tool

# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200

# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'
