Petr 7f61ae8772 Update auto-install docs with Data Catalog setup

- Split Step 6 into 6a (Generate Parquet) and 6b (Configure Data Catalog)
- Document data_description.md + instance.yaml catalog categories
- Uncomment data_description.md symlink in Step 3c
- Add Data Catalog verification to Step 6 checklist

2026-03-10 22:00:28 +01:00


Automated Installation Guide

Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.

Two repos are involved:

OSS repo (public/private): application code (padak/tmp_oss)
Instance repo (private): your config, secrets template, data schema (padak/tmp_oss_cfg)

Architecture on Server

/opt/data-analyst/
├── repo/              # OSS repo clone
│   ├── config/
│   │   └── instance.yaml -> ../../instance/config/instance.yaml  (symlink)
│   ├── webapp/
│   ├── server/
│   └── ...
├── instance/          # Private instance repo clone
│   ├── config/
│   │   ├── instance.yaml          # Branding, auth domains, data source
│   │   └── data_description.md    # Data schema (when configured)
│   ├── docs/setup/                # Custom CLAUDE.md template, etc.
│   ├── .env.example               # Secrets template
│   └── README.md
├── .env               # Secrets (not in git, from .env.example)
├── .venv/             # Python virtual environment
└── logs/              # Application logs

Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them.

Prerequisites

DigitalOcean API token with ssh_key scope (needed for Step 1a only; any other Ubuntu 24.04 VM works if you skip droplet creation)
Two GitHub repos: one for OSS code, one for private instance config
SSH key on your local machine for server access

Known Issues

python3-venv must be installed before server/setup.sh (Ubuntu 24.04 omits it)
webapp-setup.sh generates SSL nginx config - use HTTP-only for IP-only deployments
DigitalOcean cloud-init cannot override password expiry; must use ssh_keys API field

Step 0: Create Repos

# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main

# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup

Step 1: Provision VM

1a: Create Droplet (DigitalOcean)

# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
    "https://api.digitalocean.com/v2/account/keys"

# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{
      "name":"data-analyst-1",
      "size":"s-1vcpu-2gb",
      "region":"ams3",
      "image":"ubuntu-24-04-x64",
      "ssh_keys":["KEY_ID_OR_FINGERPRINT"]
    }' \
    "https://api.digitalocean.com/v2/droplets"

1b: Install Prerequisites

ssh root@DROPLET_IP

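# Auto-updates run on first boot and may hold the dpkg lock;
# DPkg::Lock::Timeout makes apt wait (here up to 5 minutes) instead of failing
apt update && apt -o DPkg::Lock::Timeout=300 install -y python3.12-venv python3-pip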
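A freshly created VM may refuse SSH connections or hold package locks for the first minute or two. The following is a minimal sketch of a generic retry helper for that window; the function name `retry` and its argument order are illustrative and not part of the project.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND...
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# (Hypothetical helper, not shipped with the OSS repo.)
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, `retry 30 10 ssh -o ConnectTimeout=5 root@DROPLET_IP true` would wait up to roughly five minutes for SSH to come up before you continue.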

1c: Generate Deploy Keys

Two separate keys - one per repo, for security isolation:

# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"

# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"

Add each public key as a deploy key on its respective GitHub repo:

deploy_key.pub -> OSS repo Settings > Deploy Keys
instance_key.pub -> Instance repo Settings > Deploy Keys

Configure SSH to use the right key per repo:

cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
  HostName github.com
  IdentityFile /root/.ssh/deploy_key
  StrictHostKeyChecking accept-new

# Instance config repo (private)
Host github-cfg
  HostName github.com
  IdentityFile /root/.ssh/instance_key
  StrictHostKeyChecking accept-new
EOF
chmod 600 /root/.ssh/config
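Before cloning, it can help to confirm that each alias presents a key GitHub accepts. The sketch below relies on the greeting GitHub prints for `ssh -T`; the function name `check_deploy_key` is hypothetical. Note that GitHub ends such sessions with a non-zero exit status even on success, so the text is inspected rather than the exit code.

```bash
# check_deploy_key ALIAS
# ALIAS is a Host entry from /root/.ssh/config (github-oss or github-cfg).
# (Hypothetical helper, not shipped with the OSS repo.)
check_deploy_key() {
    local alias_name="$1" out
    # BatchMode prevents an interactive password prompt if the key is rejected.
    out="$(ssh -T -o BatchMode=yes -o ConnectTimeout=10 "git@${alias_name}" 2>&1 || true)"
    case "$out" in
        *"successfully authenticated"*)
            echo "ok: ${alias_name}"
            return 0
            ;;
        *)
            echo "FAIL: ${alias_name}: ${out}" >&2
            return 1
            ;;
    esac
}
```

Run it as `check_deploy_key github-oss && check_deploy_key github-cfg`. For deploy keys the greeting should name the repository the key is attached to, which is a quick way to spot two keys that were swapped.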

1d: Clone OSS Repo & Run Setup

git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh

Step 1 Checklist

#	Check	Expected
1.1	Groups	data-ops, dataread, data-private exist
1.2	Deploy user	uid deploy, groups: deploy, data-ops
1.3	Directories	/opt/data-analyst/{repo,.venv,logs}
1.4	Python venv	Flask loads in .venv
1.5	Scripts	add-analyst, list-analysts in /usr/local/bin
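The checklist above can be scripted. This is a rough sketch, assuming the group, user, and script names listed in the table; the function name `check_step1` is hypothetical, and the application directory is a parameter so the check can be pointed elsewhere.

```bash
# check_step1 [APP_DIR]
# Prints one line per failed check and returns non-zero if any check failed.
# (Hypothetical helper, not shipped with the OSS repo.)
check_step1() {
    local app_dir="${1:-/opt/data-analyst}" rc=0 name
    # 1.1 Groups
    for name in data-ops dataread data-private; do
        getent group "$name" > /dev/null 2>&1 || { echo "missing group: $name"; rc=1; }
    done
    # 1.2 Deploy user
    id deploy > /dev/null 2>&1 || { echo "missing user: deploy"; rc=1; }
    # 1.3 Directories
    for name in repo .venv logs; do
        [ -d "$app_dir/$name" ] || { echo "missing directory: $app_dir/$name"; rc=1; }
    done
    # 1.4 Python venv can import Flask
    "$app_dir/.venv/bin/python" -c 'import flask' > /dev/null 2>&1 \
        || { echo "Flask does not import in $app_dir/.venv"; rc=1; }
    # 1.5 Helper scripts
    for name in add-analyst list-analysts; do
        [ -x "/usr/local/bin/$name" ] || { echo "missing script: $name"; rc=1; }
    done
    return "$rc"
}
```

Calling `check_step1` with no arguments checks the default layout; silence plus a zero exit status means every row passed.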

Step 2: Webapp Setup

2a: Run webapp-setup.sh

export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh

For IP-only deployments (no SSL), replace the generated nginx config with an HTTP-only one:

cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
    listen 80;
    server_name _;
    location / {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
    location /static/ {
        alias /opt/data-analyst/repo/webapp/static/;
        expires 1d;
    }
    location /health {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        access_log off;
    }
}
NGINX
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx

2b: Create .env

SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')

cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF

chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
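Because the heredoc above expands `${SECRET_KEY}` at write time, an unset variable silently produces an empty secret. A small sanity check, sketched here with the hypothetical name `check_env_keys`, can catch that along with missing keys:

```bash
# check_env_keys FILE
# Verifies that every key written in Step 2b is present and that the
# secret key is not empty. (Hypothetical helper, not part of the project.)
check_env_keys() {
    local file="$1" key rc=0
    for key in WEBAPP_SECRET_KEY SERVER_HOST SERVER_HOSTNAME \
               GOOGLE_CLIENT_ID GOOGLE_CLIENT_SECRET DATA_SOURCE DATA_DIR; do
        if ! grep -q "^${key}=" "$file"; then
            echo "missing: $key"
            rc=1
        fi
    done
    # An empty value means $SECRET_KEY was unset when the heredoc was expanded.
    if grep -q '^WEBAPP_SECRET_KEY=""' "$file"; then
        echo "empty: WEBAPP_SECRET_KEY"
        rc=1
    fi
    return "$rc"
}
```

Usage: `check_env_keys /opt/data-analyst/.env`.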

2c: Create Data Directories & Start

mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data

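# Note: /run is a tmpfs, so this directory is lost on reboot unless the
# webapp systemd unit recreates it (e.g. via RuntimeDirectory=)
mkdir -p /run/webapp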
chown www-data:www-data /run/webapp

systemctl daemon-reload
systemctl start webapp
systemctl enable webapp

Step 2 Checklist

#	Check	Expected
2.1	Nginx	active, port 80
2.2	Webapp	active (gunicorn)
2.3	Health	`curl http://IP/health` returns JSON
2.4	Login page	HTTP 200 at /login

Step 3: Instance Configuration (Private Repo)

3a: Clone Instance Repo

git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance

3b: Initialize Instance Config (if empty repo)

If this is a fresh instance repo, create the initial config:

cd /opt/data-analyst/instance
mkdir -p config docs/setup

cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"

server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"

auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"

data_source:
  type: "local"

catalog:
  categories: {}
YAML

# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV

cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI

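# git commit needs an author identity; set one if this server has none configured
git config user.name "YOUR_NAME"
git config user.email "YOUR_EMAIL"
git add -A && git commit -m "Initial instance config" && git push origin main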

3c: Symlink Config into OSS Repo

# Remove any existing instance.yaml (from manual setup) and symlink.
# A relative target matches the layout in "Architecture on Server" and the Step 3 checklist.
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s ../../instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml

# Symlink data_description.md for the Data Catalog (the target file is created in Step 6b; the link dangles until then)
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

Step 3 Checklist

#	Check	Expected
3.1	Instance repo	/opt/data-analyst/instance/ exists
3.2	Symlink	config/instance.yaml -> ../../instance/config/instance.yaml
3.3	Webapp loads	Instance name shown on login page
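Check 3.2 can be automated with a small helper that tells apart a missing link, a dangling link, and a link to the wrong file. This is a sketch; `check_symlink` is a hypothetical name, and it works for both relative and absolute link targets because both sides are resolved with `readlink -f`.

```bash
# check_symlink LINK EXPECTED_TARGET
# (Hypothetical helper, not shipped with the OSS repo.)
check_symlink() {
    local link="$1" expected="$2"
    if [ ! -L "$link" ]; then
        echo "not a symlink: $link"
        return 1
    fi
    # -e follows the link, so it fails when the target does not exist.
    if [ ! -e "$link" ]; then
        echo "dangling symlink: $link -> $(readlink "$link")"
        return 1
    fi
    if [ "$(readlink -f "$link")" != "$(readlink -f "$expected")" ]; then
        echo "wrong target: $link -> $(readlink -f "$link")"
        return 1
    fi
    echo "ok: $link"
}
```

Usage: `check_symlink /opt/data-analyst/repo/config/instance.yaml /opt/data-analyst/instance/config/instance.yaml`. The same call with the two `data_description.md` paths is expected to report a dangling link until Step 6b has been completed.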

Step 4: Authentication

Email magic link works without any external service.

Login page shows "Sign in with Email"
User enters email with allowed domain
Without SMTP: magic link shown in browser (dev mode)
With SMTP: link sent via email
Click link -> logged in -> dashboard

Optional: add Google OAuth by setting real GOOGLE_CLIENT_ID/GOOGLE_CLIENT_SECRET.

Step 4 Checklist

#	Check	Expected
4.1	Email auth	"Sign in with Email" on login page
4.2	Magic link	Generated for valid domain email
4.3	Domain check	Rejects wrong domains
4.4	Login flow	Magic link -> dashboard with session

Step 5: Onboarding Flow (End-User)

After the server is set up, analysts self-onboard via the webapp:

Visit http://YOUR_SERVER/login and sign in with email
Dashboard shows "Get Started" with 4 steps:
- Create project folder (mkdir -p data-analyst && cd data-analyst)
- Generate SSH key (ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N '')
- Copy public key (cat ~/.ssh/data_analyst_server.pub)
- Paste key into form, click "Create Account"
After account creation, dashboard shows "Set up your local environment"
User runs claude in their project folder, pastes setup instructions
Claude Code configures SSH, rsyncs data, sets up Python + DuckDB
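The setup instructions automate the data sync, but a manual equivalent is useful for debugging. The sketch below is an assumption about what that sync might look like, not the project's actual script: the function name `sync_parquet`, the local destination, and the analyst user name are illustrative, while the key path and server directory are taken from this guide.

```bash
# sync_parquet USER HOST [DEST]
# Copies the server's Parquet files to a local folder over SSH.
# (Hypothetical helper; the real setup instructions may differ.)
sync_parquet() {
    local user="$1" host="$2" dest="${3:-./data/parquet}"
    mkdir -p "$dest"
    # --partial keeps partially transferred files so an interrupted sync can resume.
    rsync -av --partial \
        -e "ssh -i $HOME/.ssh/data_analyst_server" \
        "${user}@${host}:/data/src_data/parquet/" "$dest/"
}
```

Usage: `sync_parquet YOUR_USER YOUR_SERVER`. If this fails with a permission error, the account was probably not added to a group that can read `/data`.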

Step 6: Sample Data (Try Without a Data Adapter)

Before connecting a real data source, you can load sample data to verify the full pipeline (Parquet files, Data Catalog, analyst rsync, Claude Code analysis).

6a: Generate Parquet Files

cd /opt/data-analyst/repo

# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker

# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
    --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet

Available sizes: xs (50 customers, ~1 MB), s (500, ~15 MB), m (5K, ~150 MB), l (50K, ~1.5 GB). See docs/sample-data.md for the full data model and built-in analytical patterns.

6b: Configure Data Catalog

The Data Catalog reads from two files in the instance repo:

config/data_description.md - table definitions with YAML block (tables, folder_mapping)
config/instance.yaml - catalog categories (label, icon, order)

Add data_description.md to the instance repo with the sample tables:

cd /opt/data-analyst/instance

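# Create data_description.md, starting from the example shipped in the OSS repo,
# then edit it: it must contain a ```yaml block with folder_mapping + tables list
cp /opt/data-analyst/repo/config/data_description.md.example config/data_description.md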

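# Add catalog categories to instance.yaml.
# Step 3b already wrote an empty catalog: block at the end of the file; delete it
# first so the key is not defined twice (assumes it is still the last block).
sed -i '/^catalog:/,$d' config/instance.yaml
cat >> config/instance.yaml << 'YAML'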

catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML

git add -A && git commit -m "Add sample data catalog" && git push origin main
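Since the instance.yaml created in Step 3b already contains a `catalog:` block, it is worth confirming that the file ends up with exactly one. YAML loaders differ in how they treat duplicate keys (some keep the last, some reject the file), so a quick count is a cheap safeguard. The helper name `count_top_level_key` is hypothetical.

```bash
# count_top_level_key FILE KEY
# Prints how many times KEY appears as an unindented (top-level) YAML key.
# (Hypothetical helper; a plain text count, not a YAML parser.)
count_top_level_key() {
    local file="$1" key="$2"
    # grep -c exits non-zero when the count is 0; "|| true" keeps that from
    # aborting scripts that run under "set -e".
    grep -c "^${key}:" "$file" || true
}
```

Usage: `[ "$(count_top_level_key config/instance.yaml catalog)" = "1" ] || echo "catalog: is missing or duplicated"`.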

Then symlink and restart:

# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
       /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

Step 6 Checklist

#	Check	Expected
6.1	Parquet files	`ls /data/src_data/parquet/*.parquet` shows 9 files
6.2	Permissions	Files owned by root:data-ops, group-readable
6.3	Data Catalog	`/catalog` page shows 6 categories with 9 tables
6.4	Analyst sync	Analyst can rsync parquet files to local machine
6.5	DuckDB loads	`SELECT count(*) FROM read_parquet('orders.parquet')` returns rows
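Checks 6.1 and 6.2 lend themselves to a script. The following sketch counts the Parquet files and flags any that are not group-readable; the function name `check_parquet_dir` is hypothetical, and the expected count of 9 comes from the table above.

```bash
# check_parquet_dir DIR EXPECTED_COUNT
# (Hypothetical helper, not shipped with the OSS repo.)
check_parquet_dir() {
    local dir="$1" expected="$2" count unreadable
    count="$(find "$dir" -maxdepth 1 -type f -name '*.parquet' | wc -l | tr -d ' ')"
    # Files lacking the group-read bit cannot be read by data-ops members.
    unreadable="$(find "$dir" -maxdepth 1 -type f -name '*.parquet' ! -perm -g=r | wc -l | tr -d ' ')"
    if [ "$count" -ne "$expected" ]; then
        echo "expected $expected parquet files in $dir, found $count"
        return 1
    fi
    if [ "$unreadable" -ne 0 ]; then
        echo "$unreadable parquet file(s) in $dir are not group-readable"
        return 1
    fi
    echo "ok: $count parquet files"
}
```

Usage: `check_parquet_dir /data/src_data/parquet 9`.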

Step 7: Real Data Source (Production)

When ready, replace sample data with a real data source adapter in instance/config/instance.yaml:

data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"

Add the token to .env and create config/data_description.md with table schemas.
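Adding the token by hand-editing `.env` works, but an idempotent helper avoids duplicate entries when the token is rotated. This is a minimal sketch under the assumption that values contain no double quotes; `set_env_var` is a hypothetical name.

```bash
# set_env_var FILE KEY VALUE
# Replaces KEY's line in FILE if present, otherwise appends it.
# (Hypothetical helper; assumes VALUE contains no double quotes.)
set_env_var() {
    local file="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)"
    # Keep every line except an existing assignment for KEY.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    # Write back through the existing file so it keeps its owner and mode.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Usage: `set_env_var /opt/data-analyst/.env KEBOOLA_STORAGE_TOKEN "YOUR_TOKEN"`, followed by `systemctl restart webapp`. Keep in mind that a token passed on the command line may end up in shell history.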

Other planned adapters: BigQuery, CSV import.

Deployment Workflow (Ongoing)

Update OSS code

cd /opt/data-analyst/repo && git pull
bash server/deploy.sh   # restarts services, syncs scripts/docs

Update instance config

cd /opt/data-analyst/instance && git pull
systemctl restart webapp  # picks up new instance.yaml via symlink

Both at once

cd /opt/data-analyst/instance && git pull
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh   # must run from the OSS repo, where server/ lives

Server Layout Summary

/opt/data-analyst/
├── repo/           -> git@github-oss:ORG/OSS_REPO.git
├── instance/       -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env            # Secrets (not in git)
├── .venv/          # Python
└── logs/           # App logs

/root/.ssh/
├── deploy_key      # For OSS repo (github-oss alias)
├── instance_key    # For instance repo (github-cfg alias)
└── config          # Maps aliases to keys

Symlinks:
  repo/config/instance.yaml -> instance/config/instance.yaml
  repo/docs/data_description.md -> instance/config/data_description.md (optional)

Quick Verification

# Health check
curl -s http://YOUR_IP/health | python3 -m json.tool

# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200

# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'
