Automated Installation Guide

Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.

Two repos are involved:

  • OSS repo (public/private): application code (padak/tmp_oss)
  • Instance repo (private): your config, secrets template, data schema (padak/tmp_oss_cfg)

Architecture on Server

/opt/data-analyst/
├── repo/              # OSS repo clone
│   ├── config/
│   │   └── instance.yaml -> ../../instance/config/instance.yaml  (symlink)
│   ├── webapp/
│   ├── server/
│   └── ...
├── instance/          # Private instance repo clone
│   ├── config/
│   │   ├── instance.yaml          # Branding, auth domains, data source
│   │   └── data_description.md    # Data schema (when configured)
│   ├── docs/setup/                # Custom CLAUDE.md template, etc.
│   ├── .env.example               # Secrets template
│   └── README.md
├── .env               # Secrets (not in git, from .env.example)
├── .venv/             # Python virtual environment
└── logs/              # Application logs

Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them.

Prerequisites

  1. DigitalOcean API token with ssh_key scope (not needed if you bring your own Ubuntu 24.04 VM and skip Step 1a)
  2. Two GitHub repos: one for OSS code, one for private instance config
  3. SSH key on your local machine for server access

Known Issues

  • python3-venv must be installed before running server/setup.sh (the stock Ubuntu 24.04 image does not ship it)
  • webapp-setup.sh generates an SSL nginx config; for IP-only deployments, replace it with the HTTP-only config from Step 2a
  • DigitalOcean cloud-init cannot override password expiry; inject the SSH key through the ssh_keys API field instead

Step 0: Create Repos

# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main

# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup

Step 1: Provision VM

1a: Create Droplet (DigitalOcean)

# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
    "https://api.digitalocean.com/v2/account/keys"

# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $DO_TOKEN" \
    -d '{
      "name":"data-analyst-1",
      "size":"s-1vcpu-2gb",
      "region":"ams3",
      "image":"ubuntu-24-04-x64",
      "ssh_keys":["KEY_ID_OR_FINGERPRINT"]
    }' \
    "https://api.digitalocean.com/v2/droplets"

1b: Install Prerequisites

ssh root@DROPLET_IP

# First-boot auto-updates may hold the apt lock; if apt reports a lock error,
# wait a minute and re-run the command
apt update && apt install -y python3.12-venv python3-pip

1c: Generate Deploy Keys

Two separate keys - one per repo, for security isolation:

# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"

# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"

Add each public key as a deploy key on its respective GitHub repo:

  • deploy_key.pub -> OSS repo Settings > Deploy Keys
  • instance_key.pub -> Instance repo Settings > Deploy Keys

Configure SSH to use the right key per repo:

cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
  HostName github.com
  IdentityFile /root/.ssh/deploy_key
  StrictHostKeyChecking no

# Instance config repo (private)
Host github-cfg
  HostName github.com
  IdentityFile /root/.ssh/instance_key
  StrictHostKeyChecking no
EOF
chmod 600 /root/.ssh/config

1d: Clone OSS Repo & Run Setup

git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh

Step 1 Checklist

| # | Check | Expected |
|---|-------|----------|
| 1.1 | Groups | data-ops, dataread, data-private exist |
| 1.2 | Deploy user | uid deploy, groups: deploy, data-ops |
| 1.3 | Directories | /opt/data-analyst/{repo,.venv,logs} |
| 1.4 | Python venv | Flask loads in .venv |
| 1.5 | Scripts | add-analyst, list-analysts in /usr/local/bin |

Step 2: Webapp Setup

2a: Run webapp-setup.sh

export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh

For IP-only deployments (no SSL), replace the generated nginx config:

cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
    listen 80;
    server_name _;
    location / {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
    location /static/ {
        alias /opt/data-analyst/repo/webapp/static/;
        expires 1d;
    }
    location /health {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        access_log off;
    }
}
NGINX
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx

2b: Create .env

SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')

cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF

chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
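
Before starting the service, it can help to confirm that the file contains every variable written above. The following is a minimal sketch and not part of the repo: `parse_env` and `missing_keys` are hypothetical helper names, and the parser only handles the simple KEY="value" lines used in this guide.

```python
# Keys written to /opt/data-analyst/.env in Step 2b.
REQUIRED_KEYS = (
    "WEBAPP_SECRET_KEY",
    "SERVER_HOST",
    "SERVER_HOSTNAME",
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
    "DATA_SOURCE",
    "DATA_DIR",
)


def parse_env(text):
    """Parse simple KEY="value" lines; blank lines and # comments are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in declaration order."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Calling `missing_keys(open("/opt/data-analyst/.env").read())` on the server should return an empty list if Step 2b completed.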

2c: Create Data Directories & Start

mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data

mkdir -p /run/webapp
chown www-data:www-data /run/webapp

systemctl daemon-reload
systemctl start webapp
systemctl enable webapp

Step 2 Checklist

| # | Check | Expected |
|---|-------|----------|
| 2.1 | Nginx | active, port 80 |
| 2.2 | Webapp | active (gunicorn) |
| 2.3 | Health | curl http://IP/health returns JSON |
| 2.4 | Login page | HTTP 200 at /login |
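
Checks 2.3 and 2.4 can also be scripted. The sketch below uses only the Python standard library; `http_status` and `run_checks` are hypothetical names, and the only facts taken from this guide are the two paths and the expected status 200.

```python
import urllib.error
import urllib.request

# Paths and expected status codes from checklist items 2.3 and 2.4.
EXPECTED = {"/health": 200, "/login": 200}


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except (urllib.error.URLError, OSError):
        return None


def run_checks(base_url, fetch=http_status):
    """Map each path to True/False depending on whether its status matches."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == code for path, code in EXPECTED.items()}
```

`run_checks("http://YOUR_IP")` returns a dict such as `{"/health": True, "/login": True}` when both endpoints respond as expected. The `fetch` parameter exists so the logic can be exercised without a running server.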

Step 3: Instance Configuration (Private Repo)

3a: Clone Instance Repo

git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance

3b: Initialize Instance Config (if empty repo)

If this is a fresh instance repo, create the initial config:

cd /opt/data-analyst/instance
mkdir -p config docs/setup

cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"

server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"

auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"

data_source:
  type: "local"

catalog:
  categories: {}
YAML

# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV

cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI

git add -A && git commit -m "Initial instance config" && git push origin main

# Remove any existing instance.yaml (from manual setup) and symlink
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml

# Symlink data_description.md (for Data Catalog - add when ready in Step 6)
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

Step 3 Checklist

| # | Check | Expected |
|---|-------|----------|
| 3.1 | Instance repo | /opt/data-analyst/instance/ exists |
| 3.2 | Symlink | config/instance.yaml -> ../../instance/config/instance.yaml |
| 3.3 | Webapp loads | Instance name shown on login page |
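
Check 3.2 can be verified programmatically. This is a small sketch with a hypothetical helper name (`symlink_points_to`); it treats a dangling link as a failure, which matters for data_description.md because that symlink is created before the file exists.

```python
import os


def symlink_points_to(link_path, expected_target):
    """True if link_path is a symlink that resolves to an existing expected_target."""
    if not os.path.islink(link_path):
        return False
    # os.path.exists follows the link, so a dangling symlink returns False here.
    if not os.path.exists(link_path):
        return False
    return os.path.realpath(link_path) == os.path.realpath(expected_target)
```

On the server, `symlink_points_to("/opt/data-analyst/repo/config/instance.yaml", "/opt/data-analyst/instance/config/instance.yaml")` should return True after Step 3.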

Step 4: Authentication

Email magic link works without any external service.

  1. Login page shows "Sign in with Email"
  2. User enters email with allowed domain
  3. Without SMTP: magic link shown in browser (dev mode)
  4. With SMTP: link sent via email
  5. Click link -> logged in -> dashboard

Optional: add Google OAuth by setting real GOOGLE_CLIENT_ID/GOOGLE_CLIENT_SECRET.
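
This guide does not describe how the webapp generates or verifies magic links. The sketch below only illustrates the general idea behind steps 2 to 5, using one common approach (an HMAC-signed token with an expiry). All function names, the token layout, and the 15-minute lifetime are assumptions for illustration, not the application's implementation.

```python
import base64
import hashlib
import hmac
import time


def email_allowed(email, allowed_domain):
    """Domain check: accept only addresses under allowed_domain."""
    local, sep, domain = email.strip().lower().rpartition("@")
    return bool(local) and sep == "@" and domain == allowed_domain.lower()


def _sign(payload, secret):
    return hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()


def make_token(email, secret, now=None, ttl=900):
    """Return 'payload.signature'; the payload encodes the email and expiry time."""
    expires = int(time.time() if now is None else now) + ttl
    payload = base64.urlsafe_b64encode(f"{email}|{expires}".encode()).decode()
    return f"{payload}.{_sign(payload, secret)}"


def verify_token(token, secret, now=None):
    """Return the email if the signature is valid and unexpired, else None."""
    payload, sep, signature = token.rpartition(".")
    if not sep or not hmac.compare_digest(signature, _sign(payload, secret)):
        return None
    try:
        decoded = base64.urlsafe_b64decode(payload.encode()).decode()
        email, expires = decoded.rsplit("|", 1)
        expires = int(expires)
    except ValueError:
        return None
    if int(time.time() if now is None else now) > expires:
        return None
    return email
```

In this sketch the secret would play the role of WEBAPP_SECRET_KEY, and the token would be appended to the login URL that is either shown in the browser (no SMTP) or sent by email.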

Step 4 Checklist

| # | Check | Expected |
|---|-------|----------|
| 4.1 | Email auth | "Sign in with Email" on login page |
| 4.2 | Magic link | Generated for valid domain email |
| 4.3 | Domain check | Rejects wrong domains |
| 4.4 | Login flow | Magic link -> dashboard with session |

Step 5: Onboarding Flow (End-User)

After the server is set up, analysts self-onboard via the webapp:

  1. Visit http://YOUR_SERVER/login and sign in with email
  2. Dashboard shows "Get Started" with 4 steps:
    • Create project folder (mkdir -p data-analyst && cd data-analyst)
    • Generate SSH key (ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N '')
    • Copy public key (cat ~/.ssh/data_analyst_server.pub)
    • Paste key into form, click "Create Account"
  3. After account creation, dashboard shows "Set up your local environment"
  4. User runs claude in their project folder, pastes setup instructions
  5. Claude Code configures SSH, rsyncs data, sets up Python + DuckDB

Step 6: Sample Data (Try Without a Data Adapter)

Before connecting a real data source, you can load sample data to verify the full pipeline (Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).

How the Data Catalog & Profiler Pipeline Works

Instance repo                        Server filesystem            Webapp
─────────────                        ────────────────            ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                │
   foreign_keys)                          │
                                          ▼
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                    │
   labels, icons, order)                  │
                                          ▼
                          /data/src_data/parquet/*.parquet
                                          │
                                ┌─────────┴──────────┐
                                ▼                    ▼
                     python -m src.profiler    _load_catalog_data()
                                │                    │
                                ▼                    ▼
                  /data/src_data/metadata/     /catalog page
                     profiles.json            (categories + tables)
                                │
                     ┌──────────┴──────────┐
                     ▼                     ▼
              /api/catalog/profile/   _load_data_stats()
               (per-table stats,      (header: "9 tables,
                columns, alerts,       ~217K rows total")
                relationships,
                used_by_metrics)

docs/metrics/*/*.yml ──────────────> _load_metrics_data()
  (metric definitions,                     │
   SQL examples,                           ▼
   dimensions)              /catalog "Business Metrics" card
                            /api/metrics/<path> (modal detail)

Key files and their roles:

| File | Location | Purpose |
|------|----------|---------|
| data_description.md | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys |
| instance.yaml | Instance repo | Catalog category labels, icons, display order |
| *.parquet | /data/src_data/parquet/ | Actual data files (flat or in subfolders) |
| profiles.json | /data/src_data/metadata/ | Profiler output: statistics, alerts, relationships per table |
| sync_state.json | /data/src_data/metadata/ | Sync process stats (optional; profiler provides fallback) |
| docs/metrics/*/*.yml | OSS repo (sample) or instance repo (production) | Business metric definitions with SQL examples |

The folder mapping serves a dual purpose: it maps table IDs to catalog categories for the UI and to filesystem paths for the profiler. The profiler auto-detects flat layouts (all parquet files in one directory) versus subfolder layouts (Keboola-style parquet/<folder>/<table>.parquet).
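
The profiler's actual detection logic is not shown in this guide. As a rough sketch of what supporting both layouts can look like, the hypothetical helper below tries the subfolder path first and falls back to the flat one.

```python
from pathlib import Path


def locate_parquet(parquet_root, folder, table):
    """Return the path of <table>.parquet, trying the subfolder layout, then flat."""
    root = Path(parquet_root)
    candidates = (
        root / folder / f"{table}.parquet",  # Keboola-style: parquet/<folder>/<table>.parquet
        root / f"{table}.parquet",           # flat: all files directly in parquet/
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

For the sample data in this guide, `locate_parquet("/data/src_data/parquet", "sales", "orders")` would resolve to whichever of the two paths exists on disk.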

6a: Generate Parquet Files

cd /opt/data-analyst/repo

# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker

# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
    --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet

Available sizes: xs (50 customers, ~1 MB), s (500, ~15 MB), m (5K, ~150 MB), l (50K, ~1.5 GB). See docs/sample-data.md for the full data model and built-in analytical patterns.

6b: Configure Data Catalog

The Data Catalog reads from two files in the instance repo:

  1. config/data_description.md - YAML block with folder_mapping, tables (id, name, description, primary_key, sync_strategy, foreign_keys)
  2. config/instance.yaml - catalog.categories with label, icon per category + catalog.order

The folder_mapping maps bucket prefixes from table IDs to category names. Example: table ID sample.sales.orders → bucket sample.sales → folder sales → category "Sales & Orders".
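
The resolution chain in that example can be written out as a short sketch. The function name `resolve_category` is hypothetical, and it assumes the bucket is everything before the last dot of the table ID, which is what the example suggests.

```python
def resolve_category(table_id, folder_mapping, categories):
    """Follow table ID -> bucket -> folder -> category label."""
    # "sample.sales.orders" -> bucket "sample.sales", table "orders"
    bucket, _, table = table_id.rpartition(".")
    folder = folder_mapping.get(bucket)
    label = categories.get(folder, {}).get("label") if folder else None
    return {"bucket": bucket, "table": table, "folder": folder, "label": label}


# Values mirror the sample configuration in this guide.
folder_mapping = {"sample.sales": "sales"}
categories = {"sales": {"label": "Sales & Orders", "icon": "shopping-cart"}}
result = resolve_category("sample.sales.orders", folder_mapping, categories)
```

Here `result["label"]` is "Sales & Orders". A table whose bucket is missing from folder_mapping ends up with `folder` and `label` set to None in this sketch; how the webapp treats that case is not covered here.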

Tables with foreign_keys will show interactive relationship diagrams in the profiler modal.

Add data_description.md to the instance repo with the sample tables:

cd /opt/data-analyst/instance

# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with:
#   folder_mapping:  { "bucket.prefix": "category_key", ... }
#   tables:          list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
#   foreign_keys:
#     - column: "customer_id"
#       references: "customers.customer_id"
#       description: "Ordering customer"

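# Add catalog categories to instance.yaml.
# The file from Step 3b ends with an empty "catalog:" block; delete it first so
# the YAML does not end up with two top-level "catalog" keys. (Assumes catalog
# is the last block in the file, as in Step 3b.)
sed -i '/^catalog:/,$d' config/instance.yaml
cat >> config/instance.yaml << 'YAML'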

catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML

git add -A && git commit -m "Add sample data catalog" && git push origin main

Then symlink and restart:

# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
       /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp

6c: Business Metrics

The Data Catalog includes a Business Metrics card that dynamically renders metric definitions from YAML files. The OSS repo ships with 10 sample e-commerce metrics in docs/metrics/ (4 categories: revenue, customers, marketing, support) that align with the sample data generator tables.

How it works:

  • Webapp scans docs/metrics/*/*.yml (production: /data/docs/metrics/)
  • Each YAML file defines one metric with SQL examples, dimensions, and notes
  • The profiler links metrics to tables via used_by_metrics in profiles.json
  • Clicking a metric opens a modal with Overview, How to Use, SQL Examples, and Technical tabs

For sample data: metrics work out of the box - the OSS repo includes sample definitions.

For production: create metric YAMLs in the instance repo and deploy them to /data/docs/metrics/ on the server. The production path takes precedence over the OSS repo.

# Instance repo: create metric definitions
mkdir -p /opt/data-analyst/instance/docs/metrics/{revenue,operations}
# ... add your .yml files ...

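# Deploy metrics to server (copy the directory contents so that re-running
# does not create a nested /data/docs/metrics/metrics/)
mkdir -p /data/docs/metrics
cp -r /opt/data-analyst/instance/docs/metrics/. /data/docs/metrics/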
chown -R root:data-ops /data/docs/metrics
chmod -R 2775 /data/docs/metrics

Each metric YAML file follows this structure (list with one dict):

- name: metric_name
  display_name: Human Readable Name
  category: revenue          # must match parent directory name
  type: sum                  # sum, average, count_distinct, ratio
  unit: USD
  grain: monthly
  time_column: order_date
  table: orders              # primary table
  tables: [orders, customers]  # optional: all referenced tables
  expression: "SUM(total_amount)"
  description: "What this metric measures..."
  dimensions: [channel, region]
  notes: ["Important context..."]
  synonyms: [alias1, alias2]
  sql: |
    SELECT ... FROM ... GROUP BY ...    
  sql_by_channel: |           # any sql_* key is auto-discovered
    SELECT ... GROUP BY channel
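
Two rules from the structure above lend themselves to a quick pre-deploy check: the category must match the parent directory name, and any sql_* key counts as an SQL example. The sketch below works on a metric that has already been loaded into a dict (for example with a YAML parser); the function names are hypothetical, and the type list is taken from the inline comment and may not be exhaustive.

```python
from pathlib import PurePosixPath

# Types listed in the inline comment of the structure above (possibly incomplete).
KNOWN_TYPES = {"sum", "average", "count_distinct", "ratio"}


def sql_examples(metric):
    """Collect the main query plus every sql_* variant, keyed by field name."""
    return {k: v for k, v in metric.items() if k == "sql" or k.startswith("sql_")}


def check_metric(metric, yml_path):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    parent = PurePosixPath(yml_path).parent.name
    if metric.get("category") != parent:
        problems.append(f"category {metric.get('category')!r} != directory {parent!r}")
    if metric.get("type") not in KNOWN_TYPES:
        problems.append(f"unrecognized type {metric.get('type')!r}")
    if not sql_examples(metric):
        problems.append("no sql or sql_* field")
    return problems
```

Because each file is a list with one dict, the value to pass in is the first element of the parsed YAML document.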

6d: Run Data Profiler

The profiler reads parquet files + data_description.md and generates profiles.json with per-table statistics, column analysis, data quality alerts, and relationship maps.

cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler

Output: /data/src_data/metadata/profiles.json (auto-created, readable by webapp).

The profiler provides:

  • Overview: row count, column count, file size, date coverage, missing cell %
  • Columns: type distribution, top values, histograms for numeric columns
  • Insights: data quality alerts (high missing %, imbalanced categories, high cardinality)
  • Relationships: FK diagram built from foreign_keys in data_description.md, plus linked Business Metrics
  • Used by Metrics: shows which metric definitions reference this table (from docs/metrics/)
  • Sample: first 5 rows of the table

Without sync_state.json (no data adapter running), the profiler computes file sizes directly from parquet files, and the catalog header derives table/row counts from profiles.json.
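
As an illustration of how a header such as "9 tables, ~217K rows total" can be derived from profiler output, here is a sketch. The layout of profiles.json is not documented in this guide, so the shape used below (a "tables" mapping with a "row_count" per table) and the function name `catalog_header` are assumptions.

```python
def catalog_header(profiles):
    """Build a header string from an ASSUMED profiles layout:
    {"tables": {"<table>": {"row_count": <int>, ...}, ...}}
    """
    tables = profiles.get("tables", {})
    total = sum(table.get("row_count", 0) for table in tables.values())
    # Abbreviate to thousands once the total is large enough.
    rows = f"~{round(total / 1000)}K" if total >= 1000 else str(total)
    return f"{len(tables)} tables, {rows} rows total"
```

The real field names should be checked against an actual profiles.json before relying on this.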

To re-run after data changes:

cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request

Step 6 Checklist

| # | Check | Expected |
|---|-------|----------|
| 6.1 | Parquet files | ls /data/src_data/parquet/*.parquet shows 9 files |
| 6.2 | Permissions | Files owned by root:data-ops, group-readable |
| 6.3 | Data Catalog | /catalog page shows 6 categories with 9 tables |
| 6.4 | Catalog header | "9 tables, ~217K rows total" (from profiles.json) |
| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights |
| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links |
| 6.7 | Used by Metrics | Orders overview → shows total_revenue, campaign_roi, etc. badges |
| 6.8 | Business Metrics | /catalog shows "Business Metrics" card with 4 categories, 10 metrics |
| 6.9 | Metric modal | Click any metric → modal with SQL examples, dimensions, notes |
| 6.10 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) |
| 6.11 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.12 | DuckDB loads | SELECT count(*) FROM read_parquet('orders.parquet') returns rows |

Step 7: Real Data Source (Production)

When ready, replace sample data with a real data source adapter in instance/config/instance.yaml:

data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"

Add the token to .env and create config/data_description.md with table schemas.

Other planned adapters: BigQuery, CSV import.

Deployment Workflow (Ongoing)

Update OSS code

cd /opt/data-analyst/repo && git pull
bash server/deploy.sh   # restarts services, syncs scripts/docs

Update instance config

cd /opt/data-analyst/instance && git pull
systemctl restart webapp  # picks up new instance.yaml via symlink

Both at once

git -C /opt/data-analyst/repo pull
git -C /opt/data-analyst/instance pull
# deploy.sh lives in the OSS repo, so run it from there
cd /opt/data-analyst/repo && bash server/deploy.sh

Server Layout Summary

/opt/data-analyst/
├── repo/           -> git@github-oss:ORG/OSS_REPO.git
├── instance/       -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env            # Secrets (not in git)
├── .venv/          # Python
└── logs/           # App logs

/root/.ssh/
├── deploy_key      # For OSS repo (github-oss alias)
├── instance_key    # For instance repo (github-cfg alias)
└── config          # Maps aliases to keys

Symlinks:
  repo/config/instance.yaml -> instance/config/instance.yaml
  repo/docs/data_description.md -> instance/config/data_description.md (optional)

Quick Verification

# Health check
curl -s http://YOUR_IP/health | python3 -m json.tool

# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200

# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'