Automated Installation Guide
Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.
Two repos are involved:
- OSS repo (public/private): application code (padak/tmp_oss)
- Instance repo (private): your config, secrets template, data schema (padak/tmp_oss_cfg)
Architecture on Server
/opt/data-analyst/
├── repo/ # OSS repo clone
│ ├── config/
│ │ └── instance.yaml -> ../../instance/config/instance.yaml (symlink)
│ ├── webapp/
│ ├── server/
│ └── ...
├── instance/ # Private instance repo clone
│ ├── config/
│ │ ├── instance.yaml # Branding, auth domains, data source
│ │ └── data_description.md # Data schema (when configured)
│ ├── docs/setup/ # Custom CLAUDE.md template, etc.
│ ├── .env.example # Secrets template
│ └── README.md
├── .env # Secrets (not in git, from .env.example)
├── .venv/ # Python virtual environment
└── logs/ # Application logs
Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them.
Prerequisites
- DigitalOcean API token with ssh_key scope (or any Ubuntu 24.04 VM)
- Two GitHub repos: one for OSS code, one for private instance config
- SSH key on your local machine for server access
Known Issues
- python3-venv must be installed before server/setup.sh (Ubuntu 24.04 omits it)
- webapp-setup.sh generates an SSL nginx config; use the HTTP-only config for IP-only deployments
- DigitalOcean cloud-init cannot override password expiry; must use the ssh_keys API field
Step 0: Create Repos
# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main
# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup
Step 1: Provision VM
1a: Create Droplet (DigitalOcean)
# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
-H "Authorization: Bearer $DO_TOKEN" \
-d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
"https://api.digitalocean.com/v2/account/keys"
# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
-H "Authorization: Bearer $DO_TOKEN" \
-d '{
"name":"data-analyst-1",
"size":"s-1vcpu-2gb",
"region":"ams3",
"image":"ubuntu-24-04-x64",
"ssh_keys":["KEY_ID_OR_FINGERPRINT"]
}' \
"https://api.digitalocean.com/v2/droplets"
1b: Install Prerequisites
ssh root@DROPLET_IP
# Auto-updates run on first boot and may hold the apt lock;
# if apt reports a lock, wait a minute and retry
apt update && apt install -y python3.12-venv python3-pip
1c: Generate Deploy Keys
Two separate keys - one per repo, for security isolation:
# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"
# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"
Add each public key as a deploy key on its respective GitHub repo:
- deploy_key.pub -> OSS repo Settings > Deploy Keys
- instance_key.pub -> Instance repo Settings > Deploy Keys
Configure SSH to use the right key per repo:
cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
HostName github.com
IdentityFile /root/.ssh/deploy_key
StrictHostKeyChecking no
# Instance config repo (private)
Host github-cfg
HostName github.com
IdentityFile /root/.ssh/instance_key
StrictHostKeyChecking no
EOF
chmod 600 /root/.ssh/config
1d: Clone OSS Repo & Run Setup
git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh
Step 1 Checklist
| # | Check | Expected |
|---|---|---|
| 1.1 | Groups | data-ops, dataread, data-private exist |
| 1.2 | Deploy user | uid deploy, groups: deploy, data-ops |
| 1.3 | Directories | /opt/data-analyst/{repo,.venv,logs} |
| 1.4 | Python venv | Flask loads in .venv |
| 1.5 | Scripts | add-analyst, list-analysts in /usr/local/bin |
Step 2: Webapp Setup
2a: Run webapp-setup.sh
export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh
For IP-only (no SSL), replace nginx config:
cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
listen 80;
server_name _;
location / {
proxy_pass http://unix:/run/webapp/webapp.sock;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
location /static/ {
alias /opt/data-analyst/repo/webapp/static/;
expires 1d;
}
location /health {
proxy_pass http://unix:/run/webapp/webapp.sock;
proxy_set_header Host $host;
access_log off;
}
}
NGINX
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx
2b: Create .env
SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')
cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF
chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
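Before starting the webapp it can help to confirm that the .env file has the expected keys and is not readable by other users. The following is a minimal sketch, not part of the OSS repo: the function names `parse_env` and `check_env_file` and the required-key list are assumptions derived from the file written above.

```python
import stat
from pathlib import Path

# Keys written by Step 2b (assumed to be the minimum the webapp needs).
REQUIRED_KEYS = {
    "WEBAPP_SECRET_KEY", "SERVER_HOST", "SERVER_HOSTNAME",
    "DATA_SOURCE", "DATA_DIR",
}


def parse_env(text):
    """Parse simple KEY="value" lines; comments and blank lines are skipped."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def check_env_file(path):
    """Return a list of problems found in the .env file (empty list = OK)."""
    path = Path(path)
    values = parse_env(path.read_text())
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(values))]
    if len(values.get("WEBAPP_SECRET_KEY", "")) < 32:
        problems.append("WEBAPP_SECRET_KEY looks too short")
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o007:  # any permission bit set for 'other'
        problems.append(f"file is accessible to other users (mode {mode:o})")
    return problems
```

Calling `check_env_file("/opt/data-analyst/.env")` on the server should return an empty list after the chown/chmod above.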
2c: Create Data Directories & Start
mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data
mkdir -p /run/webapp
chown www-data:www-data /run/webapp
systemctl daemon-reload
systemctl start webapp
systemctl enable webapp
Step 2 Checklist
| # | Check | Expected |
|---|---|---|
| 2.1 | Nginx | active, port 80 |
| 2.2 | Webapp | active (gunicorn) |
| 2.3 | Health | curl http://IP/health returns JSON |
| 2.4 | Login page | HTTP 200 at /login |
Step 3: Instance Configuration (Private Repo)
3a: Clone Instance Repo
git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance
3b: Initialize Instance Config (if empty repo)
If this is a fresh instance repo, create the initial config:
cd /opt/data-analyst/instance
mkdir -p config docs/setup
cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"
server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"
auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"
data_source:
  type: "local"
catalog:
  categories: {}
YAML
# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV
cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI
git add -A && git commit -m "Initial instance config" && git push origin main
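The instance.yaml above keeps the secret out of git by writing `${WEBAPP_SECRET_KEY}` as a placeholder, with the real value living in .env. How the application resolves that placeholder is not shown in this guide; the sketch below illustrates one plausible way to expand such placeholders from environment variables. The function name `expand_placeholders` is hypothetical.

```python
import os
import re

# Matches ${NAME} where NAME is a shell-style variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text, env=None):
    """Replace ${NAME} with env[NAME]; raise KeyError if any name is unset."""
    env = os.environ if env is None else env
    missing = []

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            missing.append(name)
            return match.group(0)  # leave the placeholder untouched
        return env[name]

    result = _PLACEHOLDER.sub(_substitute, text)
    if missing:
        raise KeyError("unset variables: " + ", ".join(sorted(set(missing))))
    return result
```

Failing loudly on an unset variable avoids silently starting the webapp with a literal `${...}` string as its secret key.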
3c: Symlink Config into OSS Repo
# Remove any existing instance.yaml (from manual setup) and symlink
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml
# Symlink data_description.md (for Data Catalog - add when ready in Step 6)
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md
systemctl restart webapp
Step 3 Checklist
| # | Check | Expected |
|---|---|---|
| 3.1 | Instance repo | /opt/data-analyst/instance/ exists |
| 3.2 | Symlink | repo/config/instance.yaml resolves to /opt/data-analyst/instance/config/instance.yaml |
| 3.3 | Webapp loads | Instance name shown on login page |
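Check 3.2 can be scripted. This is a small sketch under the paths used in this guide; the helper name `symlink_points_to` is an assumption, not something shipped with the repo. It also catches dangling links, which matters because the data_description.md symlink in Step 3c may be created before its target exists.

```python
import os
from pathlib import Path


def symlink_points_to(link, expected_target):
    """True if `link` is a symlink that resolves to an existing `expected_target`."""
    link = Path(link)
    if not link.is_symlink():
        return False
    if not Path(expected_target).exists():
        return False  # dangling link: target has not been created yet
    # Compare fully resolved paths so relative and absolute links both match.
    return os.path.realpath(link) == os.path.realpath(expected_target)
```

For example, `symlink_points_to("/opt/data-analyst/repo/config/instance.yaml", "/opt/data-analyst/instance/config/instance.yaml")` should be true after Step 3c.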
Step 4: Authentication
Email magic link works without any external service.
- Login page shows "Sign in with Email"
- User enters email with allowed domain
- Without SMTP: magic link shown in browser (dev mode)
- With SMTP: link sent via email
- Click link -> logged in -> dashboard
Optional: add Google OAuth by setting real GOOGLE_CLIENT_ID/GOOGLE_CLIENT_SECRET.
Step 4 Checklist
| # | Check | Expected |
|---|---|---|
| 4.1 | Email auth | "Sign in with Email" on login page |
| 4.2 | Magic link | Generated for valid domain email |
| 4.3 | Domain check | Rejects wrong domains |
| 4.4 | Login flow | Magic link -> dashboard with session |
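The domain check in 4.3 is worth getting exactly right, since a substring or suffix match would accept look-alike domains. The webapp's own check is not shown in this guide; the following is only an illustrative sketch with a hypothetical function name, comparing the part after the last `@` against `auth.allowed_domain`.

```python
def email_domain_allowed(email, allowed_domain):
    """True only if the email's domain equals allowed_domain (case-insensitive)."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return False  # not of the form local@domain
    return domain.lower() == allowed_domain.strip().lower()
```

With `allowed_domain` set to `mycompany.com`, `ana@mycompany.com` passes while `ana@evil-mycompany.com` and `ana@mycompany.com.evil.org` are rejected.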
Step 5: Onboarding Flow (End-User)
After server is set up, analysts self-onboard via the webapp:
- Visit http://YOUR_SERVER/login and sign in with email
- Dashboard shows "Get Started" with 4 steps:
  - Create project folder (mkdir -p data-analyst && cd data-analyst)
  - Generate SSH key (ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N '')
  - Copy public key (cat ~/.ssh/data_analyst_server.pub)
  - Paste key into form, click "Create Account"
- After account creation, dashboard shows "Set up your local environment"
- User runs claude in their project folder, pastes setup instructions
- Claude Code configures SSH, rsyncs data, sets up Python + DuckDB
Step 6: Sample Data (Try Without a Data Adapter)
Before connecting a real data source, you can load sample data to verify the full pipeline (Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).
How the Data Catalog & Profiler Pipeline Works
Instance repo                           Server filesystem                   Webapp
─────────────                           ────────────────                    ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                        │
   foreign_keys)                                  │
                                                  ▼
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                            │
   labels, icons, order)                          │
                                                  ▼
                                  /data/src_data/parquet/*.parquet
                                                  │
                                        ┌─────────┴──────────┐
                                        ▼                    ▼
                              python -m src.profiler   _load_catalog_data()
                                        │                    │
                                        ▼                    ▼
                             /data/src_data/metadata/  /catalog page
                                  profiles.json        (categories + tables)
                                        │
                             ┌──────────┴──────────┐
                             ▼                     ▼
                   /api/catalog/profile/     _load_data_stats()
                   (per-table stats,         (header: "9 tables,
                    columns, alerts,          ~217K rows total")
                    relationships)
Key files and their roles:
| File | Location | Purpose |
|---|---|---|
| data_description.md | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys |
| instance.yaml | Instance repo | Catalog category labels, icons, display order |
| *.parquet | /data/src_data/parquet/ | Actual data files (flat or in subfolders) |
| profiles.json | /data/src_data/metadata/ | Profiler output: statistics, alerts, relationships per table |
| sync_state.json | /data/src_data/metadata/ | Sync process stats (optional; profiler provides fallback) |
folder_mapping serves a dual purpose: it maps table IDs to catalog categories for the UI
and to filesystem paths for the profiler. The profiler auto-detects flat layouts
(all parquet files in one directory) vs subfolder layouts (Keboola-style parquet/<folder>/<table>.parquet).
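To make the two layouts concrete, here is a rough sketch of how flat vs subfolder detection could work. It is not the profiler's actual implementation, which this guide does not show; `detect_layout` and `table_path` are hypothetical names.

```python
from pathlib import Path


def detect_layout(parquet_root):
    """Return 'flat', 'subfolders', or 'empty' for a parquet directory."""
    root = Path(parquet_root)
    if any(root.glob("*.parquet")):
        return "flat"          # parquet/<table>.parquet
    if any(root.glob("*/*.parquet")):
        return "subfolders"    # parquet/<folder>/<table>.parquet
    return "empty"


def table_path(parquet_root, folder, table):
    """Build the expected file path for a table under the detected layout."""
    root = Path(parquet_root)
    if detect_layout(root) == "flat":
        return root / f"{table}.parquet"
    return root / folder / f"{table}.parquet"
```

The sample data generated in Step 6a is written straight into /data/src_data/parquet, so it would be classified as flat here.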
6a: Generate Parquet Files
cd /opt/data-analyst/repo
# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker
# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
--size m --format parquet --output /data/src_data/parquet --seed 42
# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet
Available sizes: xs (50 customers, ~1 MB), s (500, ~15 MB), m (5K, ~150 MB), l (50K, ~1.5 GB).
See docs/sample-data.md for the full data model and built-in analytical patterns.
6b: Configure Data Catalog
The Data Catalog reads from two files in the instance repo:
- config/data_description.md - YAML block with folder_mapping, tables (id, name, description, primary_key, sync_strategy, foreign_keys)
- config/instance.yaml - catalog.categories with label, icon per category + catalog.order
The folder_mapping maps bucket prefixes from table IDs to category names. Example:
table ID sample.sales.orders → bucket sample.sales → folder sales → category "Sales & Orders".
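The mapping chain in that example can be written out as a short sketch. The function names are hypothetical and the dictionaries mirror the shapes described above (`folder_mapping` keyed by bucket prefix, `catalog.categories` keyed by category); the webapp's real lookup code may differ.

```python
def split_table_id(table_id):
    """Split 'sample.sales.orders' into ('sample.sales', 'orders')."""
    bucket, sep, table = table_id.rpartition(".")
    if not sep or not bucket or not table:
        raise ValueError(f"expected '<bucket>.<table>', got {table_id!r}")
    return bucket, table


def resolve_category(table_id, folder_mapping, categories):
    """Follow table ID -> bucket -> folder -> category label."""
    bucket, table = split_table_id(table_id)
    folder = folder_mapping[bucket]  # KeyError if the bucket is unmapped
    # Fall back to the folder key when no label is configured.
    label = categories.get(folder, {}).get("label", folder)
    return {"table": table, "bucket": bucket, "folder": folder, "label": label}
```

For instance, with `folder_mapping = {"sample.sales": "sales"}` and a `sales` category labelled "Sales & Orders", resolving `sample.sales.orders` yields table `orders` in folder `sales` under that label.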
Tables with foreign_keys will show interactive relationship diagrams in the profiler modal.
Add data_description.md to the instance repo with the sample tables:
cd /opt/data-analyst/instance
# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with:
# folder_mapping: { "bucket.prefix": "category_key", ... }
# tables: list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
# foreign_keys:
# - column: "customer_id"
# references: "customers.customer_id"
# description: "Ordering customer"
# Add catalog categories to instance.yaml.
# If the file still contains the empty "catalog:" block from Step 3b,
# delete that block first so the key is not defined twice.
cat >> config/instance.yaml << 'YAML'
catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML
git add -A && git commit -m "Add sample data catalog" && git push origin main
Then symlink and restart:
# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
/opt/data-analyst/repo/docs/data_description.md
systemctl restart webapp
6c: Run Data Profiler
The profiler reads parquet files + data_description.md and generates profiles.json
with per-table statistics, column analysis, data quality alerts, and relationship maps.
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler
Output: /data/src_data/metadata/profiles.json (auto-created, readable by webapp).
The profiler provides:
- Overview: row count, column count, file size, date coverage, missing cell %
- Columns: type distribution, top values, histograms for numeric columns
- Insights: data quality alerts (high missing %, imbalanced categories, high cardinality)
- Relationships: FK diagram built from foreign_keys in data_description.md
- Sample: first 5 rows of the table
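As an illustration of the Relationships item, the sketch below turns `foreign_keys` entries into edges that a diagram could draw. It assumes the table definitions have already been parsed from the YAML block into dictionaries and that `references` has the form `<table>.<column>` as in the example in Step 6b; `relationship_edges` is a hypothetical helper, not the profiler's API.

```python
def relationship_edges(tables):
    """Return (from_table, from_column, to_table, to_column) tuples."""
    edges = []
    for table in tables:
        for fk in table.get("foreign_keys", []):
            # Split on the last dot so dotted table names stay intact.
            ref_table, _, ref_column = fk["references"].rpartition(".")
            edges.append((table["name"], fk["column"], ref_table, ref_column))
    return edges
```

A table without `foreign_keys` simply contributes no edges, which matches the key being optional.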
Without sync_state.json (no data adapter running), the profiler computes file sizes
directly from parquet files, and the catalog header derives table/row counts from profiles.json.
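The fallback for the catalog header can be pictured roughly as follows. The layout of profiles.json is not documented in this guide, so the structure used here (a top-level `tables` object whose entries carry a `row_count`) is purely an assumption for illustration, as are the function names.

```python
import json
from pathlib import Path


def header_stats(profiles_path):
    """Return (table_count, total_rows) from a profiles.json-like file.

    ASSUMED layout: {"tables": {"<name>": {"row_count": <int>, ...}, ...}}
    """
    data = json.loads(Path(profiles_path).read_text())
    tables = data.get("tables", {})
    total_rows = sum(int(t.get("row_count", 0)) for t in tables.values())
    return len(tables), total_rows


def format_header(table_count, total_rows):
    """Format counts in the style of the catalog header."""
    if total_rows >= 1000:
        return f"{table_count} tables, ~{round(total_rows / 1000)}K rows total"
    return f"{table_count} tables, {total_rows} rows total"
```

Check the real profiles.json on your server before relying on any of these key names.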
To re-run after data changes:
cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request
Step 6 Checklist
| # | Check | Expected |
|---|---|---|
| 6.1 | Parquet files | ls /data/src_data/parquet/*.parquet shows 9 files |
| 6.2 | Permissions | Files owned by root:data-ops, group-readable |
| 6.3 | Data Catalog | /catalog page shows 6 categories with 9 tables |
| 6.4 | Catalog header | "9 tables, ~217K+ rows total" (from profiles.json) |
| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights |
| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links |
| 6.7 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) |
| 6.8 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.9 | DuckDB loads | SELECT count(*) FROM read_parquet('orders.parquet') returns rows |
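Checks 6.1 and 6.2 can be automated with the standard library. This is a minimal sketch with a hypothetical function name; it verifies the file count and the group-read bit but does not check that the group is actually data-ops.

```python
import stat
from pathlib import Path


def check_parquet_dir(parquet_dir, expected_files=9):
    """Return a list of problems with the parquet directory (empty = OK)."""
    files = sorted(Path(parquet_dir).glob("*.parquet"))
    problems = []
    if len(files) != expected_files:
        problems.append(f"expected {expected_files} parquet files, found {len(files)}")
    for f in files:
        if not f.stat().st_mode & stat.S_IRGRP:  # group-read bit missing
            problems.append(f"{f.name}: not group-readable")
    return problems
```

Running `check_parquet_dir("/data/src_data/parquet")` after Step 6a should return an empty list for the sample data set.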
Step 7: Real Data Source (Production)
When ready, replace sample data with a real data source adapter in instance/config/instance.yaml:
data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"
Add the token to .env and create config/data_description.md with table schemas.
Other planned adapters: BigQuery, CSV import.
Deployment Workflow (Ongoing)
Update OSS code
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh # restarts services, syncs scripts/docs
Update instance config
cd /opt/data-analyst/instance && git pull
systemctl restart webapp # picks up new instance.yaml via symlink
Both at once
cd /opt/data-analyst/instance && git pull
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh # run from the OSS repo, where server/deploy.sh lives
Server Layout Summary
/opt/data-analyst/
├── repo/ -> git@github-oss:ORG/OSS_REPO.git
├── instance/ -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env # Secrets (not in git)
├── .venv/ # Python
└── logs/ # App logs
/root/.ssh/
├── deploy_key # For OSS repo (github-oss alias)
├── instance_key # For instance repo (github-cfg alias)
└── config # Maps aliases to keys
Symlinks:
repo/config/instance.yaml -> instance/config/instance.yaml
repo/docs/data_description.md -> instance/config/data_description.md (optional)
Quick Verification
# Health check
curl http://YOUR_IP/health | python3 -m json.tool
# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200
# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'