# Automated Installation Guide

Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.

Two repos are involved:
- **OSS repo** (public/private): application code (`padak/tmp_oss`)
- **Instance repo** (private): your config, secrets template, data schema (`padak/tmp_oss_cfg`)

## Architecture on Server

```
/opt/data-analyst/
├── repo/                          # OSS repo clone
│   ├── config/
│   │   └── instance.yaml -> ../../instance/config/instance.yaml (symlink)
│   ├── webapp/
│   ├── server/
│   └── ...
├── instance/                      # Private instance repo clone
│   ├── config/
│   │   ├── instance.yaml          # Branding, auth domains, data source
│   │   └── data_description.md    # Data schema (when configured)
│   ├── docs/setup/                # Custom CLAUDE.md template, etc.
│   ├── .env.example               # Secrets template
│   └── README.md
├── .env                           # Secrets (not in git, from .env.example)
├── .venv/                         # Python virtual environment
└── logs/                          # Application logs
```

Key principle: the OSS repo holds no secrets or config, the instance repo holds no code, and symlinks bridge the two.

## Prerequisites

1. **DigitalOcean API token** with `ssh_key` scope (or any Ubuntu 24.04 VM)
2. **Two GitHub repos**: one for OSS code, one for private instance config
3. **SSH key** on your local machine for server access

### Known Issues

- `python3-venv` must be installed before `server/setup.sh` (Ubuntu 24.04 omits it)
- `webapp-setup.sh` generates SSL nginx config - use HTTP-only for IP-only deployments
- DigitalOcean cloud-init cannot override password expiry; must use `ssh_keys` API field

## Step 0: Create Repos

```bash
# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main

# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup
```

## Step 1: Provision VM

### 1a: Create Droplet (DigitalOcean)

```bash
# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
  "https://api.digitalocean.com/v2/account/keys"

# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{
    "name":"data-analyst-1",
    "size":"s-1vcpu-2gb",
    "region":"ams3",
    "image":"ubuntu-24-04-x64",
    "ssh_keys":["KEY_ID_OR_FINGERPRINT"]
  }' \
  "https://api.digitalocean.com/v2/droplets"
```

### 1b: Install Prerequisites

```bash
ssh root@DROPLET_IP

# Wait for apt lock (auto-updates run on first boot)
apt update && apt install -y python3.12-venv python3-pip
```

### 1c: Generate Deploy Keys

Two separate keys - one per repo, for security isolation:

```bash
# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"

# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"
```

Add each public key as a **deploy key** on its respective GitHub repo:
- `deploy_key.pub` -> OSS repo Settings > Deploy Keys
- `instance_key.pub` -> Instance repo Settings > Deploy Keys

Configure SSH to use the right key per repo:

```bash
cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
    HostName github.com
    IdentityFile /root/.ssh/deploy_key
    StrictHostKeyChecking no

# Instance config repo (private)
Host github-cfg
    HostName github.com
    IdentityFile /root/.ssh/instance_key
    StrictHostKeyChecking no
EOF
chmod 600 /root/.ssh/config
```

### 1d: Clone OSS Repo & Run Setup

```bash
git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh
```

### Step 1 Checklist

| # | Check | Expected |
|---|-------|----------|
| 1.1 | Groups | data-ops, dataread, data-private exist |
| 1.2 | Deploy user | uid deploy, groups: deploy, data-ops |
| 1.3 | Directories | /opt/data-analyst/{repo,.venv,logs} |
| 1.4 | Python venv | Flask loads in .venv |
| 1.5 | Scripts | add-analyst, list-analysts in /usr/local/bin |

## Step 2: Webapp Setup

### 2a: Run webapp-setup.sh

```bash
cd /opt/data-analyst/repo
export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh
```

For IP-only (no SSL), replace nginx config:

```bash
cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /static/ {
        alias /opt/data-analyst/repo/webapp/static/;
        expires 1d;
    }

    location /health {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        access_log off;
    }
}
NGINX

rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx
```

### 2b: Create .env

```bash
SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')

cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF

chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
```

### 2c: Create Data Directories & Start

```bash
mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data

mkdir -p /run/webapp
chown www-data:www-data /run/webapp

systemctl daemon-reload
systemctl start webapp
systemctl enable webapp
```

### Step 2 Checklist

| # | Check | Expected |
|---|-------|----------|
| 2.1 | Nginx | active, port 80 |
| 2.2 | Webapp | active (gunicorn) |
| 2.3 | Health | `curl http://IP/health` returns JSON |
| 2.4 | Login page | HTTP 200 at /login |

## Step 3: Instance Configuration (Private Repo)

### 3a: Clone Instance Repo

```bash
git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance
```

### 3b: Initialize Instance Config (if empty repo)

If this is a fresh instance repo, create the initial config:

```bash
cd /opt/data-analyst/instance
mkdir -p config docs/setup

cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"

server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"

auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"

data_source:
  type: "local"

catalog:
  categories: {}
YAML

# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV

cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI

git add -A && git commit -m "Initial instance config" && git push origin main
```

### 3c: Symlink Config into OSS Repo

```bash
# Remove any existing instance.yaml (from manual setup) and symlink
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml

# Symlink data_description.md (for Data Catalog - add when ready in Step 6)
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp
```

### Step 3 Checklist

| # | Check | Expected |
|---|-------|----------|
| 3.1 | Instance repo | /opt/data-analyst/instance/ exists |
| 3.2 | Symlink | repo/config/instance.yaml resolves to /opt/data-analyst/instance/config/instance.yaml |
| 3.3 | Webapp loads | Instance name shown on login page |

## Step 4: Authentication

Email magic link works without any external service.

1. Login page shows "Sign in with Email"
2. User enters email with allowed domain
3. Without SMTP: magic link shown in browser (dev mode)
4. With SMTP: link sent via email
5. Click link -> logged in -> dashboard

Optional: add Google OAuth by setting real `GOOGLE_CLIENT_ID`/`GOOGLE_CLIENT_SECRET`.

### Step 4 Checklist

| # | Check | Expected |
|---|-------|----------|
| 4.1 | Email auth | "Sign in with Email" on login page |
| 4.2 | Magic link | Generated for valid domain email |
| 4.3 | Domain check | Rejects wrong domains |
| 4.4 | Login flow | Magic link -> dashboard with session |

## Step 5: Onboarding Flow (End-User)

After the server is set up, analysts self-onboard via the webapp:

1. Visit `http://YOUR_SERVER/login` and sign in with email
2. Dashboard shows "Get Started" with 4 steps:
   - Create project folder (`mkdir -p data-analyst && cd data-analyst`)
   - Generate SSH key (`ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N ''`)
   - Copy public key (`cat ~/.ssh/data_analyst_server.pub`)
   - Paste key into form, click "Create Account"
3. After account creation, dashboard shows "Set up your local environment"
4. User runs `claude` in their project folder, pastes setup instructions
5. Claude Code configures SSH, rsyncs data, sets up Python + DuckDB

## Step 6: Sample Data (Try Without a Data Adapter)

Before connecting a real data source, you can load sample data to verify the full pipeline (Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).

### How the Data Catalog & Profiler Pipeline Works

```
Instance repo                    Server filesystem                      Webapp
─────────────                    ────────────────                       ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                    │
   foreign_keys)                              │
                                              ▼
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                        │
   labels, icons, order)                      │
                                              ▼
                                 /data/src_data/parquet/*.parquet
                                              │
                                    ┌─────────┴──────────┐
                                    ▼                    ▼
                          python -m src.profiler   _load_catalog_data()
                                    │                    │
                                    ▼                    ▼
                          /data/src_data/metadata/  /catalog page
                            profiles.json           (categories + tables)
                                    │
                         ┌──────────┴──────────┐
                         ▼                     ▼
                /api/catalog/profile/    _load_data_stats()
                (per-table stats,        (header: "9 tables,
                 columns, alerts,         ~217K rows total")
                 relationships,
                 used_by_metrics)

docs/metrics/*/*.yml ──────────────> _load_metrics_data()
  (metric definitions,                        │
   SQL examples,                              ▼
   dimensions)                       /catalog "Business Metrics" card
                                     /api/metrics/ (modal detail)
```

Key files and their roles:

| File | Location | Purpose |
|------|----------|---------|
| `data_description.md` | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys |
| `instance.yaml` | Instance repo | Catalog category labels, icons, display order |
| `*.parquet` | `/data/src_data/parquet/` | Actual data files (flat or in subfolders) |
| `profiles.json` | `/data/src_data/metadata/` | Profiler output: statistics, alerts, relationships per table |
| `sync_state.json` | `/data/src_data/metadata/` | Sync process stats (optional; profiler provides fallback) |
| `docs/metrics/*/*.yml` | OSS repo (sample) or instance repo (production) | Business metric definitions with SQL examples |

**Folder mapping** serves a dual purpose: it maps table IDs to catalog categories for the UI and to filesystem paths for the profiler. The profiler auto-detects flat layouts (all parquet files in one directory) vs subfolder layouts (Keboola-style `parquet//.parquet`).

### 6a: Generate Parquet Files

```bash
cd /opt/data-analyst/repo

# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker

# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
  --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet
```

Available sizes: `xs` (50 customers, ~1 MB), `s` (500, ~15 MB), `m` (5K, ~150 MB), `l` (50K, ~1.5 GB).

See `docs/sample-data.md` for the full data model and built-in analytical patterns.

### 6b: Configure Data Catalog

The Data Catalog reads from two files in the **instance repo**:

1. **`config/data_description.md`** - YAML block with `folder_mapping`, `tables` (id, name, description, primary_key, sync_strategy, foreign_keys)
2. **`config/instance.yaml`** - `catalog.categories` with label, icon per category + `catalog.order`

The `folder_mapping` maps bucket prefixes from table IDs to category names. Example: table ID `sample.sales.orders` → bucket `sample.sales` → folder `sales` → category "Sales & Orders".

Tables with `foreign_keys` will show interactive relationship diagrams in the profiler modal.

Add `data_description.md` to the instance repo with the sample tables:

```bash
cd /opt/data-analyst/instance

# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with:
#   folder_mapping: { "bucket.prefix": "category_key", ... }
#   tables: list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
#   foreign_keys:
#     - column: "customer_id"
#       references: "customers.customer_id"
#       description: "Ordering customer"

# Add catalog categories to instance.yaml.
# Note: Step 3b already wrote an empty `catalog:` block. Delete it from
# config/instance.yaml before appending, so the file keeps a single
# top-level `catalog:` key (YAML mapping keys must be unique).
cat >> config/instance.yaml << 'YAML'

catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML

git add -A && git commit -m "Add sample data catalog" && git push origin main
```

Then symlink and restart:

```bash
# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
  /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp
```

### 6c: Business Metrics

The Data Catalog includes a **Business Metrics** card that dynamically renders metric definitions from YAML files. The OSS repo ships with 10 sample e-commerce metrics in `docs/metrics/` (4 categories: revenue, customers, marketing, support) that align with the sample data generator tables.

**How it works:**
- Webapp scans `docs/metrics/*/*.yml` (production: `/data/docs/metrics/`)
- Each YAML file defines one metric with SQL examples, dimensions, and notes
- The profiler links metrics to tables via `used_by_metrics` in `profiles.json`
- Clicking a metric opens a modal with Overview, How to Use, SQL Examples, and Technical tabs

**For sample data:** metrics work out of the box - the OSS repo includes sample definitions.

**For production:** create metric YAMLs in the **instance repo** and deploy them to `/data/docs/metrics/` on the server.
The production path takes precedence over the OSS repo.

```bash
# Instance repo: create metric definitions
mkdir -p /opt/data-analyst/instance/docs/metrics/{revenue,operations}
# ... add your .yml files ...

# Deploy metrics to server. Copying "metrics/." into an existing target
# keeps re-deploys from nesting a second metrics/ directory inside it.
mkdir -p /data/docs/metrics
cp -r /opt/data-analyst/instance/docs/metrics/. /data/docs/metrics/
chown -R root:data-ops /data/docs/metrics
chmod -R 2775 /data/docs/metrics
```

Each metric YAML file follows this structure (list with one dict):

```yaml
- name: metric_name
  display_name: Human Readable Name
  category: revenue            # must match parent directory name
  type: sum                    # sum, average, count_distinct, ratio
  unit: USD
  grain: monthly
  time_column: order_date
  table: orders                # primary table
  tables: [orders, customers]  # optional: all referenced tables
  expression: "SUM(total_amount)"
  description: "What this metric measures..."
  dimensions: [channel, region]
  notes: ["Important context..."]
  synonyms: [alias1, alias2]
  sql: |
    SELECT ... FROM ... GROUP BY ...
  sql_by_channel: |            # any sql_* key is auto-discovered
    SELECT ... GROUP BY channel
```

### 6d: Run Data Profiler

The profiler reads parquet files + `data_description.md` and generates `profiles.json` with per-table statistics, column analysis, data quality alerts, and relationship maps.

```bash
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler
```

Output: `/data/src_data/metadata/profiles.json` (auto-created, readable by webapp).
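
After a profiler run it can help to confirm that the output file exists and is not older than the data it describes. The function below is a minimal sketch under stated assumptions: `profiles_fresh` is a hypothetical helper that is not part of either repo, it compares file modification times only, and it does not inspect the contents of `profiles.json`.

```bash
# Hypothetical helper (not shipped with the repo): succeeds only if
# profiles.json exists, is non-empty, and no Parquet file is newer than it.
profiles_fresh() {
    local parquet_dir="${1:-/data/src_data/parquet}"
    local profiles="${2:-/data/src_data/metadata/profiles.json}"

    # -s is true only for an existing file with size > 0
    if [ ! -s "$profiles" ]; then
        echo "missing or empty: $profiles" >&2
        return 1
    fi

    # find -newer prints Parquet files modified after profiles.json
    if find "$parquet_dir" -name '*.parquet' -newer "$profiles" | grep -q .; then
        echo "stale: Parquet files changed after $profiles was written" >&2
        return 2
    fi

    echo "ok: $profiles is up to date"
}
```

Called with no arguments, `profiles_fresh` checks the default paths used in this guide; a non-zero exit status suggests re-running the profiler.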

The profiler provides:
- **Overview**: row count, column count, file size, date coverage, missing cell %
- **Columns**: type distribution, top values, histograms for numeric columns
- **Insights**: data quality alerts (high missing %, imbalanced categories, high cardinality)
- **Relationships**: FK diagram built from `foreign_keys` in `data_description.md`, plus linked Business Metrics
- **Used by Metrics**: shows which metric definitions reference this table (from `docs/metrics/`)
- **Sample**: first 5 rows of the table

Without `sync_state.json` (no data adapter running), the profiler computes file sizes directly from parquet files, and the catalog header derives table/row counts from `profiles.json`.

To re-run after data changes:

```bash
cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request
```

### Step 6 Checklist

| # | Check | Expected |
|---|-------|----------|
| 6.1 | Parquet files | `ls /data/src_data/parquet/*.parquet` shows 9 files |
| 6.2 | Permissions | Files owned by root:data-ops, group-readable |
| 6.3 | Data Catalog | `/catalog` page shows 6 categories with 9 tables |
| 6.4 | Catalog header | "9 tables, ~217K rows total" (from profiles.json) |
| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights |
| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links |
| 6.7 | Used by Metrics | Orders overview → shows total_revenue, campaign_roi, etc. badges |
| 6.8 | Business Metrics | `/catalog` shows "Business Metrics" card with 4 categories, 10 metrics |
| 6.9 | Metric modal | Click any metric → modal with SQL examples, dimensions, notes |
| 6.10 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) |
| 6.11 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.12 | DuckDB loads | `SELECT count(*) FROM read_parquet('orders.parquet')` returns rows |

## Step 7: Real Data Source (Production)

When ready, replace sample data with a real data source adapter in `instance/config/instance.yaml`:

```yaml
data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"
```

Add the token to `.env` and create `config/data_description.md` with table schemas.

Other planned adapters: BigQuery, CSV import.

## Deployment Workflow (Ongoing)

### Update OSS code

```bash
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh  # restarts services, syncs scripts/docs
```

### Update instance config

```bash
cd /opt/data-analyst/instance && git pull
systemctl restart webapp  # picks up new instance.yaml via symlink
```

### Both at once

```bash
# Pull the instance repo first so the shell ends up in the OSS repo,
# which is where server/deploy.sh lives
cd /opt/data-analyst/instance && git pull
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh
```

## Server Layout Summary

```
/opt/data-analyst/
├── repo/         -> git@github-oss:ORG/OSS_REPO.git
├── instance/     -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env          # Secrets (not in git)
├── .venv/        # Python
└── logs/         # App logs

/root/.ssh/
├── deploy_key    # For OSS repo (github-oss alias)
├── instance_key  # For instance repo (github-cfg alias)
└── config        # Maps aliases to keys

Symlinks:
  repo/config/instance.yaml     -> instance/config/instance.yaml
  repo/docs/data_description.md -> instance/config/data_description.md (optional)
```

## Quick Verification

```bash
# Health check
curl -s http://YOUR_IP/health | python3 -m json.tool

# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200

# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'
```
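
The three checks above can be wrapped in one function for repeatable smoke tests after each deploy. This is a minimal sketch, not something shipped with either repo: `verify_instance` is a hypothetical name, and the health check here only tests for a successful HTTP status instead of parsing the JSON body.

```bash
# Hypothetical wrapper around the three Quick Verification checks.
# Usage: verify_instance BASE_URL INSTANCE_NAME
verify_instance() {
    local base_url="$1"
    local instance_name="$2"
    local failures=0
    local code

    # 1. Health endpoint must answer without an HTTP error (-f fails on >= 400)
    if curl -fsS "$base_url/health" > /dev/null; then
        echo "PASS health"
    else
        echo "FAIL health"
        failures=$((failures + 1))
    fi

    # 2. Login page must return HTTP 200
    code=$(curl -s -o /dev/null -w '%{http_code}' "$base_url/login")
    if [ "$code" = "200" ]; then
        echo "PASS login page ($code)"
    else
        echo "FAIL login page ($code)"
        failures=$((failures + 1))
    fi

    # 3. Login page must contain the configured instance name
    if curl -s "$base_url/login" | grep -qF "$instance_name"; then
        echo "PASS instance name"
    else
        echo "FAIL instance name"
        failures=$((failures + 1))
    fi

    # Exit status is 0 only when every check passed
    [ "$failures" -eq 0 ]
}
```

For example, `verify_instance "http://YOUR_IP" "My Data Analyst"` prints one PASS or FAIL line per check and returns non-zero if any check fails, which makes it usable as the last step of a deploy script.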