# Automated Installation Guide

Step-by-step deployment of AI Data Analyst on a clean Ubuntu 24.04 VM.

Two repos are involved:
- **OSS repo** (public/private): application code (`padak/tmp_oss`)
- **Instance repo** (private): your config, secrets template, data schema (`padak/tmp_oss_cfg`)

## Architecture on Server

```
/opt/data-analyst/
├── repo/                          # OSS repo clone
│   ├── config/
│   │   └── instance.yaml -> ../../instance/config/instance.yaml (symlink)
│   ├── webapp/
│   ├── server/
│   └── ...
├── instance/                      # Private instance repo clone
│   ├── config/
│   │   ├── instance.yaml          # Branding, auth domains, data source
│   │   └── data_description.md    # Data schema (when configured)
│   ├── docs/setup/                # Custom CLAUDE.md template, etc.
│   ├── .env.example               # Secrets template
│   └── README.md
├── .env                           # Secrets (not in git, from .env.example)
├── .venv/                         # Python virtual environment
└── logs/                          # Application logs
```

Key principle: OSS repo has no secrets/config. Instance repo has no code. Symlinks bridge them.

## Prerequisites

1. **DigitalOcean API token** with `ssh_key` scope (not needed if you bring your own Ubuntu 24.04 VM)
2. **Two GitHub repos**: one for OSS code, one for private instance config
3. **SSH key** on your local machine for server access

### Known Issues

- `python3-venv` must be installed before running `server/setup.sh` (Ubuntu 24.04 omits it)
- `webapp-setup.sh` generates an SSL nginx config - use the HTTP-only config from Step 2a for IP-only deployments
- DigitalOcean cloud-init cannot override password expiry; use the `ssh_keys` API field instead

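The first known issue can be checked before `server/setup.sh` is run. The following is a minimal sketch, assuming the usual Debian/Ubuntu packaging in which the `ensurepip` module is only present once `python3-venv` is installed; the helper name `venv_ready` is illustrative and not part of the project.

```python
import importlib.util


def venv_ready() -> bool:
    """Return True if the modules that `python3 -m venv` relies on are importable."""
    # Assumption: on Ubuntu a missing `ensurepip` usually means python3-venv is not installed.
    return all(
        importlib.util.find_spec(name) is not None for name in ("venv", "ensurepip")
    )


if __name__ == "__main__":
    print("venv ready" if venv_ready() else "install python3-venv first")
```

Running this with the system `python3` on a fresh droplet should indicate whether the package still needs to be installed.
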
## Step 0: Create Repos

```bash
# Push OSS code to GitHub
git remote add origin git@github.com:YOUR_ORG/YOUR_OSS_REPO.git
git push -u origin main

# Create private instance config repo on GitHub (empty, private)
# We'll populate it from the server after setup
```

## Step 1: Provision VM

### 1a: Create Droplet (DigitalOcean)

```bash
# Register SSH key (requires ssh_key scope on API token)
curl -s -X POST -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{"name":"my-key","public_key":"ssh-ed25519 AAAA..."}' \
  "https://api.digitalocean.com/v2/account/keys"

# Create droplet with SSH key
curl -s -X POST -H 'Content-Type: application/json' \
  -H "Authorization: Bearer $DO_TOKEN" \
  -d '{
    "name":"data-analyst-1",
    "size":"s-1vcpu-2gb",
    "region":"ams3",
    "image":"ubuntu-24-04-x64",
    "ssh_keys":["KEY_ID_OR_FINGERPRINT"]
  }' \
  "https://api.digitalocean.com/v2/droplets"
```

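The `KEY_ID_OR_FINGERPRINT` placeholder comes from the response to the key-registration call. As a rough sketch, assuming the response wraps the key in an `ssh_key` object with `id` and `fingerprint` fields (the shape described in the DigitalOcean API v2 documentation), it can be pulled out like this; `ssh_key_ref` is a hypothetical helper name.

```python
import json


def ssh_key_ref(response_body: str) -> tuple:
    """Return (id, fingerprint) from a key-registration response body."""
    # Assumption: the body looks like {"ssh_key": {"id": ..., "fingerprint": ...}}.
    key = json.loads(response_body)["ssh_key"]
    return key["id"], key["fingerprint"]
```

Either value can then be placed in the `ssh_keys` array of the droplet request.
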
### 1b: Install Prerequisites

```bash
ssh root@DROPLET_IP

# Auto-updates run on first boot and hold the apt lock;
# if apt reports a lock, wait a minute and re-run this command
apt update && apt install -y python3.12-venv python3-pip
```

### 1c: Generate Deploy Keys

Two separate keys - one per repo, for security isolation:

```bash
# Key for OSS repo
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N "" -C "oss-app@$(hostname)"

# Key for private instance config repo
ssh-keygen -t ed25519 -f /root/.ssh/instance_key -N "" -C "instance-config@$(hostname)"
```

Add each public key as a **deploy key** on its respective GitHub repo:
- `deploy_key.pub` -> OSS repo Settings > Deploy Keys
- `instance_key.pub` -> Instance repo Settings > Deploy Keys

Configure SSH to use the right key per repo:

```bash
cat > /root/.ssh/config << 'EOF'
# OSS application repo
Host github-oss
    HostName github.com
    IdentityFile /root/.ssh/deploy_key
    StrictHostKeyChecking no

# Instance config repo (private)
Host github-cfg
    HostName github.com
    IdentityFile /root/.ssh/instance_key
    StrictHostKeyChecking no
EOF
chmod 600 /root/.ssh/config
```

### 1d: Clone OSS Repo & Run Setup

```bash
git clone git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git /opt/data-analyst/repo
cd /opt/data-analyst/repo
REPO_URL="git@github-oss:YOUR_ORG/YOUR_OSS_REPO.git" bash server/setup.sh
```

### Step 1 Checklist

| # | Check | Expected |
|---|-------|----------|
| 1.1 | Groups | data-ops, dataread, data-private exist |
| 1.2 | Deploy user | uid deploy, groups: deploy, data-ops |
| 1.3 | Directories | /opt/data-analyst/{repo,.venv,logs} |
| 1.4 | Python venv | Flask loads in .venv |
| 1.5 | Scripts | add-analyst, list-analysts in /usr/local/bin |

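Checks 1.1 and 1.2 can be scripted with the standard `grp` and `pwd` modules. This is a small sketch rather than a project tool; the function names are illustrative, and the group and user names are the ones listed in the checklist.

```python
import grp
import pwd


def missing_groups(names):
    """Return the group names from `names` that do not exist on this host."""
    missing = []
    for name in names:
        try:
            grp.getgrnam(name)
        except KeyError:
            missing.append(name)
    return missing


def user_exists(username: str) -> bool:
    """Return True if a local account with this name exists."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


if __name__ == "__main__":
    print("missing groups:", missing_groups(["data-ops", "dataread", "data-private"]))
    print("deploy user present:", user_exists("deploy"))
```

An empty "missing groups" list and a present `deploy` user would correspond to rows 1.1 and 1.2 passing.
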
## Step 2: Webapp Setup

### 2a: Run webapp-setup.sh

```bash
export SERVER_HOSTNAME="your-domain-or-ip"
bash server/webapp-setup.sh
```

For IP-only deployments (no SSL), replace the generated nginx config with an HTTP-only one:

```bash
cat > /etc/nginx/sites-available/webapp << 'NGINX'
server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }

    location /static/ {
        alias /opt/data-analyst/repo/webapp/static/;
        expires 1d;
    }

    location /health {
        proxy_pass http://unix:/run/webapp/webapp.sock;
        proxy_set_header Host $host;
        access_log off;
    }
}
NGINX
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx
```

### 2b: Create .env

```bash
SECRET_KEY=$(python3 -c 'import secrets; print(secrets.token_hex(32))')

cat > /opt/data-analyst/.env << EOF
WEBAPP_SECRET_KEY="${SECRET_KEY}"
SERVER_HOST="YOUR_IP"
SERVER_HOSTNAME="YOUR_IP_OR_DOMAIN"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
EOF

chown root:data-ops /opt/data-analyst/.env
chmod 640 /opt/data-analyst/.env
```

### 2c: Create Data Directories & Start

```bash
mkdir -p /data/src_data/{parquet,metadata} /data/docs /data/scripts
chown -R root:data-ops /data
chmod -R 2775 /data

mkdir -p /run/webapp
chown www-data:www-data /run/webapp

systemctl daemon-reload
systemctl start webapp
systemctl enable webapp
```

### Step 2 Checklist

| # | Check | Expected |
|---|-------|----------|
| 2.1 | Nginx | active, port 80 |
| 2.2 | Webapp | active (gunicorn) |
| 2.3 | Health | `curl http://IP/health` returns JSON |
| 2.4 | Login page | HTTP 200 at /login |

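Check 2.3 only states that `/health` returns JSON, so a scripted version can test exactly that and nothing more. The sketch below uses only the standard library; `looks_healthy` and `check_health` are hypothetical names, and no assumption is made about which fields the JSON contains.

```python
import json
import urllib.error
import urllib.request


def looks_healthy(status: int, body: bytes) -> bool:
    """Return True for an HTTP 200 response whose body parses as JSON."""
    if status != 200:
        return False
    try:
        json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return True


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch <base_url>/health and apply looks_healthy(); False on any HTTP or network error."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return looks_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False
```

For example, `check_health("http://YOUR_IP")` would be expected to return `True` once nginx and the webapp are both active.
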
## Step 3: Instance Configuration (Private Repo)

### 3a: Clone Instance Repo

```bash
git clone git@github-cfg:YOUR_ORG/YOUR_INSTANCE_REPO.git /opt/data-analyst/instance
chown -R root:data-ops /opt/data-analyst/instance
chmod -R 770 /opt/data-analyst/instance
```

### 3b: Initialize Instance Config (if empty repo)

If this is a fresh instance repo, create the initial config:

```bash
cd /opt/data-analyst/instance
mkdir -p config docs/setup

cat > config/instance.yaml << 'YAML'
instance:
  name: "My Data Analyst"
  subtitle: "My Organization"
  copyright: "My Org"

server:
  hostname: "YOUR_IP_OR_DOMAIN"
  host: "YOUR_IP"
  app_dir: "/opt/data-analyst"

auth:
  allowed_domain: "mycompany.com"
  webapp_secret_key: "${WEBAPP_SECRET_KEY}"

data_source:
  type: "local"

catalog:
  categories: {}
YAML

# Create .env.example as a template for future deployments
cat > .env.example << 'ENV'
WEBAPP_SECRET_KEY="generate-with: python3 -c 'import secrets; print(secrets.token_hex(32))'"
SERVER_HOST="server-ip"
SERVER_HOSTNAME="server-ip-or-domain"
GOOGLE_CLIENT_ID="placeholder"
GOOGLE_CLIENT_SECRET="placeholder"
DATA_SOURCE="local"
DATA_DIR="/data/src_data"
ENV

cat > .gitignore << 'GI'
.env
.env.local
*.swp
*~
.DS_Store
GI

git add -A && git commit -m "Initial instance config" && git push origin main
```

### 3c: Symlink Config into OSS Repo

```bash
# Remove any existing instance.yaml (from manual setup) and symlink
rm -f /opt/data-analyst/repo/config/instance.yaml
ln -s /opt/data-analyst/instance/config/instance.yaml /opt/data-analyst/repo/config/instance.yaml

# Symlink data_description.md (for Data Catalog - add when ready in Step 6)
ln -sf /opt/data-analyst/instance/config/data_description.md /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp
```

### Step 3 Checklist

| # | Check | Expected |
|---|-------|----------|
| 3.1 | Instance repo | /opt/data-analyst/instance/ exists |
| 3.2 | Symlink | config/instance.yaml -> ../../instance/config/instance.yaml |
| 3.3 | Webapp loads | Instance name shown on login page |

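Check 3.2 can be verified programmatically by resolving the link and comparing it with the file in the instance repo. A minimal sketch using `pathlib`; the function name is illustrative.

```python
from pathlib import Path


def symlink_points_to(link: Path, expected: Path) -> bool:
    """Return True if `link` is a symlink whose resolved target equals `expected`."""
    return link.is_symlink() and link.resolve() == expected.resolve()


if __name__ == "__main__":
    ok = symlink_points_to(
        Path("/opt/data-analyst/repo/config/instance.yaml"),
        Path("/opt/data-analyst/instance/config/instance.yaml"),
    )
    print("instance.yaml symlink OK" if ok else "instance.yaml is not linked to the instance repo")
```

A regular file copied into `repo/config/` would fail this check, which is the situation the `rm -f` in Step 3c guards against.
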
## Step 4: Authentication

Email magic-link sign-in works without any external service.

1. Login page shows "Sign in with Email"
2. User enters an email address on the allowed domain
3. Without SMTP: magic link shown in browser (dev mode)
4. With SMTP: link sent via email
5. Click link -> logged in -> dashboard

Optional: add Google OAuth by setting real `GOOGLE_CLIENT_ID`/`GOOGLE_CLIENT_SECRET`.

### Step 4 Checklist

| # | Check | Expected |
|---|-------|----------|
| 4.1 | Email auth | "Sign in with Email" on login page |
| 4.2 | Magic link | Generated for valid domain email |
| 4.3 | Domain check | Rejects wrong domains |
| 4.4 | Login flow | Magic link -> dashboard with session |

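To make check 4.3 concrete, here is one way a domain check against `auth.allowed_domain` could behave. This is an illustrative sketch, not the webapp's actual implementation; `email_allowed` is a hypothetical function and the exact-match, case-insensitive comparison is an assumption.

```python
def email_allowed(email: str, allowed_domain: str) -> bool:
    """Return True if the address has a local part and its domain equals `allowed_domain`."""
    local, sep, domain = email.strip().rpartition("@")
    # Exact comparison, so "mycompany.com.evil.com" does not pass as "mycompany.com".
    return bool(local) and sep == "@" and domain.lower() == allowed_domain.lower()
```

When testing a deployment, addresses such as `ana@mycompany.com.evil.com` are useful probes, since a naive prefix or substring match would wrongly accept them.
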
## Step 5: Onboarding Flow (End-User)

After server is set up, analysts self-onboard via the webapp:

1. Visit `http://YOUR_SERVER/login` and sign in with email
2. Dashboard shows "Get Started" with 4 steps:
   - Create project folder (`mkdir -p data-analyst && cd data-analyst`)
   - Generate SSH key (`ssh-keygen -t ed25519 -f ~/.ssh/data_analyst_server -N ''`)
   - Copy public key (`cat ~/.ssh/data_analyst_server.pub`)
   - Paste key into form, click "Create Account"
3. After account creation, dashboard shows "Set up your local environment"
4. User runs `claude` in their project folder, pastes setup instructions
5. Claude Code configures SSH, rsyncs data, sets up Python + DuckDB

## Step 6: Sample Data (Try Without a Data Adapter)

Before connecting a real data source, you can load sample data to verify the full pipeline
(Parquet files, Data Catalog with profiling, analyst rsync, Claude Code analysis).

### How the Data Catalog & Profiler Pipeline Works

```
Instance repo                           Server filesystem             Webapp
─────────────                           ────────────────              ──────
config/data_description.md ──symlink──> repo/docs/data_description.md
  (tables, folder_mapping,                        │
   foreign_keys)                                  │
                                                  ▼
config/instance.yaml ────────symlink──> repo/config/instance.yaml
  (catalog.categories,                            │
   labels, icons, order)                          │
                                                  ▼
                                        /data/src_data/parquet/*.parquet
                                                  │
                                        ┌─────────┴──────────┐
                                        ▼                    ▼
                              python -m src.profiler    _load_catalog_data()
                                        │                    │
                                        ▼                    ▼
                              /data/src_data/metadata/  /catalog page
                                profiles.json           (categories + tables)
                                        │
                             ┌──────────┴──────────┐
                             ▼                     ▼
                      /api/catalog/profile/    _load_data_stats()
                      (per-table stats,        (header: "9 tables,
                       columns, alerts,         ~217K rows total")
                       relationships,
                       used_by_metrics)

docs/metrics/*/*.yml ─────────────────> _load_metrics_data()
  (metric definitions,                            │
   SQL examples,                                  ▼
   dimensions)                          /catalog "Business Metrics" card
                                        /api/metrics/<path> (modal detail)
```

Key files and their roles:

| File | Location | Purpose |
|------|----------|---------|
| `data_description.md` | Instance repo | Table definitions, folder_mapping (bucket→category), foreign_keys |
| `instance.yaml` | Instance repo | Catalog category labels, icons, display order |
| `*.parquet` | `/data/src_data/parquet/` | Actual data files (flat or in subfolders) |
| `profiles.json` | `/data/src_data/metadata/` | Profiler output: statistics, alerts, relationships per table |
| `sync_state.json` | `/data/src_data/metadata/` | Sync process stats (optional; profiler provides fallback) |
| `docs/metrics/*/*.yml` | OSS repo (sample) or instance repo (production) | Business metric definitions with SQL examples |

**Folder mapping** serves a dual purpose: it maps table IDs to catalog categories for the UI
and to filesystem paths for the profiler. The profiler auto-detects flat layouts
(all parquet files in one directory) vs subfolder layouts (Keboola-style `parquet/<folder>/<table>.parquet`).

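The two layouts can be told apart with a simple lookup order. The sketch below is a guess at the idea, not the profiler's actual code: `locate_parquet` is a hypothetical helper, and trying the subfolder path before the flat path is an assumption.

```python
from pathlib import Path
from typing import Optional


def locate_parquet(root: Path, folder: str, table: str) -> Optional[Path]:
    """Return the parquet file for `table`, trying the subfolder layout, then the flat one."""
    candidates = (
        root / folder / f"{table}.parquet",  # parquet/<folder>/<table>.parquet
        root / f"{table}.parquet",           # parquet/<table>.parquet
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

With the sample data from Step 6a, which is written flat into `/data/src_data/parquet`, a call such as `locate_parquet(Path("/data/src_data/parquet"), "sales", "orders")` would fall through to the second candidate.
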
### 6a: Generate Parquet Files

```bash
cd /opt/data-analyst/repo

# Install generator dependency
/opt/data-analyst/.venv/bin/pip install faker

# Generate Parquet files directly (uses project's ParquetManager
# for snappy compression, proper types, and metadata embedding)
/opt/data-analyst/.venv/bin/python scripts/generate_sample_data.py \
  --size m --format parquet --output /data/src_data/parquet --seed 42

# Set correct permissions
chown -R root:data-ops /data/src_data/parquet
chmod -R 2775 /data/src_data/parquet
```

Available sizes: `xs` (50 customers, ~1 MB), `s` (500, ~15 MB), `m` (5K, ~150 MB), `l` (50K, ~1.5 GB).
See `docs/sample-data.md` for the full data model and built-in analytical patterns.

### 6b: Configure Data Catalog

The Data Catalog reads from two files in the **instance repo**:

1. **`config/data_description.md`** - YAML block with `folder_mapping`, `tables` (id, name, description, primary_key, sync_strategy, foreign_keys)
2. **`config/instance.yaml`** - `catalog.categories` with label, icon per category + `catalog.order`

The `folder_mapping` maps bucket prefixes from table IDs to category names. Example:
table ID `sample.sales.orders` → bucket `sample.sales` → folder `sales` → category "Sales & Orders".

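The lookup chain in that example can be sketched in a few lines. This is an illustration of the mapping, not the webapp's code: `resolve_category` is a hypothetical name, and treating everything before the last dot as the bucket is an assumption based on the `sample.sales.orders` example.

```python
def resolve_category(table_id, folder_mapping, categories):
    """Return (bucket, folder, label) for a table ID such as 'sample.sales.orders'."""
    bucket, _, _table = table_id.rpartition(".")  # bucket = everything before the last dot
    folder = folder_mapping.get(bucket)           # from data_description.md
    label = categories.get(folder, {}).get("label") if folder else None  # from instance.yaml
    return bucket, folder, label


folder_mapping = {"sample.sales": "sales"}
categories = {"sales": {"label": "Sales & Orders", "icon": "shopping-cart"}}
print(resolve_category("sample.sales.orders", folder_mapping, categories))
# ('sample.sales', 'sales', 'Sales & Orders')
```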
Tables with `foreign_keys` will show interactive relationship diagrams in the profiler modal.

Add `data_description.md` to the instance repo with the sample tables:

```bash
cd /opt/data-analyst/instance

# Create data_description.md (see config/data_description.md.example in OSS repo)
# Must contain a ```yaml block with:
#   folder_mapping: { "bucket.prefix": "category_key", ... }
#   tables: list of table definitions
#
# Each table needs: id, name, description, primary_key, sync_strategy
# Optional: foreign_keys (for profiler Relationships tab)
#
# Example foreign_keys:
#   foreign_keys:
#     - column: "customer_id"
#       references: "customers.customer_id"
#       description: "Ordering customer"

# Add catalog categories to instance.yaml.
# Note: Step 3b already wrote an empty `catalog:` block; delete it first so the
# file does not end up with two top-level `catalog:` keys.
cat >> config/instance.yaml << 'YAML'

catalog:
  categories:
    customers:
      label: "Customers"
      icon: "users"
    products:
      label: "Product Catalog"
      icon: "package"
    marketing:
      label: "Marketing & Campaigns"
      icon: "megaphone"
    web:
      label: "Web Analytics"
      icon: "globe"
    sales:
      label: "Sales & Orders"
      icon: "shopping-cart"
    support:
      label: "Support & Tickets"
      icon: "help-circle"
  order: [customers, products, marketing, web, sales, support]
YAML

git add -A && git commit -m "Add sample data catalog" && git push origin main
```

Then symlink and restart:

```bash
# Symlink data_description.md into OSS repo (if not already done)
ln -sf /opt/data-analyst/instance/config/data_description.md \
  /opt/data-analyst/repo/docs/data_description.md

systemctl restart webapp
```

### 6c: Business Metrics

The Data Catalog includes a **Business Metrics** card that dynamically renders metric
definitions from YAML files. The OSS repo ships with 10 sample e-commerce metrics in
`docs/metrics/` (4 categories: revenue, customers, marketing, support) that align with
the sample data generator tables.

**How it works:**
- Webapp scans `docs/metrics/*/*.yml` (production: `/data/docs/metrics/`)
- Each YAML file defines one metric with SQL examples, dimensions, and notes
- The profiler links metrics to tables via `used_by_metrics` in `profiles.json`
- Clicking a metric opens a modal with Overview, How to Use, SQL Examples, and Technical tabs

**For sample data:** metrics work out of the box - the OSS repo includes sample definitions.

**For production:** create metric YAMLs in the **instance repo** and deploy them to
`/data/docs/metrics/` on the server. The production path takes precedence over the OSS repo.

```bash
# Instance repo: create metric definitions
mkdir -p /opt/data-analyst/instance/docs/metrics/{revenue,operations}
# ... add your .yml files ...

# Deploy metrics to server (copying the directory contents keeps re-runs
# from nesting a second metrics/ directory inside the first)
mkdir -p /data/docs/metrics
cp -r /opt/data-analyst/instance/docs/metrics/. /data/docs/metrics/
chown -R root:data-ops /data/docs/metrics
chmod -R 2775 /data/docs/metrics
```

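The precedence rule can be pictured as a simple directory check. This sketch assumes "takes precedence" means "is used whenever the production directory exists"; the webapp's real logic is not shown in this guide, and both function names are hypothetical.

```python
from pathlib import Path


def metrics_root(production: Path, bundled: Path) -> Path:
    """Prefer the production directory when it exists, else the bundled samples."""
    return production if production.is_dir() else bundled


def metric_files(root: Path) -> list:
    """List metric definitions in the <category>/<metric>.yml layout."""
    return sorted(root.glob("*/*.yml"))


if __name__ == "__main__":
    root = metrics_root(
        Path("/data/docs/metrics"),
        Path("/opt/data-analyst/repo/docs/metrics"),
    )
    print(root, len(metric_files(root)), "metric files")
```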
Each metric YAML file follows this structure (a YAML list containing a single mapping):

```yaml
- name: metric_name
  display_name: Human Readable Name
  category: revenue              # must match parent directory name
  type: sum                      # sum, average, count_distinct, ratio
  unit: USD
  grain: monthly
  time_column: order_date
  table: orders                  # primary table
  tables: [orders, customers]    # optional: all referenced tables
  expression: "SUM(total_amount)"
  description: "What this metric measures..."
  dimensions: [channel, region]
  notes: ["Important context..."]
  synonyms: [alias1, alias2]
  sql: |
    SELECT ... FROM ... GROUP BY ...
  sql_by_channel: |              # any sql_* key is auto-discovered
    SELECT ... GROUP BY channel
```

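Two rules from that structure are easy to get wrong by hand: the file must be a list with one mapping, and `category` must match the parent directory. The following sketch checks both on an already-parsed file (for example the result of a YAML loader, which is not included here). The function names are illustrative, and counting the plain `sql` key together with the `sql_*` keys is an assumption.

```python
from pathlib import Path


def check_metric_doc(path: Path, doc) -> list:
    """Return a list of problems found in one parsed metric file (empty if none)."""
    if not (isinstance(doc, list) and len(doc) == 1 and isinstance(doc[0], dict)):
        return ["file must contain a list with exactly one mapping"]
    metric = doc[0]
    problems = []
    if metric.get("category") != path.parent.name:
        problems.append(
            f"category {metric.get('category')!r} does not match directory {path.parent.name!r}"
        )
    return problems


def sql_examples(metric: dict) -> dict:
    """Collect `sql` and every `sql_*` key from a metric mapping."""
    return {k: v for k, v in metric.items() if k == "sql" or k.startswith("sql_")}
```

Running `check_metric_doc` over every file before copying metrics to `/data/docs/metrics/` is one way to catch a metric saved under the wrong category directory.
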
### 6d: Run Data Profiler

The profiler reads parquet files + `data_description.md` and generates `profiles.json`
with per-table statistics, column analysis, data quality alerts, and relationship maps.

```bash
cd /opt/data-analyst/repo
/opt/data-analyst/.venv/bin/python -m src.profiler
```

Output: `/data/src_data/metadata/profiles.json` (auto-created, readable by webapp).

The profiler provides:
- **Overview**: row count, column count, file size, date coverage, missing cell %
- **Columns**: type distribution, top values, histograms for numeric columns
- **Insights**: data quality alerts (high missing %, imbalanced categories, high cardinality)
- **Relationships**: FK diagram built from `foreign_keys` in `data_description.md`, plus linked Business Metrics
- **Used by Metrics**: shows which metric definitions reference this table (from `docs/metrics/`)
- **Sample**: first 5 rows of the table

Without `sync_state.json` (no data adapter running), the profiler computes file sizes
directly from parquet files, and the catalog header derives table/row counts from `profiles.json`.

To re-run after data changes:

```bash
cd /opt/data-analyst/repo && /opt/data-analyst/.venv/bin/python -m src.profiler
# No webapp restart needed - profiles.json is read on each request
```

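Whether a re-run is due can be estimated from file timestamps alone, without knowing the internal layout of `profiles.json`. A small sketch under that assumption; `profiles_stale` is a hypothetical helper, not part of `src.profiler`.

```python
from pathlib import Path


def profiles_stale(parquet_dir: Path, profiles_path: Path) -> bool:
    """Return True if profiles.json is missing or older than the newest parquet file."""
    if not profiles_path.is_file():
        return True
    # rglob covers both the flat layout and the subfolder layout.
    newest = max(
        (p.stat().st_mtime for p in parquet_dir.rglob("*.parquet")),
        default=0.0,
    )
    return newest > profiles_path.stat().st_mtime
```

For example, `profiles_stale(Path("/data/src_data/parquet"), Path("/data/src_data/metadata/profiles.json"))` returning `True` suggests the profiler should be run again.
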
### Step 6 Checklist

| # | Check | Expected |
|---|-------|----------|
| 6.1 | Parquet files | `ls /data/src_data/parquet/*.parquet` shows 9 files |
| 6.2 | Permissions | Files owned by root:data-ops, group-readable |
| 6.3 | Data Catalog | `/catalog` page shows 6 categories with 9 tables |
| 6.4 | Catalog header | "9 tables, ~217K+ rows total" (from profiles.json) |
| 6.5 | Profile modal | Click "Profile" on any table → statistics, columns, insights |
| 6.6 | Relationships | Orders profile → shows customers, order_items, payments links |
| 6.7 | Used by Metrics | Orders overview → shows total_revenue, campaign_roi, etc. badges |
| 6.8 | Business Metrics | `/catalog` shows "Business Metrics" card with 4 categories, 10 metrics |
| 6.9 | Metric modal | Click any metric → modal with SQL examples, dimensions, notes |
| 6.10 | File sizes | Profile overview shows non-zero file size (e.g., 0.69 MB) |
| 6.11 | Analyst sync | Analyst can rsync parquet files to local machine |
| 6.12 | DuckDB loads | `SELECT count(*) FROM read_parquet('orders.parquet')` returns rows |

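Checks 6.1 and 6.2 lend themselves to a short script. This sketch counts the flat-layout parquet files and flags any that lack the group-read bit; `parquet_report` is an illustrative name, and ownership (root:data-ops) is deliberately not checked here.

```python
import stat
from pathlib import Path


def parquet_report(parquet_dir: Path) -> dict:
    """Count *.parquet files and list those that are not group-readable."""
    files = sorted(parquet_dir.glob("*.parquet"))
    not_group_readable = [
        p.name for p in files if not p.stat().st_mode & stat.S_IRGRP
    ]
    return {"count": len(files), "not_group_readable": not_group_readable}
```

On a server loaded with the sample data, `parquet_report(Path("/data/src_data/parquet"))` would be expected to report a count of 9 and an empty `not_group_readable` list.
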
## Step 7: Real Data Source (Production)

When ready, replace sample data with a real data source adapter in `instance/config/instance.yaml`:

```yaml
data_source:
  type: "keboola"
  keboola:
    storage_token: "${KEBOOLA_STORAGE_TOKEN}"
    stack_url: "https://connection.keboola.com"
    project_id: "12345"
```

Add the token to `.env` and create `config/data_description.md` with table schemas.

Other planned adapters: BigQuery, CSV import.

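The `${KEBOOLA_STORAGE_TOKEN}` value in `instance.yaml` is a reference to a variable defined in `.env`, not the secret itself. This guide does not show how the application resolves such references, so the following is only a sketch of one way it can be done; `expand_placeholders` and its fail-on-missing behaviour are assumptions.

```python
import re

# Matches ${NAME} where NAME is a typical environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value: str, env: dict) -> str:
    """Replace each ${NAME} with env[NAME]; raise KeyError if a name is missing."""
    return _PLACEHOLDER.sub(lambda match: env[match.group(1)], value)
```

Calling `expand_placeholders(value, dict(os.environ))` after `import os` would resolve references against the process environment; failing loudly on a missing variable tends to surface a forgotten `.env` entry earlier than silently keeping the literal placeholder.
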
## Deployment Workflow (Ongoing)

### Update OSS code
```bash
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh  # restarts services, syncs scripts/docs
```

### Update instance config
```bash
cd /opt/data-analyst/instance && git pull
systemctl restart webapp  # picks up new instance.yaml via symlink
```

### Both at once
```bash
git -C /opt/data-analyst/instance pull
cd /opt/data-analyst/repo && git pull
bash server/deploy.sh  # run from the OSS repo; the instance repo has no server/ directory
```

## Server Layout Summary

```
/opt/data-analyst/
├── repo/       -> git@github-oss:ORG/OSS_REPO.git
├── instance/   -> git@github-cfg:ORG/INSTANCE_REPO.git
├── .env        # Secrets (not in git)
├── .venv/      # Python
└── logs/       # App logs

/root/.ssh/
├── deploy_key      # For OSS repo (github-oss alias)
├── instance_key    # For instance repo (github-cfg alias)
└── config          # Maps aliases to keys

Symlinks:
  repo/config/instance.yaml     -> instance/config/instance.yaml
  repo/docs/data_description.md -> instance/config/data_description.md (optional)
```

## Quick Verification

```bash
# Health check
curl -s http://YOUR_IP/health | python3 -m json.tool

# Login page
curl -s -o /dev/null -w "%{http_code}" http://YOUR_IP/login
# Expected: 200

# Instance config loaded
curl -s http://YOUR_IP/login | grep 'YOUR_INSTANCE_NAME'
```