- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI - Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs - Remove real GCP project ID from terraform.tfvars.example - Delete internal draft documents (dev_docs/draft/) - Update infra/main.tf to clone from main branch
428 lines
12 KiB
Markdown
428 lines
12 KiB
Markdown
# VM Test Plan - Self-Service Data Onboarding
|
|
|
|
End-to-end test of the full platform on a clean VM with a new GitHub repository.
|
|
|
|
## Prerequisites
|
|
|
|
- Clean Ubuntu 22.04+ VM (or Debian 12) with root access
|
|
- GitHub account with ability to create repositories
|
|
- Domain name pointing to the VM (or use IP + skip SSL)
|
|
- Keboola project with Storage API token (for discovery/sync testing)
|
|
- Google OAuth credentials (for login testing)
|
|
|
|
---
|
|
|
|
## Step 0: Create GitHub Repository & Push
|
|
|
|
**On your local machine:**
|
|
|
|
```bash
|
|
cd /path/to/agnes-the-ai-analyst
|
|
|
|
# Create repo on GitHub (pick org/name)
|
|
gh repo create YOUR_ORG/ai-data-analyst --private --source=. --push
|
|
|
|
# Verify
|
|
gh repo view YOUR_ORG/ai-data-analyst
|
|
```
|
|
|
|
**Expected:** Repo created, code pushed, visible on GitHub.
|
|
|
|
---
|
|
|
|
## Step 1: VM Initial Setup
|
|
|
|
**On the VM as root:**
|
|
|
|
```bash
|
|
# Clone the repo
|
|
REPO_URL="git@github.com:YOUR_ORG/ai-data-analyst.git"
|
|
APP_DIR="/opt/data-analyst"
|
|
mkdir -p $APP_DIR
|
|
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N ""
|
|
# Add deploy key to GitHub repo (Settings -> Deploy keys)
|
|
|
|
sudo -u deploy git clone $REPO_URL $APP_DIR/repo
|
|
|
|
# Run setup
|
|
cd $APP_DIR/repo
|
|
REPO_URL=$REPO_URL bash server/setup.sh
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Command |
|
|
|---|-------|---------|
|
|
| 1.1 | Groups created | `getent group data-ops dataread data-private` |
|
|
| 1.2 | Deploy user exists | `id deploy` |
|
|
| 1.3 | Directory structure | `ls -la /opt/data-analyst/` |
|
|
| 1.4 | Python venv works | `/opt/data-analyst/.venv/bin/python -c "import flask; print('OK')"` |
|
|
| 1.5 | Management scripts | `which add-analyst list-analysts` |
|
|
|
|
---
|
|
|
|
## Step 2: Webapp Setup
|
|
|
|
```bash
|
|
export SERVER_HOSTNAME="data.yourdomain.com" # or skip SSL with IP
|
|
bash server/webapp-setup.sh
|
|
```
|
|
|
|
Then edit `/opt/data-analyst/.env`:
|
|
|
|
```bash
|
|
# Required
|
|
WEBAPP_SECRET_KEY="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
|
|
GOOGLE_CLIENT_ID="your-google-client-id"
|
|
GOOGLE_CLIENT_SECRET="your-google-client-secret"
|
|
SERVER_HOST="YOUR_VM_IP"
|
|
SERVER_HOSTNAME="data.yourdomain.com"
|
|
|
|
# For Keboola discovery/sync
|
|
KEBOOLA_STORAGE_TOKEN="your-token"
|
|
KEBOOLA_STACK_URL="https://connection.keboola.com"
|
|
KEBOOLA_PROJECT_ID="your-project-id"
|
|
DATA_SOURCE="keboola"
|
|
DATA_DIR="/data/src_data"
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Command |
|
|
|---|-------|---------|
|
|
| 2.1 | Nginx running | `systemctl status nginx` |
|
|
| 2.2 | Webapp running | `systemctl status webapp` |
|
|
| 2.3 | SSL cert (if domain) | `curl -I https://data.yourdomain.com/health` |
|
|
| 2.4 | Health endpoint | `curl http://localhost:5000/health` (or via nginx) |
|
|
| 2.5 | Login page loads | Browser: `https://data.yourdomain.com/login` |
|
|
|
|
---
|
|
|
|
## Step 3: Instance Configuration
|
|
|
|
```bash
|
|
cd /opt/data-analyst/repo
|
|
cp config/instance.yaml.example config/instance.yaml
|
|
```
|
|
|
|
Edit `config/instance.yaml` with:
|
|
- `instance.name` / `instance.subtitle`
|
|
- `server.hostname` / `server.host`
|
|
- `auth.allowed_domain` (your Google domain)
|
|
- `data_source.type: "keboola"` + keboola settings
|
|
- `catalog.categories` (at least one, e.g., `crm: {label: "CRM", icon: "crm"}`)
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Command |
|
|
|---|-------|---------|
|
|
| 3.1 | Config loads | `cd /opt/data-analyst/repo && .venv/bin/python -c "from config.loader import load_instance_config; print(load_instance_config())"` |
|
|
| 3.2 | Webapp picks it up | Restart webapp, check login page shows instance name |
|
|
|
|
---
|
|
|
|
## Step 4: Create Admin Account & Login
|
|
|
|
1. Login via Google OAuth in browser
|
|
2. Register account with SSH key
|
|
3. Verify the user is admin:
|
|
|
|
```bash
|
|
id YOUR_USERNAME # should be in data-ops or sudo group
|
|
# If not admin, manually add:
|
|
usermod -aG data-ops YOUR_USERNAME
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Command |
|
|
|---|-------|---------|
|
|
| 4.1 | Google OAuth works | Login via browser |
|
|
| 4.2 | Account created | `list-analysts` shows your username |
|
|
| 4.3 | Dashboard loads | Browser: /dashboard shows data stats |
|
|
| 4.4 | Admin access | Browser: /admin/tables loads (no 403) |
|
|
|
|
---
|
|
|
|
## Step 5: Test Discovery API (Phase 1)
|
|
|
|
In browser, go to `/admin/tables` and click "Discover tables from source".
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 5.1 | Discovery button works | Loading spinner, then tables appear |
|
|
| 5.2 | Tables grouped by bucket | Buckets shown as collapsible sections |
|
|
| 5.3 | Table details shown | Name, columns, row count, size for each table |
|
|
| 5.4 | "Available" badge | All tables show "Available" (none registered yet) |
|
|
| 5.5 | API direct test | `curl -b cookies.txt https://HOST/api/admin/discover-tables \| jq .total` |
|
|
|
|
---
|
|
|
|
## Step 6: Test Table Registry (Phase 2)
|
|
|
|
### 6a: Register tables via Admin UI
|
|
|
|
1. Click "Register" on a table in discovery results
|
|
2. Fill in: sync_strategy=full_refresh, confirm primary key
|
|
3. Click "Register Table"
|
|
4. Repeat for 2-3 more tables (try incremental too)
|
|
|
|
### 6b: Verify registry
|
|
|
|
```bash
|
|
# On server
|
|
cat /data/src_data/metadata/table_registry.json | python3 -m json.tool | head -30
|
|
|
|
# Check generated data_description.md
|
|
head -10 /opt/data-analyst/repo/docs/data_description.md
|
|
# Should show: <!-- AUTO-GENERATED from table_registry.json -->
|
|
|
|
# Check audit log
|
|
cat /data/src_data/metadata/registry_audit.log
|
|
```
|
|
|
|
### 6c: Test via API
|
|
|
|
```bash
|
|
# List registry
|
|
curl -b cookies.txt https://HOST/api/admin/registry | jq '.tables | length'
|
|
|
|
# Update a table
|
|
curl -b cookies.txt -X PUT https://HOST/api/admin/registry/in.c-crm.company \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"description": "Updated via API", "version": CURRENT_VERSION}'
|
|
|
|
# Delete a table
|
|
curl -b cookies.txt -X DELETE https://HOST/api/admin/registry/in.c-crm.company \
|
|
-H "Content-Type: application/json" \
|
|
-d '{"version": CURRENT_VERSION}'
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 6.1 | Register table | Success, table appears in registry panel |
|
|
| 6.2 | Badge changes | Registered tables show green "Registered" badge |
|
|
| 6.3 | data_description.md | Generated with AUTO-GENERATED header + checksum |
|
|
| 6.4 | Audit log written | Actions logged with timestamps and emails |
|
|
| 6.5 | Optimistic locking | Stale version POST returns 409 |
|
|
| 6.6 | Edit table | PUT changes description/strategy |
|
|
| 6.7 | Delete table | Table removed, badge reverts to "Available" |
|
|
|
|
---
|
|
|
|
## Step 7: Test Data Sync + Auto-Profiling (Phase 3)
|
|
|
|
```bash
|
|
cd /opt/data-analyst/repo
|
|
source .venv/bin/activate
|
|
|
|
# Run sync for registered tables
|
|
python -m src.data_sync
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 7.1 | Sync completes | Tables downloaded, Parquet created |
|
|
| 7.2 | Schema.yml generated | `cat docs/schema.yml \| head` |
|
|
| 7.3 | Auto-profiling ran | Log shows "Auto-profiling: N profiled" |
|
|
| 7.4 | profiles.json exists | `ls -la /data/src_data/metadata/profiles.json` |
|
|
| 7.5 | Catalog shows profiles | Browser: /catalog -> click table -> profile data loads |
|
|
|
|
---
|
|
|
|
## Step 8: Test Per-Table Subscriptions (Phase 4)
|
|
|
|
### 8a: Via API
|
|
|
|
```bash
|
|
# Get current subscriptions
|
|
curl -b cookies.txt https://HOST/api/table-subscriptions | jq .
|
|
|
|
# Switch to explicit mode, subscribe to specific tables
|
|
curl -b cookies.txt -X POST https://HOST/api/table-subscriptions \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"table_mode": "explicit",
|
|
"tables": {"company": true, "contact": true, "events": false}
|
|
}'
|
|
```
|
|
|
|
### 8b: Via Catalog UI
|
|
|
|
1. Go to /catalog
|
|
2. Tables should show subscription status (all subscribed in "all" mode)
|
|
3. After switching to "explicit" mode via API, unsubscribed tables should be visually different
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 8.1 | Default is "all" mode | GET returns `table_mode: "all"` |
|
|
| 8.2 | Switch to explicit | POST succeeds, settings saved |
|
|
| 8.3 | Config YAML updated | `cat /home/USERNAME/.sync_settings.yaml` shows `table_mode: explicit` |
|
|
| 8.4 | Catalog reflects subs | Subscribed vs unsubscribed tables visually distinct |
|
|
|
|
---
|
|
|
|
## Step 9: Test Smart Sync (Phase 5)
|
|
|
|
### 9a: Check rsync filter generation
|
|
|
|
```bash
|
|
# After setting explicit subscriptions:
|
|
cat /home/USERNAME/.sync_rsync_filter
|
|
# Should show include/exclude rules
|
|
```
|
|
|
|
### 9b: Test from analyst machine
|
|
|
|
```bash
|
|
# On analyst machine (or simulate):
|
|
bash server/scripts/sync_data.sh --dry-run
|
|
# Should show filter-based sync when explicit mode is active
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 9.1 | Filter file exists | `.sync_rsync_filter` created in user home |
|
|
| 9.2 | Correct include/exclude | Subscribed tables included, others excluded |
|
|
| 9.3 | Dry-run uses filter | `--filter="merge ..."` in rsync output |
|
|
| 9.4 | Fallback works | Without filter file, syncs everything (backwards compat) |
|
|
|
|
---
|
|
|
|
## Step 10: Migration Test (One-Time Bootstrap)
|
|
|
|
If you already have a `docs/data_description.md` with tables defined:
|
|
|
|
```bash
|
|
python3 -c "
|
|
from src.table_registry import TableRegistry
|
|
from pathlib import Path
|
|
|
|
registry = TableRegistry.import_from_data_description(
|
|
Path('docs/data_description.md'),
|
|
Path('/data/src_data/metadata/table_registry.json'),
|
|
registered_by='migration@test.com'
|
|
)
|
|
print(f'Migrated {len(registry.list_tables())} tables')
|
|
print(f'Version: {registry.version}')
|
|
"
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 10.1 | Migration succeeds | All tables imported |
|
|
| 10.2 | Registry JSON valid | `cat table_registry.json \| python3 -m json.tool` |
|
|
| 10.3 | migrated_from marker | `"migrated_from": "docs/data_description.md"` in metadata |
|
|
| 10.4 | Admin UI shows tables | /admin/tables lists all migrated tables |
|
|
|
|
---
|
|
|
|
## Step 11: Regression Tests
|
|
|
|
```bash
|
|
cd /opt/data-analyst/repo
|
|
source .venv/bin/activate
|
|
python -m pytest tests/ -v
|
|
```
|
|
|
|
### Checklist
|
|
|
|
| # | Check | Expected |
|
|
|---|-------|----------|
|
|
| 11.1 | All tests pass | 132+ tests, 0 failures |
|
|
| 11.2 | No import errors | All modules load cleanly |
|
|
|
|
---
|
|
|
|
## Quick Smoke Test Script
|
|
|
|
Run this after full setup to verify the critical path:
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
# smoke_test.sh - Quick verification of self-service onboarding
|
|
set -e
|
|
|
|
APP_DIR="/opt/data-analyst/repo"
|
|
cd "$APP_DIR"
|
|
source .venv/bin/activate
|
|
|
|
echo "=== Smoke Test ==="
|
|
|
|
# 1. Tests
|
|
echo "[1/5] Running tests..."
|
|
python -m pytest tests/ -q --tb=short
|
|
echo " PASS"
|
|
|
|
# 2. Registry module
|
|
echo "[2/5] Testing Table Registry..."
|
|
python -c "
|
|
from src.table_registry import TableRegistry
|
|
from pathlib import Path
|
|
import tempfile
|
|
r = TableRegistry(Path(tempfile.mktemp(suffix='.json')))
|
|
r.register_table({'id': 'test.t', 'name': 't', 'primary_key': 'id', 'sync_strategy': 'full_refresh'}, 'test')
|
|
assert r.is_registered('test.t')
|
|
r.unregister_table('test.t')
|
|
assert not r.is_registered('test.t')
|
|
print(' PASS')
|
|
"
|
|
|
|
# 3. Discovery (needs Keboola credentials)
|
|
echo "[3/5] Testing Discovery API..."
|
|
python -c "
|
|
try:
|
|
from src.data_sync import create_data_source
|
|
ds = create_data_source()
|
|
tables = ds.discover_tables()
|
|
print(f' PASS - Discovered {len(tables)} tables')
|
|
except Exception as e:
|
|
print(f' SKIP - {e}')
|
|
"
|
|
|
|
# 4. Profiler API
|
|
echo "[4/5] Testing Profiler API..."
|
|
python -c "
|
|
from src.profiler import profile_changed_tables
|
|
result = profile_changed_tables([])
|
|
assert result == {'success': 0, 'errors': 0, 'skipped': 0}
|
|
print(' PASS')
|
|
"
|
|
|
|
# 5. Webapp imports
|
|
echo "[5/5] Testing Webapp imports..."
|
|
python -c "
|
|
from webapp.auth import admin_required, login_required
|
|
from webapp.sync_settings_service import get_table_subscriptions, generate_rsync_filter
|
|
from src.table_registry import TableRegistry, ConflictError
|
|
print(' PASS')
|
|
"
|
|
|
|
echo ""
|
|
echo "=== All smoke tests passed ==="
|
|
```
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
| Problem | Fix |
|
|
|---------|-----|
|
|
| `/admin/tables` returns 403 | User not in `data-ops` group. Run `usermod -aG data-ops USERNAME` |
|
|
| Discovery returns empty | Check `KEBOOLA_STORAGE_TOKEN` in `.env`, verify `DATA_SOURCE=keboola` |
|
|
| Profiles not generated | Check `/data/src_data/parquet/` has parquet files, check DuckDB installed |
|
|
| Rsync filter not created | Check `sudo` permissions for `www-data` in sudoers-webapp |
|
|
| `data_description.md` not updating | Check write permissions on `docs/` directory |
|
|
| Webapp won't start | Check `journalctl -u webapp -n 50` for errors |
|