AI-Cognitive-Leap/agnes-the-ai-analyst

Petr 879bc6c44f docs

2026-03-10 11:43:11 +01:00

VM Test Plan - Self-Service Data Onboarding

End-to-end test of the full platform on a clean VM with a new GitHub repository.

Prerequisites

Clean Ubuntu 22.04+ VM (or Debian 12) with root access
GitHub account with ability to create repositories
Domain name pointing to the VM (or use IP + skip SSL)
Keboola project with Storage API token (for discovery/sync testing)
Google OAuth credentials (for login testing)

Step 0: Create GitHub Repository & Push

On your local machine:

cd /Users/padak/github/oss-ai-data-analyst

# Create repo on GitHub (pick org/name)
gh repo create YOUR_ORG/ai-data-analyst --private --source=. --push

# Verify
gh repo view YOUR_ORG/ai-data-analyst

Expected: Repo created, code pushed, visible on GitHub.

Step 1: VM Initial Setup

On the VM as root:

# Clone the repo
REPO_URL="git@github.com:YOUR_ORG/ai-data-analyst.git"
APP_DIR="/opt/data-analyst"
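mkdir -p "$APP_DIR"
ssh-keygen -t ed25519 -f /root/.ssh/deploy_key -N ""
cat /root/.ssh/deploy_key.pub
# Add the public key printed above to the GitHub repo (Settings -> Deploy keys)

# Clone as root using the deploy key. Check 1.2 below suggests setup.sh creates
# the deploy user, so it is assumed not to exist yet on a clean VM.
GIT_SSH_COMMAND="ssh -i /root/.ssh/deploy_key -o IdentitiesOnly=yes" git clone "$REPO_URL" "$APP_DIR/repo"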

# Run setup
cd $APP_DIR/repo
REPO_URL=$REPO_URL bash server/setup.sh

Checklist

#	Check	Command
1.1	Groups created	`getent group data-ops dataread data-private`
1.2	Deploy user exists	`id deploy`
1.3	Directory structure	`ls -la /opt/data-analyst/`
1.4	Python venv works	`/opt/data-analyst/.venv/bin/python -c "import flask; print('OK')"`
1.5	Management scripts	`which add-analyst list-analysts`

Step 2: Webapp Setup

export SERVER_HOSTNAME="data.yourdomain.com"  # or skip SSL with IP
bash server/webapp-setup.sh

Then edit /opt/data-analyst/.env:

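# Required
# Generate the key with: python3 -c 'import secrets; print(secrets.token_hex(32))'
# and paste the output below. A .env file is usually read literally, so $(...) may not be expanded.
WEBAPP_SECRET_KEY="paste-generated-hex-string-here"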
GOOGLE_CLIENT_ID="your-google-client-id"
GOOGLE_CLIENT_SECRET="your-google-client-secret"
SERVER_HOST="YOUR_VM_IP"
SERVER_HOSTNAME="data.yourdomain.com"

# For Keboola discovery/sync
KEBOOLA_STORAGE_TOKEN="your-token"
KEBOOLA_STACK_URL="https://connection.keboola.com"
KEBOOLA_PROJECT_ID="your-project-id"
DATA_SOURCE="keboola"
DATA_DIR="/data/src_data"
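
Before restarting the webapp, it can help to confirm that every required key is actually set. The following is a minimal sketch, assuming the file uses plain `KEY="value"` lines; `parse_env` and `missing_keys` are hypothetical helper names and are not part of the repository.

```python
# Hypothetical helpers for checking a .env file; not part of the repository.
REQUIRED_KEYS = [
    "WEBAPP_SECRET_KEY",
    "GOOGLE_CLIENT_ID",
    "GOOGLE_CLIENT_SECRET",
    "SERVER_HOST",
    "SERVER_HOSTNAME",
]


def parse_env(text):
    """Parse KEY=value lines, ignoring blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="" counts as empty.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in list order."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]
```

Reading `/opt/data-analyst/.env` and passing its contents to `missing_keys` should return an empty list once the required block above is filled in.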

Checklist

#	Check	Command
2.1	Nginx running	`systemctl status nginx`
2.2	Webapp running	`systemctl status webapp`
2.3	SSL cert (if domain)	`curl -I https://data.yourdomain.com/health`
2.4	Health endpoint	`curl http://localhost:5000/health` (or via nginx)
2.5	Login page loads	Browser: `https://data.yourdomain.com/login`

Step 3: Instance Configuration

cd /opt/data-analyst/repo
cp config/instance.yaml.example config/instance.yaml

Edit config/instance.yaml with:

instance.name / instance.subtitle
server.hostname / server.host
auth.allowed_domain (your Google domain)
data_source.type: "keboola" + keboola settings
catalog.categories (at least one, e.g., crm: {label: "CRM", icon: "crm"})
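
The dotted names above can be checked mechanically once the YAML has been loaded into a dictionary. This is a rough sketch that assumes the file is nested exactly as the dotted names suggest; `lookup` and `unset_paths` are hypothetical names, and the real loader in `config.loader` may validate differently.

```python
# Hypothetical check of the settings listed above; assumes a nested dict
# such as the one a YAML loader would return for config/instance.yaml.
REQUIRED_PATHS = [
    "instance.name",
    "instance.subtitle",
    "server.hostname",
    "server.host",
    "auth.allowed_domain",
    "data_source.type",
    "catalog.categories",
]


def lookup(config, dotted_path):
    """Walk a nested dict along 'a.b.c'; return None if any part is missing."""
    node = config
    for part in dotted_path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def unset_paths(config, required=REQUIRED_PATHS):
    """Return the dotted paths whose value is missing or empty."""
    return [path for path in required if not lookup(config, path)]
```

An empty `catalog.categories` mapping is reported as unset, which matches the requirement of at least one category.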

Checklist

#	Check	Command
3.1	Config loads	`cd /opt/data-analyst/repo && .venv/bin/python -c "from config.loader import load_instance_config; print(load_instance_config())"`
3.2	Webapp picks it up	Restart webapp, check login page shows instance name

Step 4: Create Admin Account & Login

Log in via Google OAuth in the browser
Register an account with an SSH key
Verify the user is admin:

id YOUR_USERNAME           # should be in data-ops or sudo group
# If not admin, manually add:
usermod -aG data-ops YOUR_USERNAME

Checklist

#	Check	Command
4.1	Google OAuth works	Login via browser
4.2	Account created	`list-analysts` shows your username
4.3	Dashboard loads	Browser: /dashboard shows data stats
4.4	Admin access	Browser: /admin/tables loads (no 403)

Step 5: Test Discovery API (Phase 1)

In browser, go to /admin/tables and click "Discover tables from source".

Checklist

#	Check	Expected
5.1	Discovery button works	Loading spinner, then tables appear
5.2	Tables grouped by bucket	Buckets shown as collapsible sections
5.3	Table details shown	Name, columns, row count, size for each table
5.4	"Available" badge	All tables show "Available" (none registered yet)
5.5	API direct test	`curl -b cookies.txt https://HOST/api/admin/discover-tables | jq .total`

Step 6: Test Table Registry (Phase 2)

6a: Register tables via Admin UI

Click "Register" on a table in discovery results
Fill in: sync_strategy=full_refresh, confirm primary key
Click "Register Table"
Repeat for 2-3 more tables (try incremental too)

6b: Verify registry

# On server
cat /data/src_data/metadata/table_registry.json | python3 -m json.tool | head -30

# Check generated data_description.md
head -10 /opt/data-analyst/repo/docs/data_description.md
# Should show: <!-- AUTO-GENERATED from table_registry.json -->

# Check audit log
cat /data/src_data/metadata/registry_audit.log

6c: Test via API

# List registry
curl -b cookies.txt https://HOST/api/admin/registry | jq '.tables | length'

# Update a table (replace CURRENT_VERSION with the version number the registry currently reports)
curl -b cookies.txt -X PUT https://HOST/api/admin/registry/in.c-crm.company \
  -H "Content-Type: application/json" \
  -d '{"description": "Updated via API", "version": CURRENT_VERSION}'

# Delete a table
curl -b cookies.txt -X DELETE https://HOST/api/admin/registry/in.c-crm.company \
  -H "Content-Type: application/json" \
  -d '{"version": CURRENT_VERSION}'

Checklist

#	Check	Expected
6.1	Register table	Success, table appears in registry panel
6.2	Badge changes	Registered tables show green "Registered" badge
6.3	data_description.md	Generated with AUTO-GENERATED header + checksum
6.4	Audit log written	Actions logged with timestamps and emails
6.5	Optimistic locking	Stale version POST returns 409
6.6	Edit table	PUT changes description/strategy
6.7	Delete table	Table removed, badge reverts to "Available"
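
Check 6.5 relies on optimistic locking: a write is accepted only if the caller supplies the version it last read. The sketch below illustrates the idea with an in-memory stand-in. `VersionedRegistry` and `status_for` are hypothetical, and although the smoke test imports a `ConflictError` from `src.table_registry`, the class here is a local placeholder and not the project's implementation.

```python
# Illustrative stand-in for version-checked writes; not the project's TableRegistry.
class ConflictError(Exception):
    """Raised when the caller's version does not match the stored one."""


class VersionedRegistry:
    def __init__(self):
        self.version = 0
        self.tables = {}

    def _check(self, expected_version):
        if expected_version != self.version:
            raise ConflictError(
                f"expected version {expected_version}, registry is at {self.version}"
            )

    def register(self, table_id, config, expected_version):
        """Store a table and bump the version, or raise on a stale version."""
        self._check(expected_version)
        self.tables[table_id] = dict(config)
        self.version += 1
        return self.version

    def unregister(self, table_id, expected_version):
        """Remove a table and bump the version, or raise on a stale version."""
        self._check(expected_version)
        self.tables.pop(table_id, None)
        self.version += 1
        return self.version


def status_for(action):
    """Map the outcome of a write to an HTTP-style status code."""
    try:
        action()
        return 200
    except ConflictError:
        return 409
```

Two clients that both read version 0 illustrate the check: the first write succeeds and moves the registry to version 1, and the second, still sending version 0, gets the 409.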

Step 7: Test Data Sync + Auto-Profiling (Phase 3)

cd /opt/data-analyst/repo
source .venv/bin/activate

# Run sync for registered tables
python -m src.data_sync

Checklist

#	Check	Expected
7.1	Sync completes	Tables downloaded, Parquet created
7.2	Schema.yml generated	`cat docs/schema.yml | head`
7.3	Auto-profiling ran	Log shows "Auto-profiling: N profiled"
7.4	profiles.json exists	`ls -la /data/src_data/metadata/profiles.json`
7.5	Catalog shows profiles	Browser: /catalog -> click table -> profile data loads

Step 8: Test Per-Table Subscriptions (Phase 4)

8a: Via API

# Get current subscriptions
curl -b cookies.txt https://HOST/api/table-subscriptions | jq .

# Switch to explicit mode, subscribe to specific tables
curl -b cookies.txt -X POST https://HOST/api/table-subscriptions \
  -H "Content-Type: application/json" \
  -d '{
    "table_mode": "explicit",
    "tables": {"company": true, "contact": true, "events": false}
  }'

8b: Via Catalog UI

Go to /catalog
Tables should show subscription status (all subscribed in "all" mode)
After switching to "explicit" mode via API, unsubscribed tables should be visually different

Checklist

#	Check	Expected
8.1	Default is "all" mode	GET returns `table_mode: "all"`
8.2	Switch to explicit	POST succeeds, settings saved
8.3	Config YAML updated	`cat /home/USERNAME/.sync_settings.yaml` shows `table_mode: explicit`
8.4	Catalog reflects subs	Subscribed vs unsubscribed tables visually distinct
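
The two modes can be summarized as a small decision function. This is a sketch under stated assumptions: the settings dictionary mirrors the POST body shown in 8a, and a table that is not listed in explicit mode is treated as unsubscribed, which this plan does not confirm. `is_subscribed` is a hypothetical name.

```python
# Hypothetical resolution of a user's subscription settings.
def is_subscribed(settings, table):
    """Return True if the table should be synced for this user."""
    mode = settings.get("table_mode", "all")
    if mode == "all":
        # Default mode: every table is subscribed.
        return True
    if mode == "explicit":
        # Assumption: tables absent from the mapping count as unsubscribed.
        return bool(settings.get("tables", {}).get(table, False))
    raise ValueError(f"unknown table_mode: {mode!r}")
```

With the payload from 8a, `company` and `contact` resolve to subscribed and `events` does not, which is the distinction check 8.4 expects the catalog to show.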

Step 9: Test Smart Sync (Phase 5)

9a: Check rsync filter generation

# After setting explicit subscriptions:
cat /home/USERNAME/.sync_rsync_filter
# Should show include/exclude rules

9b: Test from analyst machine

# On analyst machine (or simulate):
bash server/scripts/sync_data.sh --dry-run
# Should show filter-based sync when explicit mode is active

Checklist

#	Check	Expected
9.1	Filter file exists	`.sync_rsync_filter` created in user home
9.2	Correct include/exclude	Subscribed tables included, others excluded
9.3	Dry-run uses filter	`--filter="merge ..."` in rsync output
9.4	Fallback works	Without filter file, syncs everything (backwards compat)
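
The real filter comes from `generate_rsync_filter` in `webapp.sync_settings_service`, whose output format is not shown in this plan. As a rough illustration of checks 9.2 and 9.4, the sketch below builds rsync include/exclude rules (`+ pattern` and `- pattern`) from subscription settings. The function name `build_rsync_filter` and the one-file-per-table naming `<table>.parquet` are assumptions.

```python
# Hypothetical filter builder; file naming (<table>.parquet) is an assumption.
def build_rsync_filter(settings, known_tables):
    """Return rsync filter rules as text, or None when no filter is needed."""
    if settings.get("table_mode", "all") != "explicit":
        # "all" mode: no filter file, so rsync copies everything (check 9.4).
        return None
    chosen = settings.get("tables", {})
    lines = []
    for table in sorted(known_tables):
        # "+" includes a subscribed table, "-" excludes everything else.
        prefix = "+" if chosen.get(table) else "-"
        lines.append(f"{prefix} {table}.parquet")
    return "".join(f"{line}\n" for line in lines)
```

Returning `None` instead of an empty filter keeps the fallback explicit: the caller can skip the `--filter` option entirely when nothing was written.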

Step 10: Migration Test (One-Time Bootstrap)

If you already have a docs/data_description.md with tables defined:

python3 -c "
from src.table_registry import TableRegistry
from pathlib import Path

registry = TableRegistry.import_from_data_description(
    Path('docs/data_description.md'),
    Path('/data/src_data/metadata/table_registry.json'),
    registered_by='migration@test.com'
)
print(f'Migrated {len(registry.list_tables())} tables')
print(f'Version: {registry.version}')
"

Checklist

#	Check	Expected
10.1	Migration succeeds	All tables imported
10.2	Registry JSON valid	`cat /data/src_data/metadata/table_registry.json | python3 -m json.tool`
10.3	migrated_from marker	`"migrated_from": "docs/data_description.md"` in metadata
10.4	Admin UI shows tables	/admin/tables lists all migrated tables
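
Check 10.3 can be scripted without knowing exactly where the marker sits in the registry file. The sketch below searches a parsed JSON document for a key at any depth; `find_key` is a hypothetical helper, and the sample document in the usage note is invented for illustration.

```python
import json


# Hypothetical helper: locate a key anywhere in a parsed JSON document.
def find_key(node, key):
    """Yield every value stored under `key` at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def has_migration_marker(registry_text, source="docs/data_description.md"):
    """Return True if the registry JSON records the given migration source."""
    return source in find_key(json.loads(registry_text), "migrated_from")
```

Passing the contents of `/data/src_data/metadata/table_registry.json` to `has_migration_marker` should return `True` after a successful migration.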

Step 11: Regression Tests

cd /opt/data-analyst/repo
source .venv/bin/activate
python -m pytest tests/ -v

Checklist

#	Check	Expected
11.1	All tests pass	132+ tests, 0 failures
11.2	No import errors	All modules load cleanly

Quick Smoke Test Script

Run this after full setup to verify the critical path:

#!/bin/bash
# smoke_test.sh - Quick verification of self-service onboarding
set -e

APP_DIR="/opt/data-analyst/repo"
cd "$APP_DIR"
source .venv/bin/activate

echo "=== Smoke Test ==="

# 1. Tests
echo "[1/5] Running tests..."
python -m pytest tests/ -q --tb=short
echo "  PASS"

# 2. Registry module
echo "[2/5] Testing Table Registry..."
python -c "
from src.table_registry import TableRegistry
from pathlib import Path
import tempfile
# Use a throwaway directory so the registry file is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp:
    r = TableRegistry(Path(tmp) / 'registry.json')
    r.register_table({'id': 'test.t', 'name': 't', 'primary_key': 'id', 'sync_strategy': 'full_refresh'}, 'test')
    assert r.is_registered('test.t')
    r.unregister_table('test.t')
    assert not r.is_registered('test.t')
print('  PASS')
"

# 3. Discovery (needs Keboola credentials)
echo "[3/5] Testing Discovery API..."
python -c "
try:
    from src.data_sync import create_data_source
    ds = create_data_source()
    tables = ds.discover_tables()
    print(f'  PASS - Discovered {len(tables)} tables')
except Exception as e:
    print(f'  SKIP - {e}')
"

# 4. Profiler API
echo "[4/5] Testing Profiler API..."
python -c "
from src.profiler import profile_changed_tables
result = profile_changed_tables([])
assert result == {'success': 0, 'errors': 0, 'skipped': 0}
print('  PASS')
"

# 5. Webapp imports
echo "[5/5] Testing Webapp imports..."
python -c "
from webapp.auth import admin_required, login_required
from webapp.sync_settings_service import get_table_subscriptions, generate_rsync_filter
from src.table_registry import TableRegistry, ConflictError
print('  PASS')
"

echo ""
echo "=== All smoke tests passed ==="

Troubleshooting

Problem	Fix
`/admin/tables` returns 403	User not in `data-ops` group. Run `usermod -aG data-ops USERNAME`
Discovery returns empty	Check `KEBOOLA_STORAGE_TOKEN` in `.env`, verify `DATA_SOURCE=keboola`
Profiles not generated	Check `/data/src_data/parquet/` has parquet files, check DuckDB installed
Rsync filter not created	Check `sudo` permissions for `www-data` in sudoers-webapp
`data_description.md` not updating	Check write permissions on `docs/` directory
Webapp won't start	Check `journalctl -u webapp -n 50` for errors
