agnes-the-ai-analyst/scripts
Petr b99ec576ca Add self-service data onboarding system
Table Registry as central source of truth (JSON) with atomic writes,
optimistic locking, audit logging, and data_description.md generation.
Existing readers (config.py, profiler.py) need zero changes.

Phase 1 - Discovery API:
  - discover_tables() on DataSource ABC + Keboola implementation
  - admin_required decorator with server-side recomputation
  - GET /api/admin/discover-tables endpoint

Phase 2 - Table Registry:
  - src/table_registry.py with CRUD, validation, migration from MD
  - Admin API: register/update/unregister with version locking
  - DELETE cascade cleans up per-user subscriptions

Phase 3 - Auto-Profiling:
  - profile_changed_tables() for incremental profiling
  - Non-fatal hook in sync_all() after successful sync

Phase 4 - Per-Table Subscriptions:
  - table_mode (all/explicit) with per-table toggles
  - GET/POST /api/table-subscriptions endpoints
  - Subscription status in catalog and dashboard views

Phase 5 - Smart Sync:
  - Python-generated rsync filter files (not shell YAML parsing)
  - sync_data.sh uses --filter="merge ..." for explicit mode

Phase 6 - Admin UI:
  - /admin/tables with discovery, registration modal, registry mgmt
  - Vanilla JS, matching existing design system
2026-03-09 14:25:37 +01:00
File                            Last commit                                      Date
activate_venv.sh                Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
backfill_gap.sh                 Extract Jira into connectors/jira module         2026-03-09 11:17:50 +01:00
collect_session.py              Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
dev_run.py                      Merge dev_scripts/ into scripts/                 2026-03-09 13:11:36 +01:00
duckdb_manager.py               Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
generate_user_sync_configs.py   Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
init.sh                         Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
README.md                       Merge dev_scripts/ into scripts/                 2026-03-09 13:11:36 +01:00
setup_views.sh                  Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
sync_config_template.yaml       Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00
sync_data.sh                    Add self-service data onboarding system          2026-03-09 14:25:37 +01:00
test_sync.sh                    Merge dev_scripts/ into scripts/                 2026-03-09 13:11:36 +01:00
update.sh                       Initial commit: OSS data distribution platform   2026-03-08 23:31:28 +01:00

Scripts

Helper scripts for working with the AI Data Analyst project.

These scripts are synced from the server into server/scripts/ on the analyst's machine.

Available Scripts

setup_views.sh

Initialize or refresh DuckDB views on Parquet files.

bash server/scripts/setup_views.sh

sync_data.sh

Synchronize data from server, upload user files, and refresh DuckDB.

# Recommended: update scripts first, then sync
rsync -avz data-analyst:server/scripts/ ./server/scripts/   # Linux/macOS
scp -r data-analyst:server/scripts/* ./server/scripts/      # Windows fallback
bash server/scripts/sync_data.sh

# Other options:
bash server/scripts/sync_data.sh --dry-run  # Preview what would be synced (no changes)
bash server/scripts/sync_data.sh --push     # Only upload user/ to server

What sync does:

  1. Self-update check - detects if sync_data.sh changed, asks to re-run if so
  2. Downloads server/docs/, server/scripts/, server/metadata/ from server
  3. Updates CLAUDE.md from latest template
  4. Downloads server/parquet/ data files (with --delete to remove old files)
  5. Uploads user/ directory to server (backup, no --delete)
  6. Syncs Python environment to server
  7. Validates DuckDB - if corrupted, deletes and recreates from parquets
  8. Reinitializes DuckDB views (CREATE OR REPLACE VIEW for all tables)
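
The download and upload steps above differ mainly in whether --delete is passed. Below is a minimal Python sketch of how such rsync invocations might be assembled. It assumes the data-analyst host alias used in the commands above; the helper name build_rsync_cmd and the remote user/ destination are hypothetical and are not taken from sync_data.sh.

```python
from typing import List


def build_rsync_cmd(src: str, dest: str, delete: bool = False,
                    dry_run: bool = False) -> List[str]:
    """Assemble an rsync argument list (hypothetical helper, not from sync_data.sh)."""
    cmd = ["rsync", "-avz"]
    if delete:
        # Remove files on the receiving side that no longer exist on the sender.
        cmd.append("--delete")
    if dry_run:
        # Preview only: rsync lists what it would transfer without changing anything.
        cmd.append("--dry-run")
    cmd += [src, dest]
    return cmd


# Step 4: download Parquet files and prune files that were removed on the server.
download = build_rsync_cmd("data-analyst:server/parquet/", "./server/parquet/", delete=True)

# Step 5: upload user/ as a backup. No --delete, so nothing on the server is removed.
# The remote destination path is an assumption made for this sketch.
upload = build_rsync_cmd("./user/", "data-analyst:user/")

print(" ".join(download))
print(" ".join(upload))
```

Leaving --delete off the upload means a file deleted locally by accident still survives in the server-side backup.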

Self-update mechanism: The script checks its own checksum before and after syncing scripts. If it detects it was updated, it exits with a message asking you to run sync again. This ensures you always run the latest sync logic.
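
The before/after checksum comparison can be illustrated in Python. This is a sketch of the idea only: the actual script is a shell script, the choice of SHA-256 is an assumption, and the names file_checksum and was_updated are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def file_checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def was_updated(path: Path, sync_scripts: Callable[[], object]) -> bool:
    """Run sync_scripts() and report whether the file at path changed."""
    before = file_checksum(path)
    sync_scripts()  # stand-in for the step that downloads server/scripts/
    after = file_checksum(path)
    return before != after


# Demonstration on a temporary file instead of the real sync_data.sh.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "sync_data.sh"
    script.write_text("echo v1\n")
    unchanged = was_updated(script, lambda: None)
    changed = was_updated(script, lambda: script.write_text("echo v2\n"))

print(unchanged, changed)
```

When the comparison reports a change, the caller would stop and ask the user to run the sync again, so the remaining steps are never executed by stale logic.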

DuckDB corruption recovery: If the DuckDB file is corrupted (e.g., by an interrupted sync), the corruption is detected automatically and the file is recreated. All data is safe in the Parquet files; DuckDB only contains VIEW definitions.
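
Because the database holds only views, rebuilding it amounts to re-issuing one CREATE OR REPLACE VIEW statement per table. The Python sketch below generates such statements using DuckDB's read_parquet function. It assumes a flat directory with one Parquet file per table and a view named after the file stem; the real layout of server/parquet/ and the logic in setup_views.sh may differ, and view_statements is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import List


def view_statements(parquet_dir: Path) -> List[str]:
    """Build one CREATE OR REPLACE VIEW statement per Parquet file.

    Assumes one file per table, with the view named after the file stem.
    """
    statements = []
    for path in sorted(parquet_dir.glob("*.parquet")):
        view = path.stem.replace('"', '""')          # escape identifier quotes
        source = path.as_posix().replace("'", "''")  # escape string quotes
        statements.append(
            f'CREATE OR REPLACE VIEW "{view}" AS '
            f"SELECT * FROM read_parquet('{source}');"
        )
    return statements


# Demonstration with empty placeholder files; only the file names matter here.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("orders.parquet", "customers.parquet"):
        (Path(tmp) / name).touch()
    statements = view_statements(Path(tmp))

for sql in statements:
    print(sql)
```

The generated statements could then be executed against a fresh database file, which is why deleting a corrupted file loses no data.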

Development Scripts

dev_run.py

Flask development server with authentication bypass for local testing.

python3 scripts/dev_run.py

Starts a local Flask server at http://127.0.0.1:5000 with:

  • Auth bypass routes (/dev-login, /dev-catalog) - no OAuth required
  • Debug mode with hot reload

test_sync.sh

Test rsync reliability with the data server.

bash scripts/test_sync.sh           # Full test sync
bash scripts/test_sync.sh --dry-run # Preview only

Typical Workflow

  1. First-time setup: Follow the instructions in bootstrap.yaml
  2. Before analysis: Sync latest data
    bash server/scripts/sync_data.sh
    
  3. Analyze: Use the DuckDB database at user/duckdb/analytics.duckdb