agnes-the-ai-analyst/scripts/README.md
Petr 26c4e0934d OSS cleanup: remove internal references, harden deployment, add config env interpolation
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
2026-03-09 07:59:57 +01:00


# Scripts
Helper scripts for working with the AI Data Analyst project.
These scripts are synced from the server into `server/scripts/` on the analyst's machine.
## Available Scripts
### `setup_views.sh`
Initialize or refresh DuckDB views on Parquet files.
```bash
bash server/scripts/setup_views.sh
```
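As a rough illustration of what refreshing views over Parquet files can involve, here is a minimal sketch. It assumes one view per Parquet file, named after the file; the function name `generate_view_sql` is hypothetical, and the actual `setup_views.sh` may work differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the real scripts): print one
# CREATE OR REPLACE VIEW statement per Parquet file in a directory.
# Assumption: each view is named after its file (orders.parquet -> "orders").
generate_view_sql() {
  local parquet_dir="$1"
  local file name
  for file in "$parquet_dir"/*.parquet; do
    [ -e "$file" ] || continue   # glob matched nothing, skip
    name="$(basename "$file" .parquet)"
    # Note: paths containing single quotes would need escaping for SQL.
    printf "CREATE OR REPLACE VIEW \"%s\" AS SELECT * FROM read_parquet('%s');\n" \
      "$name" "$file"
  done
}
```

Assuming the `duckdb` command-line client is installed, the generated statements could be piped into the database, e.g. `generate_view_sql server/parquet | duckdb user/duckdb/analytics.duckdb`.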
### `sync_data.sh`
Synchronize data from server, upload user files, and refresh DuckDB.
```bash
# Recommended: update scripts first, then sync
rsync -avz data-analyst:server/scripts/ ./server/scripts/ # Linux/macOS
scp -r data-analyst:server/scripts/* ./server/scripts/ # Windows fallback
bash server/scripts/sync_data.sh
# Other options:
bash server/scripts/sync_data.sh --dry-run # Preview what would be synced (no changes)
bash server/scripts/sync_data.sh --push # Only upload user/ to server
```
**What sync does:**
1. **Self-update check** - detects if sync_data.sh changed, asks to re-run if so
2. Downloads `server/docs/`, `server/scripts/`, `server/metadata/` from server
3. Updates `CLAUDE.md` from latest template
4. Downloads `server/parquet/` data files (with `--delete` to remove old files)
5. Uploads `user/` directory to server (backup, no `--delete`)
6. Syncs Python environment to server
7. **Validates DuckDB** - if corrupted, deletes and recreates from parquets
8. Reinitializes DuckDB views (`CREATE OR REPLACE VIEW` for all tables)
**Self-update mechanism:**
The script checks its own checksum before and after syncing scripts. If it detects it was updated, it exits with a message asking you to run sync again. This ensures you always run the latest sync logic.
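The before/after checksum comparison described above could look roughly like the following. This is only a sketch: the function names `script_checksum` and `script_changed` are hypothetical, and the real script may use a different checksum tool.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a checksum-based self-update check.

# Print a short fingerprint of a file. cksum is POSIX and prints
# "<crc> <size> <name>"; only the CRC and size are kept.
script_checksum() {
  cksum "$1" | awk '{print $1 ":" $2}'
}

# Succeeds (exit status 0) if the file no longer matches the
# fingerprint that was recorded earlier.
script_changed() {
  local before="$1" script_path="$2"
  [ "$(script_checksum "$script_path")" != "$before" ]
}

# Intended use inside a sync script (shown as comments only):
#   before="$(script_checksum "$0")"
#   ... download the latest scripts ...
#   if script_changed "$before" "$0"; then
#     echo "sync_data.sh was updated, please run it again."
#     exit 0
#   fi
```

Exiting instead of continuing avoids running the remaining steps with logic that has already been replaced on disk.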
**DuckDB corruption recovery:**
If the DuckDB file is corrupted (e.g., by an interrupted sync), this is detected automatically and the file is recreated. All data is safe in the Parquet files; DuckDB only contains VIEW definitions.
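One way such a check could be implemented is to open the database and run a trivial query, deleting the file if that fails. The sketch below assumes the `duckdb` command-line client is available; the function name `reset_if_corrupted` and the `DUCKDB_BIN` override are hypothetical and not taken from `sync_data.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: remove a DuckDB file that can no longer be opened,
# so that the views can be recreated from the Parquet files afterwards.
reset_if_corrupted() {
  local db_path="$1"
  local duckdb_bin="${DUCKDB_BIN:-duckdb}"   # override is an assumption, for testing

  # Without a working client we cannot tell "corrupted" from "not installed",
  # so refuse to delete anything.
  command -v "$duckdb_bin" >/dev/null 2>&1 || return 2

  [ -f "$db_path" ] || return 0              # nothing to validate yet

  # A trivial query fails if the file cannot be opened as a database.
  if "$duckdb_bin" "$db_path" -c "SELECT 1;" >/dev/null 2>&1; then
    return 0
  fi

  # Safe to delete: the file only holds VIEW definitions.
  rm -f "$db_path" "$db_path.wal"
  echo "Removed corrupted database: $db_path"
}
```

After a reset, running `bash server/scripts/setup_views.sh` recreates the views.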
## Typical Workflow
1. **First-time setup**: Follow the instructions in `bootstrap.yaml`
2. **Before analysis**: Sync latest data
```bash
bash server/scripts/sync_data.sh
```
3. **Analyze**: Use DuckDB database at `user/duckdb/analytics.duckdb`