fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience: deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
parent 2635f77974
commit 79443e0df4
2 changed files with 190 additions and 86 deletions
@@ -186,9 +186,10 @@ def _extract_via_legacy(
     table_id = f"{bucket}.{source_table}" if bucket else tc.get("id", tc["name"])
     client.export_table(table_id, Path(csv_path))
 
-    # Convert CSV to Parquet using DuckDB
+    # Convert CSV to Parquet using DuckDB — all_varchar avoids type inference errors
+    # (e.g. columns with mostly numeric values but some strings like "Non-Manager")
     conv_conn = duckdb.connect()
-    conv_conn.execute(f"COPY (SELECT * FROM read_csv_auto('{csv_path}')) TO '{pq_path}' (FORMAT PARQUET)")
+    conv_conn.execute(f"COPY (SELECT * FROM read_csv('{csv_path}', all_varchar=true)) TO '{pq_path}' (FORMAT PARQUET)")
     conv_conn.close()
 finally:
     if os.path.exists(csv_path):
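Review note: the failure this hunk works around can be illustrated without DuckDB. The sketch below is a stdlib-only approximation, not DuckDB's actual CSV sniffer. The hypothetical `infer_column_type` inspects only the first few rows, as sample-based inference does, and guesses a numeric type; the later `Non-Manager` value then fails to cast. Reading every column as text, which is what `all_varchar=true` requests, avoids the guess entirely. The column name, row count, and sample size are assumptions chosen for illustration.

```python
import csv
import io

# Ten numeric rows followed by one string value in the same column.
CSV_TEXT = "employee,seniority\n" + "".join(
    f"e{i},{i % 5}.0\n" for i in range(10)
) + "e10,Non-Manager\n"


def infer_column_type(values, sample_size=5):
    """Guess 'DOUBLE' if every sampled value parses as a float, else 'VARCHAR'."""
    try:
        for value in values[:sample_size]:
            float(value)
    except ValueError:
        return "VARCHAR"
    return "DOUBLE"


def load_column(text, column, all_varchar=False):
    """Return (type_name, values); raises ValueError if a cast fails."""
    rows = list(csv.DictReader(io.StringIO(text)))
    raw = [row[column] for row in rows]
    if all_varchar or infer_column_type(raw) == "VARCHAR":
        return "VARCHAR", raw
    # The guess came from the sample only, so later rows can still break the cast.
    return "DOUBLE", [float(value) for value in raw]


try:
    load_column(CSV_TEXT, "seniority")
    inference_failed = False
except ValueError:
    inference_failed = True

col_type, values = load_column(CSV_TEXT, "seniority", all_varchar=True)
```

The trade-off is that every column in the resulting Parquet file is a string, so any numeric handling has to happen downstream with explicit casts.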
@@ -2,95 +2,198 @@
 
 ## Server Requirements
 
-- Debian 12 / Ubuntu 22.04+
-- 2+ vCPUs, 2+ GB RAM
-- 10+ GB data disk
-- Public IP with DNS
+- Ubuntu 24.04 LTS
+- e2-small (2 vCPU, 2 GB RAM) or larger
+- 30 GB SSD boot disk
+- Docker + Docker Compose
+- Public IP with port 8000 open
 
-## Initial Server Setup
+## Quick Deploy (GCP)
 
-1. Provision a VM (GCP, AWS, Azure, etc.)
+### 1. Create VM
 
-2. Run the setup script:
-```bash
-sudo bash server/setup.sh
-```
-
-This creates:
-- System groups: `data-ops`, `dataread`, `data-private`
-- Deploy user with appropriate permissions
-- Directory structure under `/opt/data-analyst/`
-- Python virtual environment
-
-3. Set up the webapp:
-```bash
-sudo bash server/webapp-setup.sh
-```
-
-This installs:
-- Gunicorn systemd service
-- Nginx reverse proxy with SSL
-- Log rotation
-
-## CI/CD Pipeline
-
-1. Copy the example workflow:
-```bash
-cp .github/workflows/deploy.yml.example .github/workflows/deploy.yml
-```
-
-2. Configure GitHub Secrets:
-- `SERVER_HOST`: Server IP address
-- `SERVER_USER`: Deploy username
-- `SERVER_SSH_KEY`: Deploy SSH private key
-- All environment variables from `.env`
-
-3. Push to `main` branch triggers automatic deployment.
-
-## Directory Structure on Server
-
-```
-/opt/data-analyst/
-├── repo/ # Git clone of this repository
-├── .env # Environment variables (secrets)
-├── .venv/ # Python virtual environment
-└── logs/ # Application logs
-
-/data/
-├── src_data/
-│   ├── parquet/ # Converted data files
-│   ├── metadata/ # Sync state, profiles
-│   └── raw/ # Raw source data
-├── docs/ # Documentation served to analysts
-├── scripts/ # Scripts distributed to analysts
-└── notifications/ # Notification system data
-```
-
-## Separate Config Repository
-
-For production deployments, keep instance config in a separate private repository:
-
-```
-client-config-repo/
-├── config/
-│   ├── instance.yaml
-│   └── data_description.md
-├── .env.example
-└── .github/workflows/deploy.yml
-```
-
-Set `CONFIG_DIR=/opt/data-analyst/client-config/config/` in the environment.
-
-## SSL Setup
-
-Use certbot for Let's Encrypt SSL:
 ```bash
-sudo apt install certbot python3-certbot-nginx
-sudo certbot --nginx -d data.yourcompany.com
+gcloud compute instances create data-analyst-dev \
+  --project=YOUR_PROJECT \
+  --zone=europe-west1-b \
+  --machine-type=e2-small \
+  --image-family=ubuntu-2404-lts-amd64 \
+  --image-project=ubuntu-os-cloud \
+  --boot-disk-size=30GB \
+  --boot-disk-type=pd-ssd \
+  --tags=data-analyst-dev
 ```
 
+### 2. Install Docker
+
+```bash
+curl -fsSL https://get.docker.com | sh
+sudo usermod -aG docker $USER
+# Log out and back in for group change to take effect
+```
+
+### 3. Set up deploy key
+
+Generate an SSH key for GitHub access:
+
+```bash
+ssh-keygen -t ed25519 -f ~/.ssh/agnes_deploy -N "" -C "agnes-deploy"
+cat ~/.ssh/agnes_deploy.pub
+# Add the public key as a deploy key on the GitHub repo
+```
+
+Configure SSH to use it:
+
+```bash
+cat > ~/.ssh/config << 'EOF'
+Host github.com
+  IdentityFile ~/.ssh/agnes_deploy
+  StrictHostKeyChecking no
+EOF
+chmod 600 ~/.ssh/config
+```
+
+### 4. Clone and configure
+
+```bash
+sudo mkdir -p /opt/data-analyst
+sudo chown $USER:$USER /opt/data-analyst
+git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst
+cd /opt/data-analyst
+```
+
+Create `.env`:
+
+```bash
+cat > .env << 'EOF'
+JWT_SECRET_KEY=<generate: python3 -c "import secrets; print(secrets.token_hex(32))">
+DATA_DIR=/data
+LOG_LEVEL=info
+KEBOOLA_STORAGE_TOKEN=<your-keboola-token>
+KEBOOLA_STACK_URL=<your-keboola-stack-url>
+SEED_ADMIN_EMAIL=<admin-email>
+EOF
+chmod 600 .env
+```
+
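Review note: the `.env` heredoc leaves `JWT_SECRET_KEY` as a placeholder that has to be filled in by hand. As a possible alternative, the sketch below generates the secret with `secrets.token_hex(32)` (the same call the placeholder suggests) and writes the file with owner-only permissions from the moment it is created. The helper names `build_env` and `write_env_file` are hypothetical, the token, URL, and email values are placeholders, and the demonstration writes to a temporary directory rather than `/opt/data-analyst`.

```python
import os
import secrets
import tempfile


def build_env(storage_token, stack_url, admin_email):
    """Return the variables listed in the .env step, with a fresh JWT secret."""
    return {
        "JWT_SECRET_KEY": secrets.token_hex(32),
        "DATA_DIR": "/data",
        "LOG_LEVEL": "info",
        "KEBOOLA_STORAGE_TOKEN": storage_token,
        "KEBOOLA_STACK_URL": stack_url,
        "SEED_ADMIN_EMAIL": admin_email,
    }


def write_env_file(path, values):
    """Write KEY=value lines to `path`, creating it with mode 0600."""
    body = "".join(f"{key}={value}\n" for key, value in values.items())
    # O_EXCL refuses to overwrite an existing file; the mode applies at creation.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(body)


# Demonstration against a temporary directory, using placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    env = build_env("example-token", "https://example.invalid", "admin@example.com")
    write_env_file(env_path, env)
    with open(env_path) as handle:
        written = handle.read()
    mode = os.stat(env_path).st_mode & 0o777
```

Creating the file with restrictive permissions up front avoids the short window between `cat > .env` and `chmod 600 .env` in which the secrets are readable under the default umask.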
+Create `config/instance.yaml` (optional, for Keboola source config):
+
+```bash
+cp config/instance.yaml.example config/instance.yaml
+# Edit with your values
+```
+
+### 5. Create data directories
+
+```bash
+sudo mkdir -p /data/state /data/analytics /data/extracts
+sudo chown -R $USER:$USER /data
+```
+
+### 6. Build and start
+
+```bash
+cd /opt/data-analyst
+docker compose up -d
+```
+
+Wait for health check:
+
+```bash
+curl -s http://localhost:8000/api/health | python3 -m json.tool
+```
+
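Review note: "wait for health check" is a manual step as written, since a single `curl` right after `docker compose up -d` may run before the app is listening. A minimal polling sketch follows. It assumes only that `/api/health` returns a JSON body once the app is up; the shape of that body is not documented here, so any decodable JSON response is treated as healthy. `fetch_health` and `wait_until_healthy` are hypothetical names, and the attempt count and delay are arbitrary defaults.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_health(url="http://localhost:8000/api/health", timeout=2.0):
    """Return the decoded JSON body, or None if the app is not answering yet."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP error, timeout, or a non-JSON body.
        return None


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll `fetch` until it returns a body; return it, or None after `attempts`."""
    for attempt in range(attempts):
        body = fetch()
        if body is not None:
            return body
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

On the server this could be called as `wait_until_healthy()` before the bootstrap step; `fetch` and `sleep` are parameters only so the loop can be exercised without a running container.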
+### 7. Bootstrap admin user
+
+```bash
+curl -X POST http://localhost:8000/auth/bootstrap
+```
+
+This creates the first admin user using `SEED_ADMIN_EMAIL` from `.env`.
+
+### 8. Register tables and run first extraction
+
+Register tables via the admin API, then:
+
+```bash
+# Stop app first — DuckDB only supports one writer
+docker compose down
+docker compose run --rm extract
+docker compose up -d
+```
+
+### 9. Open firewall (GCP)
+
+```bash
+gcloud compute firewall-rules create allow-data-analyst-dev \
+  --allow tcp:8000 \
+  --target-tags=data-analyst-dev \
+  --project=YOUR_PROJECT
+```
+
+## Important Notes
+
+### DuckDB Write Locking
+
+DuckDB only supports one writer at a time. When running extraction:
+
+```bash
+docker compose down # Stop app + scheduler
+docker compose run --rm extract # Run extraction
+docker compose up -d # Restart
+```
+
+The scheduler triggers extraction via the API, which handles locking internally.
+
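Review note: run as three separate shell commands, a failed extraction leaves the stack stopped until someone notices. The sketch below wraps the same three commands from this section so that `docker compose up -d` runs even when the extract step fails. `run_manual_extraction` and `run_checked` are hypothetical helpers, and the working directory default is taken from the clone step above.

```python
import subprocess

DOWN = ["docker", "compose", "down"]
EXTRACT = ["docker", "compose", "run", "--rm", "extract"]
UP = ["docker", "compose", "up", "-d"]


def run_checked(command, cwd):
    """Run a command in `cwd` and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, cwd=cwd, check=True)


def run_manual_extraction(cwd="/opt/data-analyst", run=run_checked):
    """Stop the stack, run the one-shot extract, and always bring the stack back."""
    run(DOWN, cwd)
    try:
        run(EXTRACT, cwd)
    finally:
        # Restart even if extraction failed, so the app is not left stopped.
        run(UP, cwd)
```

The extraction error still propagates to the caller after the restart, so a failed run remains visible. The `run` parameter exists only so the ordering can be checked without Docker.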
+### Environment Variable Changes
+
+`docker compose restart` does NOT reload `.env`. Use:
+
+```bash
+docker compose down && docker compose up -d
+```
+
+### Services
+
+| Service | Profile | Description |
+|---------|---------|-------------|
+| `app` | default | FastAPI server on port 8000 |
+| `scheduler` | default | Periodic sync + extraction |
+| `extract` | extract | One-shot data extraction |
+| `telegram-bot` | full | Telegram notifications |
+| `ws-gateway` | full | WebSocket gateway |
+| `corporate-memory` | full | Knowledge collector |
+| `session-collector` | full | Session collection |
+
+Start all services: `docker compose --profile full up -d`
+
+### Directory Structure on Server
+
+```
+/opt/data-analyst/ # Git repo
+  .env # Secrets (chmod 600)
+  config/instance.yaml # Instance config
+
+/data/ # Persistent data (Docker volume)
+  state/system.duckdb # System state (users, registry, sync)
+  analytics/server.duckdb # Analytics views
+  extracts/ # Per-source extract.duckdb + parquets
+    keboola/
+    bigquery/
+    jira/
+```
+
+## CI/CD
+
+Push to `main` triggers GitHub Actions:
+1. Run test suite (607 tests)
+2. Build Docker image
+3. Push to GHCR (`ghcr.io/keboola/agnes-the-ai-analyst`)
+4. Deploy via Kamal
+
 ## Monitoring
 
-- Health check: `GET /health`
-- Logs: `journalctl -u webapp -f`
-- Disk usage: `df -h /data`
+- Health: `GET /api/health`
+- Logs: `docker compose logs -f app`
+- Disk: `df -h /data`
+- Tables: `curl -s http://localhost:8000/api/catalog | python3 -m json.tool`