docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture

- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references,
  update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to
  config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk
  snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo
  sections; keep Docker Compose operations, log viewing, health checks,
  sync/admin CLI, and Jira webhook procedures
ZdenekSrotyr 2026-04-09 18:44:25 +02:00
parent 7d036760f5
commit c8e232e43e
3 changed files with 347 additions and 2359 deletions


@ -1,161 +1,61 @@
# Disaster Recovery
Recovery procedures for the Data Broker Server (`your-server`).
Recovery procedures for the AI Data Analyst Docker deployment.
## Overview
```
Disk Layout:
sda (10 GB) / System disk (instance) - EXPENDABLE
sdb (30 GB) /data Data disk - SNAPSHOTTED daily
sdc (30 GB) /home Home disk - SNAPSHOTTED daily
What lives where:
Docker volumes /data DuckDB files, parquet extracts, state
Git repo/ Application code — rebuild from GitHub
.env secrets Recreate from GitHub Secrets / 1Password
```
**Key principle**: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.
**Key principle**: the container is disposable. All unique data lives in the `/data`
Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image
and restoring `/data` brings the service back to full operation.
## What Lives Where
## Data Layout
| Location | Content | Recovery Method |
|----------|---------|-----------------|
| sda: `/opt/data-analyst/repo/` | Application code | `git clone` from GitHub |
| sda: `/opt/data-analyst/.venv/` | Python packages | `pip install -r requirements.txt` |
| sda: `/opt/data-analyst/.env` | Application secrets | deploy.sh creates from GitHub secrets |
| sda: `/etc/sudoers.d/` | Permissions | deploy.sh copies from repo |
| sda: `/etc/security/limits.d/` | Resource limits | deploy.sh copies from repo |
| sda: `/etc/nginx/` | Nginx config | deploy.sh or manual copy from repo |
| sda: `/etc/letsencrypt/` | SSL certificate | `certbot` renews automatically |
| sdb: `/data/src_data/parquet/` | Parquet data | Regenerate from Keboola (`update.sh`) or restore snapshot |
| sdb: `/data/notifications/` | Notification state | Restore from snapshot |
| sdb: `/data/docs/`, `/data/scripts/` | Docs & scripts | deploy.sh copies from repo |
| sdc: `/home/*/` | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
| Path | Content | Backup |
|------|---------|--------|
| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |
## Scenario A: System Disk Failure (sda dies)
`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
on every startup, so it does not need to be backed up separately.
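After a restore, the layout in the table above can be checked before starting the stack. This is a minimal sketch, assuming the paths shown in the table; the function name `check_data_layout` is hypothetical and not part of the project. It skips `analytics/server.duckdb`, which is regenerated on start.

```bash
# check_data_layout [ROOT]
# Report whether the snapshot-backed files from the Data Layout table
# exist under ROOT (default: /data). Returns non-zero if system.duckdb
# is missing; the extract count is informational only.
check_data_layout() {
  layout_root=${1:-/data}
  layout_status=0

  if [ -f "$layout_root/state/system.duckdb" ]; then
    echo "ok: state/system.duckdb"
  else
    echo "MISSING: state/system.duckdb"
    layout_status=1
  fi

  # Count per-source extract databases; an unmatched glob stays literal
  # and fails the -f test, so it is not counted.
  layout_count=0
  for layout_db in "$layout_root"/extracts/*/extract.duckdb; do
    if [ -f "$layout_db" ]; then
      layout_count=$((layout_count + 1))
    fi
  done
  echo "extract databases found: $layout_count"

  return "$layout_status"
}
```

Running `check_data_layout /data` on the host (or inside the container, if `/data` is mounted there) gives a quick signal on whether a snapshot restore brought back the registry database.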
**Impact**: Server is down, but all user data is safe on sdb/sdc.
## Scenario A: Container Crash / Bad Deploy
**Recovery time**: ~30 minutes
**Impact**: Service down, data intact.
### Steps
**Recovery time**: ~2 minutes
1. **Create new VM** (same zone, attach existing disks):
```bash
# Create new instance with existing disks
gcloud compute instances create your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud \
--boot-disk-size=10GB \
--tags=http-server,https-server
```bash
# Pull latest image and restart
docker compose pull
docker compose up -d
# Attach existing data disks
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=data-disk
# Check health
curl https://your-instance.example.com/health
```
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=home-disk
```
If a bad image was pushed, roll back to the previous tag:
```bash
docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
```
2. **SSH in and mount disks**:
```bash
# Mount data disk
mkdir -p /data
mount /dev/sdb /data
## Scenario B: /data Volume Corruption or Loss
# Mount home disk
mount /dev/sdc /home
**Impact**: All DuckDB state and parquet data lost.
# Add to fstab (get UUIDs with blkid)
echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
```
**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)
3. **Install prerequisites**:
```bash
apt-get update
apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
```
4. **Recreate deploy user and groups**:
```bash
# Create groups
groupadd dataread
groupadd data-private
groupadd data-ops
# Create deploy user
useradd -m -s /bin/bash deploy
usermod -aG data-ops deploy
# Restore deploy SSH key (generate new one)
sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
chmod 600 /home/deploy/.ssh/config
# Add new public key to GitHub as Deploy Key
cat /home/deploy/.ssh/id_ed25519.pub
```
5. **Clone repo and run setup**:
```bash
mkdir -p /opt/data-analyst
chown deploy:data-ops /opt/data-analyst
sudo -u deploy git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst/repo
git config --global --add safe.directory /opt/data-analyst/repo
/opt/data-analyst/repo/server/setup.sh
```
6. **Restore user accounts from /home**:
```bash
# Users already exist on home-disk, just recreate /etc/passwd entries
# For each directory in /home (except deploy):
for dir in /home/*/; do
username=$(basename "$dir")
[[ "$username" == "deploy" ]] && continue
# Create user if not exists
if ! id "$username" &>/dev/null; then
useradd -M -d "/home/$username" -s /bin/bash "$username"
usermod -aG dataread "$username"
fi
done
```
Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in `server/limits-users.conf` for admin users.
7. **Trigger deploy via GitHub Actions** (or manually):
```bash
sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
```
8. **Set up SSL certificate**:
```bash
certbot --nginx -d your-instance.example.com
```
9. **Restore crontab**:
```bash
sudo -u deploy crontab -e
# Add:
# MAILTO=admin@your-domain.com
# 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
```
10. **Update external IP** if it changed:
- DNS: `your-instance.example.com` A record
- GitHub secrets: `SERVER_HOST`
- SSH configs of all users
## Scenario B: Data Disk Failure (sdb/data-disk dies)
**Impact**: Parquet data lost, users unaffected.
**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)
### Option 1: Restore from snapshot (faster)
### Option 1: Restore from GCP disk snapshot (faster)
```bash
# Find latest snapshot
@ -169,95 +69,99 @@ gcloud compute disks create data-disk \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM (may need to stop VM first)
# Attach to VM and mount
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=data-disk
# Mount
ssh kids "sudo mount /dev/sdb /data"
# Restart containers
docker compose up -d
```
### Option 2: Regenerate from Keboola
### Option 2: Regenerate from source
```bash
# Create fresh disk
gcloud compute disks create data-disk \
--project=your-gcp-project \
--zone=europe-north1-a \
--size=30GB \
--type=pd-balanced
# Start with empty /data volume
docker compose up -d
# Attach, format, mount
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"
# Run deploy to recreate directory structure
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
# Or via CLI:
docker compose exec app da sync
```
## Scenario C: Home Disk Failure (sdc/home-disk dies)
DuckDB extract files and parquet will be repopulated from Keboola / BigQuery.
`system.duckdb` (table registry, users) must be restored from a snapshot,
because sync does not recreate user accounts or table definitions.
**Impact**: All user accounts, SSH keys, and personal workspaces lost.
## Scenario C: Complete VM Loss
**Recovery time**: ~10 minutes (from snapshot)
**Recovery time**: ~20 minutes
### Restore from snapshot
1. **Create new VM** (or use managed instance group):
```bash
gcloud compute instances create your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud
```
```bash
# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
--filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5
2. **Install Docker**:
```bash
curl -fsSL https://get.docker.com | sh
```
# Create new disk from snapshot
gcloud compute disks create home-disk \
--project=your-gcp-project \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
```bash
gcloud compute instances attach-disk your-server \
--project=your-gcp-project --zone=europe-north1-a --disk=data-disk
# Add mount to /etc/fstab and mount /data
```
# Attach to VM
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=home-disk
4. **Clone repo and create .env**:
```bash
git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
cd /opt/data-analyst
cp config/.env.template .env
# Fill in secrets from GitHub Secrets / 1Password
```
# Mount
ssh kids "sudo mount /dev/sdc /home"
```
5. **Start the stack**:
```bash
docker compose up -d
```
If no snapshot exists, users must re-register via https://your-instance.example.com.
## Scenario D: Complete Server Loss (VM + all disks)
**Recovery time**: ~45 minutes
1. Follow **Scenario A** steps 1-5 (new VM, prerequisites, deploy user)
2. Restore `data-disk` from snapshot (Scenario B, Option 1)
3. Restore `home-disk` from snapshot (Scenario C)
4. Follow **Scenario A** steps 6-10 (user accounts, deploy, SSL, cron, IP)
6. **Update DNS** if the external IP changed:
- A record for `your-instance.example.com`
## Verification Checklist
After any recovery, verify:
- [ ] `ssh kids` works (admin access)
- [ ] `https://your-instance.example.com` loads (webapp)
- [ ] `https://your-instance.example.com/health` returns OK
- [ ] At least one analyst can SSH in
- [ ] `ls /data/src_data/parquet/` shows data
- [ ] `ls /home/` shows user directories
- [ ] `systemctl status webapp` is active
- [ ] `systemctl status notify-bot` is active
- [ ] `sudo crontab -u deploy -l` shows data sync cron
- [ ] `docker compose ps` — all services `Up`
- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
- [ ] Login works (Google OAuth or email magic link)
- [ ] At least one table appears in the data catalog
- [ ] `docker compose logs app` — no ERROR lines at startup
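The health item in the checklist can be scripted so a recovery does not depend on eyeballing `curl` output. This is a sketch, assuming `/health` returns a JSON body containing `"status": "ok"` as the checklist states; the function names `health_ok` and `wait_healthy`, the retry count, and the timeouts are illustrative and not part of the project.

```bash
# health_ok: read a /health response body on stdin and succeed only if
# it contains "status": "ok" (whitespace around the colon is optional).
health_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Poll URL until health_ok passes or the attempts run out.
wait_healthy() {
  wh_url=$1
  wh_attempts=${2:-10}
  wh_delay=${3:-3}
  wh_i=1
  while [ "$wh_i" -le "$wh_attempts" ]; do
    # An empty or failed response makes health_ok fail, so we retry.
    if curl -fsS --max-time 5 "$wh_url" 2>/dev/null | health_ok; then
      echo "healthy after $wh_i attempt(s)"
      return 0
    fi
    sleep "$wh_delay"
    wh_i=$((wh_i + 1))
  done
  echo "not healthy after $wh_attempts attempts" >&2
  return 1
}
```

For example, `wait_healthy https://your-instance.example.com/health` after `docker compose up -d` waits up to about 30 seconds with the default settings.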
## Preventive Measures
- **GCP snapshots**: Daily automatic snapshots of `data-disk` and `home-disk` (14-day retention)
- **Setup script**: `server/setup-snapshot-schedule.sh` configures snapshot policy
- **Limits in git**: `server/limits-users.conf` is version-controlled and deployed automatically
- **All configs in git**: sudoers, nginx, systemd services, management scripts
- **Secrets in GitHub**: `.env` is recreated by deploy.sh from GitHub Actions secrets
- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
(14-day retention). Configure via:
```bash
gcloud compute resource-policies create snapshot-schedule daily-backup \
--project=your-gcp-project \
--region=europe-north1 \
--max-retention-days=14 \
--on-source-disk-delete=keep-auto-snapshots \
--daily-schedule \
--start-time=03:00
gcloud compute disks add-resource-policies data-disk \
--project=your-gcp-project --zone=europe-north1-a \
--resource-policies=daily-backup
```
- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy
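
Pinning an image tag, as the last item suggests and as the Scenario A rollback requires, can be done without hand-editing. This is a sketch, assuming the compose file has a plain `image: name:tag` line; the function name `pin_image_tag` and the image name in the usage note are hypothetical, since the real image name is not given here.

```bash
# pin_image_tag FILE IMAGE TAG
# Rewrite every "image: IMAGE" or "image: IMAGE:<old-tag>" line in FILE
# to "image: IMAGE:TAG". Other services' image lines are left untouched.
pin_image_tag() {
  pin_file=$1
  pin_image=$2
  pin_tag=$3

  # Dots are the only regex-special characters expected in an image
  # name; escape them so "ghcr.io" does not match "ghcrXio".
  pin_regex=$(printf '%s' "$pin_image" | sed 's/\./\\./g')

  # Use "|" as the sed delimiter because image names contain "/".
  sed -E "s|^([[:space:]]*image:[[:space:]]*)${pin_regex}(:[^[:space:]]+)?[[:space:]]*\$|\1${pin_image}:${pin_tag}|" \
    "$pin_file" > "$pin_file.tmp" && mv "$pin_file.tmp" "$pin_file"
}
```

For example, `pin_image_tag docker-compose.yml ghcr.io/example/app v1.2.3` followed by `docker compose up -d` would roll the service back to that tag. Images referenced by digest (`@sha256:...`) are not handled by this sketch.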

File diff suppressed because it is too large.


@ -3,6 +3,7 @@
## instance.yaml
The main configuration file for your AI Data Analyst instance. Located at `config/instance.yaml`.
See `config/instance.yaml.example` for the full annotated template.
### Instance Branding
@ -17,10 +18,11 @@ instance:
```yaml
auth:
allowed_domain: "acme.com" # Google OAuth domain restriction
allowed_domain: "acme.com" # Email domain restriction for login
```
Only emails from this domain can log in via Google OAuth. External users can be added via password auth (requires SendGrid).
Only emails from this domain can log in via Google OAuth or email magic link.
Google OAuth is optional — if not configured, only email magic link auth is available.
### Email
@ -28,9 +30,15 @@ Only emails from this domain can log in via Google OAuth. External users can be
email:
from_address: "noreply@acme.com"
from_name: "Acme Data Analyst"
smtp_host: "${SMTP_HOST}"
smtp_port: 587
smtp_user: "${SMTP_USER}"
smtp_password: "${SMTP_PASSWORD}"
```
Used for password auth setup and reset emails. Requires `SENDGRID_API_KEY` in `.env`.
Used for magic link authentication. Without SMTP configured, magic links are shown
directly in the browser (development mode). Compatible with any SMTP relay (Gmail,
Mailgun, SendGrid SMTP, etc.).
### Server
@ -45,6 +53,7 @@ server:
```yaml
desktop:
jwt_issuer: "acme-analyst"
jwt_secret: "${DESKTOP_JWT_SECRET}"
url_scheme: "acme-analyst"
```
@ -52,22 +61,18 @@ desktop:
```yaml
data_source:
type: "keboola" # keboola, csv, bigquery
type: "keboola" # keboola, bigquery, local
```
### Users
```yaml
users:
john.doe:
name: "John Doe"
initials: "JD"
jane.smith:
name: "Jane Smith"
initials: "JS"
admin@acme.com:
display_name: "John Doe"
km_admin: true # Corporate Memory admin (optional)
username_mapping:
john.doe: john # Only if webapp and server names differ
username_mapping: {} # Map webapp email -> server username if different
```
### Datasets
@ -102,11 +107,15 @@ catalog:
## Environment Variables (.env)
Copy `config/.env.template` to `.env` and fill in values. The template contains
the full variable list with comments. Never commit `.env`.
### Required
| Variable | Description |
|----------|-------------|
| `WEBAPP_SECRET_KEY` | Flask session secret |
| `JWT_SECRET_KEY` | FastAPI JWT token secret (generate with Python's `secrets.token_hex(32)`) |
| `SESSION_SECRET` | Session cookie secret (generate with Python's `secrets.token_hex(32)`) |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret |
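
`secrets.token_hex(32)` yields 64 hexadecimal characters (32 random bytes). Where Python is not at hand, a value of the same shape can be produced in the shell. This is a sketch, assuming `/dev/urandom` is available on the host; the function name `gen_secret` is hypothetical.

```bash
# gen_secret [BYTES]
# Print BYTES random bytes (default 32) from /dev/urandom as lowercase
# hex on a single line, i.e. 2 * BYTES characters.
gen_secret() {
  head -c "${1:-32}" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

Each secret should be generated separately, for example `echo "JWT_SECRET_KEY=$(gen_secret)" >> .env` and then `echo "SESSION_SECRET=$(gen_secret)" >> .env`.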
@ -116,16 +125,29 @@ catalog:
|----------|-------------|
| `KEBOOLA_STORAGE_TOKEN` | Keboola Storage API token |
| `KEBOOLA_STACK_URL` | Keboola stack URL |
| `KEBOOLA_PROJECT_ID` | Keboola project ID |
| `DATA_DIR` | Data directory path |
| `DATA_DIR` | Data directory path (default: `/data` in Docker, `./data` locally) |
### Data Source (BigQuery)
| Variable | Description |
|----------|-------------|
| `BIGQUERY_PROJECT` | GCP project for job execution/billing |
| `BIGQUERY_LOCATION` | BigQuery location (e.g., `US`, `us-central1`) |
### Optional
| Variable | Description |
|----------|-------------|
| `SENDGRID_API_KEY` | For password auth emails |
| `SMTP_HOST` | SMTP relay host for magic link emails |
| `SMTP_PORT` | SMTP port (587 for STARTTLS, 465 for SSL) |
| `SMTP_USER` | SMTP username |
| `SMTP_PASSWORD` | SMTP password |
| `TELEGRAM_BOT_TOKEN` | For Telegram notifications |
| `ANTHROPIC_API_KEY` | For Corporate Memory AI |
| `ANTHROPIC_API_KEY` | For Corporate Memory AI (direct Anthropic) |
| `LLM_API_KEY` | API key for LLM proxy (LiteLLM, OpenRouter, etc.) |
| `JIRA_WEBHOOK_SECRET` | For Jira integration |
| `JIRA_WEBHOOK_SECRET` | For Jira webhook integration |
| `JIRA_API_TOKEN` | For Jira REST API access |
| `DESKTOP_JWT_SECRET` | Separate secret for desktop app tokens |
| `CONFIG_DIR` | Override config directory path |
| `LOG_LEVEL` | Logging level: `debug`, `info`, `warning`, `error` |
| `DOMAIN` | Public hostname for Caddy TLS (production profile) |
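
Before starting the stack with a freshly recreated `.env`, it can help to confirm that the variables you depend on are actually filled in. This is a sketch, assuming simple `NAME=value` lines as in a typical env file; the function name `missing_env_vars` is hypothetical, and which names to pass is your decision (the Required table above is one starting point).

```bash
# missing_env_vars FILE NAME...
# Print each NAME that has no non-empty "NAME=value" line in FILE.
# Empty values (NAME= or NAME="") count as missing.
# Returns non-zero if anything is missing.
missing_env_vars() {
  env_file=$1
  shift
  env_missing=0
  for env_name in "$@"; do
    # Optional opening quote, then at least one real character.
    if ! grep -Eq "^${env_name}=[\"']?[^\"'[:space:]]" "$env_file"; then
      echo "$env_name"
      env_missing=1
    fi
  done
  return "$env_missing"
}
```

For example, `missing_env_vars .env JWT_SECRET_KEY SESSION_SECRET && docker compose up -d` starts the stack only when both secrets are present.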