docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture

- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references,
  update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to
  config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk
  snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo
  sections; keep Docker Compose operations, log viewing, health checks,
  sync/admin CLI, and Jira webhook procedures
ZdenekSrotyr 2026-04-09 18:44:25 +02:00
parent 7d036760f5
commit c8e232e43e
3 changed files with 347 additions and 2359 deletions


@ -1,161 +1,61 @@
# Disaster Recovery # Disaster Recovery
Recovery procedures for the Data Broker Server (`your-server`). Recovery procedures for the AI Data Analyst Docker deployment.
## Overview ## Overview
``` ```
Disk Layout: What lives where:
sda (10 GB) / System disk (instance) - EXPENDABLE Docker volumes /data DuckDB files, parquet extracts, state
sdb (30 GB) /data Data disk - SNAPSHOTTED daily Git repo/ Application code — rebuild from GitHub
sdc (30 GB) /home Home disk - SNAPSHOTTED daily .env secrets Recreate from GitHub Secrets / 1Password
``` ```
**Key principle**: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted. **Key principle**: the container is disposable. All unique data lives in the `/data`
Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image
and restoring `/data` brings the service back to full operation.
## What Lives Where ## Data Layout
| Location | Content | Recovery Method | | Path | Content | Backup |
|----------|---------|-----------------| |------|---------|--------|
| sda: `/opt/data-analyst/repo/` | Application code | `git clone` from GitHub | | `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| sda: `/opt/data-analyst/.venv/` | Python packages | `pip install -r requirements.txt` | | `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| sda: `/opt/data-analyst/.env` | Application secrets | deploy.sh creates from GitHub secrets | | `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| sda: `/etc/sudoers.d/` | Permissions | deploy.sh copies from repo | | `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |
| sda: `/etc/security/limits.d/` | Resource limits | deploy.sh copies from repo |
| sda: `/etc/nginx/` | Nginx config | deploy.sh or manual copy from repo |
| sda: `/etc/letsencrypt/` | SSL certificate | `certbot` renews automatically |
| sdb: `/data/src_data/parquet/` | Parquet data | Regenerate from Keboola (`update.sh`) or restore snapshot |
| sdb: `/data/notifications/` | Notification state | Restore from snapshot |
| sdb: `/data/docs/`, `/data/scripts/` | Docs & scripts | deploy.sh copies from repo |
| sdc: `/home/*/` | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
## Scenario A: System Disk Failure (sda dies) `analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
on every startup, so it does not need to be backed up separately.
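The Data Layout table above can be turned into a quick post-restore check. The following is a minimal sketch, assuming only the paths listed in the table; `check_data_layout` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def check_data_layout(data_dir):
    """Report which files from the Data Layout table exist under data_dir.

    Hypothetical helper: only the relative paths come from the docs.
    """
    root = Path(data_dir)
    return {
        # Snapshot-backed: not recreated by sync, so it must be restored.
        "system_db": (root / "state" / "system.duckdb").is_file(),
        # Rebuilt on startup, so a missing file right after restore is expected.
        "analytics_db": (root / "analytics" / "server.duckdb").is_file(),
        # One extract.duckdb per source directory under extracts/.
        "extract_dbs": sorted(
            p.relative_to(root).as_posix()
            for p in root.glob("extracts/*/extract.duckdb")
        ),
        # Count of parquet files across all local sources.
        "parquet_files": len(list(root.glob("extracts/*/data/*.parquet"))),
    }
```

Calling `check_data_layout("/data")` on the host (or inside the container) after a restore shows at a glance whether `system.duckdb` came back with the snapshot.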
**Impact**: Server is down, but all user data is safe on sdb/sdc. ## Scenario A: Container Crash / Bad Deploy
**Recovery time**: ~30 minutes **Impact**: Service down, data intact.
### Steps **Recovery time**: ~2 minutes
1. **Create new VM** (same zone, attach existing disks): ```bash
```bash # Pull latest image and restart
# Create new instance with existing disks docker compose pull
gcloud compute instances create your-server \ docker compose up -d
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud \
--boot-disk-size=10GB \
--tags=http-server,https-server
# Attach existing data disks # Check health
gcloud compute instances attach-disk your-server \ curl https://your-instance.example.com/health
--project=your-gcp-project \ ```
--zone=europe-north1-a \
--disk=data-disk
gcloud compute instances attach-disk your-server \ If a bad image was pushed, roll back to the previous tag:
--project=your-gcp-project \ ```bash
--zone=europe-north1-a \ docker compose down
--disk=home-disk # Edit docker-compose.yml to pin the previous image tag
``` docker compose up -d
```
2. **SSH in and mount disks**: ## Scenario B: /data Volume Corruption or Loss
```bash
# Mount data disk
mkdir -p /data
mount /dev/sdb /data
# Mount home disk **Impact**: All DuckDB state and parquet data lost.
mount /dev/sdc /home
# Add to fstab (get UUIDs with blkid) **Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)
echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
```
3. **Install prerequisites**: ### Option 1: Restore from GCP disk snapshot (faster)
```bash
apt-get update
apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
```
4. **Recreate deploy user and groups**:
```bash
# Create groups
groupadd dataread
groupadd data-private
groupadd data-ops
# Create deploy user
useradd -m -s /bin/bash deploy
usermod -aG data-ops deploy
# Restore deploy SSH key (generate new one)
sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
chmod 600 /home/deploy/.ssh/config
# Add new public key to GitHub as Deploy Key
cat /home/deploy/.ssh/id_ed25519.pub
```
5. **Clone repo and run setup**:
```bash
mkdir -p /opt/data-analyst
chown deploy:data-ops /opt/data-analyst
sudo -u deploy git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst/repo
git config --global --add safe.directory /opt/data-analyst/repo
/opt/data-analyst/repo/server/setup.sh
```
6. **Restore user accounts from /home**:
```bash
# Users already exist on home-disk, just recreate /etc/passwd entries
# For each directory in /home (except deploy):
for dir in /home/*/; do
username=$(basename "$dir")
[[ "$username" == "deploy" ]] && continue
# Create user if not exists
if ! id "$username" &>/dev/null; then
useradd -M -d "/home/$username" -s /bin/bash "$username"
usermod -aG dataread "$username"
fi
done
```
Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in `server/limits-users.conf` for admin users.
7. **Trigger deploy via GitHub Actions** (or manually):
```bash
sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
```
8. **Set up SSL certificate**:
```bash
certbot --nginx -d your-instance.example.com
```
9. **Restore crontab**:
```bash
sudo -u deploy crontab -e
# Add:
# MAILTO=admin@your-domain.com
# 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
```
10. **Update external IP** if it changed:
- DNS: `your-instance.example.com` A record
- GitHub secrets: `SERVER_HOST`
- SSH configs of all users
## Scenario B: Data Disk Failure (sdb/data-disk dies)
**Impact**: Parquet data lost, users unaffected.
**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)
### Option 1: Restore from snapshot (faster)
```bash ```bash
# Find latest snapshot # Find latest snapshot
@ -169,95 +69,99 @@ gcloud compute disks create data-disk \
--source-snapshot=SNAPSHOT_NAME \ --source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced --type=pd-balanced
# Attach to VM (may need to stop VM first) # Attach to VM and mount
gcloud compute instances attach-disk your-server \ gcloud compute instances attach-disk your-server \
--project=your-gcp-project \ --project=your-gcp-project \
--zone=europe-north1-a \ --zone=europe-north1-a \
--disk=data-disk --disk=data-disk
# Mount # Restart containers
ssh kids "sudo mount /dev/sdb /data" docker compose up -d
``` ```
### Option 2: Regenerate from Keboola ### Option 2: Regenerate from source
```bash ```bash
# Create fresh disk # Start with empty /data volume
gcloud compute disks create data-disk \ docker compose up -d
--project=your-gcp-project \
--zone=europe-north1-a \
--size=30GB \
--type=pd-balanced
# Attach, format, mount # Trigger a full sync from the data source
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data" curl -X POST http://localhost:8000/api/sync/trigger
# Or via CLI:
# Run deploy to recreate directory structure docker compose exec app da sync
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
``` ```
## Scenario C: Home Disk Failure (sdc/home-disk dies) DuckDB extract files and parquet will be repopulated from Keboola / BigQuery.
`system.duckdb` (table registry, users) must be restored from snapshot if
not regenerated — user accounts and table definitions are not recreated by sync.
**Impact**: All user accounts, SSH keys, and personal workspaces lost. ## Scenario C: Complete VM Loss
**Recovery time**: ~10 minutes (from snapshot) **Recovery time**: ~20 minutes
### Restore from snapshot 1. **Create new VM** (or use managed instance group):
```bash
gcloud compute instances create your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud
```
```bash 2. **Install Docker**:
# Find latest snapshot ```bash
gcloud compute snapshots list --project=your-gcp-project \ curl -fsSL https://get.docker.com | sh
--filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5 ```
# Create new disk from snapshot 3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
gcloud compute disks create home-disk \ ```bash
--project=your-gcp-project \ gcloud compute instances attach-disk your-server \
--zone=europe-north1-a \ --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
--source-snapshot=SNAPSHOT_NAME \ # Add mount to /etc/fstab and mount /data
--type=pd-balanced ```
# Attach to VM 4. **Clone repo and create .env**:
gcloud compute instances attach-disk your-server \ ```bash
--project=your-gcp-project \ git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
--zone=europe-north1-a \ cd /opt/data-analyst
--disk=home-disk cp config/.env.template .env
# Fill in secrets from GitHub Secrets / 1Password
```
# Mount 5. **Start the stack**:
ssh kids "sudo mount /dev/sdc /home" ```bash
``` docker compose up -d
```
If no snapshot exists, users must re-register via https://your-instance.example.com. 6. **Update DNS** if the external IP changed:
- A record for `your-instance.example.com`
## Scenario D: Complete Server Loss (VM + all disks)
**Recovery time**: ~45 minutes
1. Follow **Scenario A** steps 1-5 (new VM, prerequisites, deploy user)
2. Restore `data-disk` from snapshot (Scenario B, Option 1)
3. Restore `home-disk` from snapshot (Scenario C)
4. Follow **Scenario A** steps 6-10 (user accounts, deploy, SSL, cron, IP)
## Verification Checklist ## Verification Checklist
After any recovery, verify: After any recovery, verify:
- [ ] `ssh kids` works (admin access) - [ ] `docker compose ps` — all services `Up`
- [ ] `https://your-instance.example.com` loads (webapp) - [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
- [ ] `https://your-instance.example.com/health` returns OK - [ ] Login works (Google OAuth or email magic link)
- [ ] At least one analyst can SSH in - [ ] At least one table appears in the data catalog
- [ ] `ls /data/src_data/parquet/` shows data - [ ] `docker compose logs app` — no ERROR lines at startup
- [ ] `ls /home/` shows user directories
- [ ] `systemctl status webapp` is active
- [ ] `systemctl status notify-bot` is active
- [ ] `sudo crontab -u deploy -l` shows data sync cron
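The `/health` item in the new checklist could be scripted roughly as follows. This is a sketch under the assumption that the endpoint returns the JSON body `{"status": "ok"}` quoted in the checklist; `is_healthy` and `fetch_health` are hypothetical names, and the URL is the docs' placeholder.

```python
import json
import urllib.request


def is_healthy(body):
    """Return True only if body is a JSON object whose "status" is "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def fetch_health(url, timeout=5.0):
    """Fetch the health endpoint and evaluate the response body.

    Network errors and non-2xx responses propagate as exceptions.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return is_healthy(response.read().decode("utf-8"))


# Example (placeholder hostname from the docs):
# fetch_health("https://your-instance.example.com/health")
```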
## Preventive Measures ## Preventive Measures
- **GCP snapshots**: Daily automatic snapshots of `data-disk` and `home-disk` (14-day retention) - **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
- **Setup script**: `server/setup-snapshot-schedule.sh` configures snapshot policy (14-day retention). Configure via:
- **Limits in git**: `server/limits-users.conf` is version-controlled and deployed automatically ```bash
- **All configs in git**: sudoers, nginx, systemd services, management scripts gcloud compute resource-policies create snapshot-schedule daily-backup \
- **Secrets in GitHub**: `.env` is recreated by deploy.sh from GitHub Actions secrets --project=your-gcp-project \
--region=europe-north1 \
--max-retention-days=14 \
--on-source-disk-delete=keep-auto-snapshots \
--daily-schedule \
--start-time=03:00
gcloud compute disks add-resource-policies data-disk \
--project=your-gcp-project --zone=europe-north1-a \
--resource-policies=daily-backup
```
- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy

File diff suppressed because it is too large


@ -3,6 +3,7 @@
## instance.yaml ## instance.yaml
The main configuration file for your AI Data Analyst instance. Located at `config/instance.yaml`. The main configuration file for your AI Data Analyst instance. Located at `config/instance.yaml`.
See `config/instance.yaml.example` for the full annotated template.
### Instance Branding ### Instance Branding
@ -17,10 +18,11 @@ instance:
```yaml ```yaml
auth: auth:
allowed_domain: "acme.com" # Google OAuth domain restriction allowed_domain: "acme.com" # Email domain restriction for login
``` ```
Only emails from this domain can log in via Google OAuth. External users can be added via password auth (requires SendGrid). Only emails from this domain can log in via Google OAuth or email magic link.
Google OAuth is optional — if not configured, only email magic link auth is available.
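To illustrate the `allowed_domain` restriction described above, here is a minimal sketch of an exact-match domain check. The function name, the case-insensitive comparison, and the rejection of subdomains are all assumptions for illustration; the application's actual rule may differ.

```python
def email_allowed(email, allowed_domain):
    """Return True if the email's domain equals allowed_domain.

    Assumptions (not confirmed by the docs): comparison is
    case-insensitive and subdomains do not match.
    """
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return False  # not a usable address
    return domain.lower() == allowed_domain.strip().lower()
```

With `allowed_domain: "acme.com"`, this sketch accepts `jane@acme.com` and rejects `jane@example.org`.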
### Email ### Email
@ -28,9 +30,15 @@ Only emails from this domain can log in via Google OAuth. External users can be
email: email:
from_address: "noreply@acme.com" from_address: "noreply@acme.com"
from_name: "Acme Data Analyst" from_name: "Acme Data Analyst"
smtp_host: "${SMTP_HOST}"
smtp_port: 587
smtp_user: "${SMTP_USER}"
smtp_password: "${SMTP_PASSWORD}"
``` ```
Used for password auth setup and reset emails. Requires `SENDGRID_API_KEY` in `.env`. Used for magic link authentication. Without SMTP configured, magic links are shown
directly in the browser (development mode). Compatible with any SMTP relay (Gmail,
Mailgun, SendGrid SMTP, etc.).
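The SMTP settings above map onto Python's standard `smtplib` roughly as sketched below. This is not the application's mailer: the function names, the subject line, and the message body are hypothetical, and the port handling follows the common convention that 465 uses implicit TLS while 587 upgrades with STARTTLS.

```python
import smtplib
from email.message import EmailMessage
from email.utils import formataddr


def connection_mode(port):
    """Port 465 conventionally means implicit TLS; otherwise use STARTTLS."""
    return "ssl" if port == 465 else "starttls"


def build_magic_link_message(from_address, from_name, to_address, link):
    """Build a plain-text message; subject and wording are illustrative."""
    msg = EmailMessage()
    msg["From"] = formataddr((from_name, from_address))
    msg["To"] = to_address
    msg["Subject"] = "Your sign-in link"
    msg.set_content(f"Sign in here: {link}\n")
    return msg


def send_message(msg, host, port, user, password):
    """Send msg via the relay, choosing the TLS mode from the port."""
    if connection_mode(port) == "ssl":
        with smtplib.SMTP_SSL(host, port, timeout=10) as smtp:
            smtp.login(user, password)
            smtp.send_message(msg)
    else:
        with smtplib.SMTP(host, port, timeout=10) as smtp:
            smtp.starttls()  # upgrade the plain connection before login
            smtp.login(user, password)
            smtp.send_message(msg)
```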
### Server ### Server
@ -45,6 +53,7 @@ server:
```yaml ```yaml
desktop: desktop:
jwt_issuer: "acme-analyst" jwt_issuer: "acme-analyst"
jwt_secret: "${DESKTOP_JWT_SECRET}"
url_scheme: "acme-analyst" url_scheme: "acme-analyst"
``` ```
@ -52,22 +61,18 @@ desktop:
```yaml ```yaml
data_source: data_source:
type: "keboola" # keboola, csv, bigquery type: "keboola" # keboola, bigquery, local
``` ```
### Users ### Users
```yaml ```yaml
users: users:
john.doe: admin@acme.com:
name: "John Doe" display_name: "John Doe"
initials: "JD" km_admin: true # Corporate Memory admin (optional)
jane.smith:
name: "Jane Smith"
initials: "JS"
username_mapping: username_mapping: {} # Map webapp email -> server username if different
john.doe: john # Only if webapp and server names differ
``` ```
### Datasets ### Datasets
@ -102,11 +107,15 @@ catalog:
## Environment Variables (.env) ## Environment Variables (.env)
Copy `config/.env.template` to `.env` and fill in values. The template contains
the full variable list with comments. Never commit `.env`.
### Required ### Required
| Variable | Description | | Variable | Description |
|----------|-------------| |----------|-------------|
| `WEBAPP_SECRET_KEY` | Flask session secret | | `JWT_SECRET_KEY` | FastAPI JWT token secret (generate with `secrets.token_hex(32)`) |
| `SESSION_SECRET` | Session cookie secret (generate with `secrets.token_hex(32)`) |
| `GOOGLE_CLIENT_ID` | Google OAuth client ID | | `GOOGLE_CLIENT_ID` | Google OAuth client ID |
| `GOOGLE_CLIENT_SECRET` | Google OAuth client secret | | `GOOGLE_CLIENT_SECRET` | Google OAuth client secret |
@ -116,16 +125,29 @@ catalog:
|----------|-------------| |----------|-------------|
| `KEBOOLA_STORAGE_TOKEN` | Keboola Storage API token | | `KEBOOLA_STORAGE_TOKEN` | Keboola Storage API token |
| `KEBOOLA_STACK_URL` | Keboola stack URL | | `KEBOOLA_STACK_URL` | Keboola stack URL |
| `KEBOOLA_PROJECT_ID` | Keboola project ID | | `DATA_DIR` | Data directory path (default: `/data` in Docker, `./data` locally) |
| `DATA_DIR` | Data directory path |
### Data Source (BigQuery)
| Variable | Description |
|----------|-------------|
| `BIGQUERY_PROJECT` | GCP project for job execution/billing |
| `BIGQUERY_LOCATION` | BigQuery location (e.g., `US`, `us-central1`) |
### Optional ### Optional
| Variable | Description | | Variable | Description |
|----------|-------------| |----------|-------------|
| `SENDGRID_API_KEY` | For password auth emails | | `SMTP_HOST` | SMTP relay host for magic link emails |
| `SMTP_PORT` | SMTP port (587 for STARTTLS, 465 for SSL) |
| `SMTP_USER` | SMTP username |
| `SMTP_PASSWORD` | SMTP password |
| `TELEGRAM_BOT_TOKEN` | For Telegram notifications | | `TELEGRAM_BOT_TOKEN` | For Telegram notifications |
| `ANTHROPIC_API_KEY` | For Corporate Memory AI | | `ANTHROPIC_API_KEY` | For Corporate Memory AI (direct Anthropic) |
| `LLM_API_KEY` | API key for LLM proxy (LiteLLM, OpenRouter, etc.) | | `LLM_API_KEY` | API key for LLM proxy (LiteLLM, OpenRouter, etc.) |
| `JIRA_WEBHOOK_SECRET` | For Jira integration | | `JIRA_WEBHOOK_SECRET` | For Jira webhook integration |
| `JIRA_API_TOKEN` | For Jira REST API access |
| `DESKTOP_JWT_SECRET` | Separate secret for desktop app tokens |
| `CONFIG_DIR` | Override config directory path | | `CONFIG_DIR` | Override config directory path |
| `LOG_LEVEL` | Logging level: `debug`, `info`, `warning`, `error` |
| `DOMAIN` | Public hostname for Caddy TLS (production profile) |
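The Required table suggests generating `JWT_SECRET_KEY` and `SESSION_SECRET` with `secrets.token_hex(32)`. A minimal sketch of doing that for both variables at once follows; the helper names are hypothetical, and the output is meant to be pasted into `.env` by hand.

```python
import secrets


def generate_env_secrets(names=("JWT_SECRET_KEY", "SESSION_SECRET")):
    """Return an independent 32-byte (64 hex character) secret per name."""
    return {name: secrets.token_hex(32) for name in names}


def render_env_lines(values):
    """Format the values as KEY=value lines suitable for a .env file."""
    return "\n".join(f"{key}={value}" for key, value in values.items())


# Example: print(render_env_lines(generate_env_secrets()))
```

Each variable gets its own value so that rotating one secret does not invalidate the other.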