docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture

- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references, update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo sections; keep Docker Compose operations, log viewing, health checks, sync/admin CLI, and Jira webhook procedures
2026-04-09 18:44:25 +02:00
commit c8e232e43e
parent 7d036760f5
3 changed files with 347 additions and 2359 deletions
--- a/dev_docs/disaster-recovery.md
+++ b/dev_docs/disaster-recovery.md
@@ -1,161 +1,61 @@
 # Disaster Recovery

-Recovery procedures for the Data Broker Server (`your-server`).
+Recovery procedures for the AI Data Analyst Docker deployment.

 ## Overview

 ```
-Disk Layout:
-  sda (10 GB) /         System disk (instance) - EXPENDABLE
-  sdb (30 GB) /data     Data disk - SNAPSHOTTED daily
-  sdc (30 GB) /home     Home disk - SNAPSHOTTED daily
+What lives where:
+  Docker volumes  /data        DuckDB files, parquet extracts, state
+  Git             repo/        Application code — rebuild from GitHub
+  .env            secrets      Recreate from GitHub Secrets / 1Password
 ```

-**Key principle**: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.
+**Key principle**: the container is disposable. All unique data lives in the `/data`
+Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image
+and restoring `/data` brings the service back to full operation.

-## What Lives Where
+## Data Layout

-| Location | Content | Recovery Method |
-|----------|---------|-----------------|
-| sda: `/opt/data-analyst/repo/` | Application code | `git clone` from GitHub |
-| sda: `/opt/data-analyst/.venv/` | Python packages | `pip install -r requirements.txt` |
-| sda: `/opt/data-analyst/.env` | Application secrets | deploy.sh creates from GitHub secrets |
-| sda: `/etc/sudoers.d/` | Permissions | deploy.sh copies from repo |
-| sda: `/etc/security/limits.d/` | Resource limits | deploy.sh copies from repo |
-| sda: `/etc/nginx/` | Nginx config | deploy.sh or manual copy from repo |
-| sda: `/etc/letsencrypt/` | SSL certificate | `certbot` renews automatically |
-| sdb: `/data/src_data/parquet/` | Parquet data | Regenerate from Keboola (`update.sh`) or restore snapshot |
-| sdb: `/data/notifications/` | Notification state | Restore from snapshot |
-| sdb: `/data/docs/`, `/data/scripts/` | Docs & scripts | deploy.sh copies from repo |
-| sdc: `/home/*/` | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
+| Path | Content | Backup |
+|------|---------|--------|
+| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
+| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
+| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
+| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |

-## Scenario A: System Disk Failure (sda dies)
+`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
+on every startup, so it does not need to be backed up separately.

-**Impact**: Server is down, but all user data is safe on sdb/sdc.
+## Scenario A: Container Crash / Bad Deploy

-**Recovery time**: ~30 minutes
+**Impact**: Service down, data intact.

-### Steps
+**Recovery time**: ~2 minutes

-1. **Create new VM** (same zone, attach existing disks):
-   ```bash
-   # Create new instance with existing disks
-   gcloud compute instances create your-server \
-     --project=your-gcp-project \
-     --zone=europe-north1-a \
-     --machine-type=e2-medium \
-     --image-family=debian-12 \
-     --image-project=debian-cloud \
-     --boot-disk-size=10GB \
-     --tags=http-server,https-server
+```bash
+# Pull latest image and restart
+docker compose pull
+docker compose up -d

-   # Attach existing data disks
-   gcloud compute instances attach-disk your-server \
-     --project=your-gcp-project \
-     --zone=europe-north1-a \
-     --disk=data-disk
+# Check health
+curl https://your-instance.example.com/health
+```

-   gcloud compute instances attach-disk your-server \
-     --project=your-gcp-project \
-     --zone=europe-north1-a \
-     --disk=home-disk
-   ```
+If a bad image was pushed, roll back to the previous tag:
+```bash
+docker compose down
+# Edit docker-compose.yml to pin the previous image tag
+docker compose up -d
+```

-2. **SSH in and mount disks**:
-   ```bash
-   # Mount data disk
-   mkdir -p /data
-   mount /dev/sdb /data
+## Scenario B: /data Volume Corruption or Loss

-   # Mount home disk
-   mount /dev/sdc /home
+**Impact**: All DuckDB state and parquet data lost.

-   # Add to fstab (get UUIDs with blkid)
-   echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
-   echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
-   ```
+**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)

-3. **Install prerequisites**:
-   ```bash
-   apt-get update
-   apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
-   ```
-
-4. **Recreate deploy user and groups**:
-   ```bash
-   # Create groups
-   groupadd dataread
-   groupadd data-private
-   groupadd data-ops
-
-   # Create deploy user
-   useradd -m -s /bin/bash deploy
-   usermod -aG data-ops deploy
-
-   # Restore deploy SSH key (generate new one)
-   sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
-   sudo -u deploy bash -c 'echo -e "Host github.com\n  IdentityFile ~/.ssh/id_ed25519\n  StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
-   chmod 600 /home/deploy/.ssh/config
-
-   # Add new public key to GitHub as Deploy Key
-   cat /home/deploy/.ssh/id_ed25519.pub
-   ```
-
-5. **Clone repo and run setup**:
-   ```bash
-   mkdir -p /opt/data-analyst
-   chown deploy:data-ops /opt/data-analyst
-   sudo -u deploy git clone git@github.com:keboola/agnes-the-ai-analyst.git /opt/data-analyst/repo
-   git config --global --add safe.directory /opt/data-analyst/repo
-   /opt/data-analyst/repo/server/setup.sh
-   ```
-
-6. **Restore user accounts from /home**:
-   ```bash
-   # Users already exist on home-disk, just recreate /etc/passwd entries
-   # For each directory in /home (except deploy):
-   for dir in /home/*/; do
-     username=$(basename "$dir")
-     [[ "$username" == "deploy" ]] && continue
-     # Create user if not exists
-     if ! id "$username" &>/dev/null; then
-       useradd -M -d "/home/$username" -s /bin/bash "$username"
-       usermod -aG dataread "$username"
-     fi
-   done
-   ```
-   Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in `server/limits-users.conf` for admin users.
-
-7. **Trigger deploy via GitHub Actions** (or manually):
-   ```bash
-   sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
-   ```
-
-8. **Set up SSL certificate**:
-   ```bash
-   certbot --nginx -d your-instance.example.com
-   ```
-
-9. **Restore crontab**:
-   ```bash
-   sudo -u deploy crontab -e
-   # Add:
-   # MAILTO=admin@your-domain.com
-   # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
-   ```
-
-10. **Update external IP** if it changed:
-    - DNS: `your-instance.example.com` A record
-    - GitHub secrets: `SERVER_HOST`
-    - SSH configs of all users
-
-## Scenario B: Data Disk Failure (sdb/data-disk dies)
-
-**Impact**: Parquet data lost, users unaffected.
-
-**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)
-
-### Option 1: Restore from snapshot (faster)
+### Option 1: Restore from GCP disk snapshot (faster)

 ```bash
 # Find latest snapshot
@@ -169,95 +69,99 @@ gcloud compute disks create data-disk \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

-# Attach to VM (may need to stop VM first)
+# Attach to VM (then mount the disk at /data on the host)
 gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

-# Mount
-ssh kids "sudo mount /dev/sdb /data"
+# Restart containers
+docker compose up -d
 ```

-### Option 2: Regenerate from Keboola
+### Option 2: Regenerate from source

 ```bash
-# Create fresh disk
-gcloud compute disks create data-disk \
-  --project=your-gcp-project \
-  --zone=europe-north1-a \
-  --size=30GB \
-  --type=pd-balanced
+# Start with empty /data volume
+docker compose up -d

-# Attach, format, mount
-ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"
-
-# Run deploy to recreate directory structure
-ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
-
-# Regenerate parquet data from Keboola
-ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
+# Trigger a full sync from the data source
+curl -X POST http://localhost:8000/api/sync/trigger
+# Or via CLI:
+docker compose exec app da sync
 ```

-## Scenario C: Home Disk Failure (sdc/home-disk dies)
+DuckDB extract files and parquet files are repopulated from Keboola / BigQuery.
+`system.duckdb` (table registry, users) is not recreated by sync: restore it
+from a snapshot to recover user accounts and table definitions.

-**Impact**: All user accounts, SSH keys, and personal workspaces lost.
+## Scenario C: Complete VM Loss

-**Recovery time**: ~10 minutes (from snapshot)
+**Recovery time**: ~20 minutes

-### Restore from snapshot
+1. **Create new VM** (or use a managed instance group):
+   ```bash
+   gcloud compute instances create your-server \
+     --project=your-gcp-project \
+     --zone=europe-north1-a \
+     --machine-type=e2-medium \
+     --image-family=debian-12 \
+     --image-project=debian-cloud
+   ```

-```bash
-# Find latest snapshot
-gcloud compute snapshots list --project=your-gcp-project \
-  --filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5
+2. **Install Docker**:
+   ```bash
+   curl -fsSL https://get.docker.com | sh
+   ```

-# Create new disk from snapshot
-gcloud compute disks create home-disk \
-  --project=your-gcp-project \
-  --zone=europe-north1-a \
-  --source-snapshot=SNAPSHOT_NAME \
-  --type=pd-balanced
+3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
+   ```bash
+   gcloud compute instances attach-disk your-server \
+     --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
+   # Add mount to /etc/fstab and mount /data
+   ```

-# Attach to VM
-gcloud compute instances attach-disk your-server \
-  --project=your-gcp-project \
-  --zone=europe-north1-a \
-  --disk=home-disk
+4. **Clone repo and create .env**:
+   ```bash
+   git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
+   cd /opt/data-analyst
+   cp config/.env.template .env
+   # Fill in secrets from GitHub Secrets / 1Password
+   ```

-# Mount
-ssh kids "sudo mount /dev/sdc /home"
-```
+5. **Start the stack**:
+   ```bash
+   docker compose up -d
+   ```

-If no snapshot exists, users must re-register via https://your-instance.example.com.
-
-## Scenario D: Complete Server Loss (VM + all disks)
-
-**Recovery time**: ~45 minutes
-
-1. Follow **Scenario A** steps 1-5 (new VM, prerequisites, deploy user)
-2. Restore `data-disk` from snapshot (Scenario B, Option 1)
-3. Restore `home-disk` from snapshot (Scenario C)
-4. Follow **Scenario A** steps 6-10 (user accounts, deploy, SSL, cron, IP)
+6. **Update DNS** if the external IP changed:
+   - A record for `your-instance.example.com`

 ## Verification Checklist

 After any recovery, verify:

- [ ] `ssh kids` works (admin access)
- [ ] `https://your-instance.example.com` loads (webapp)
- [ ] `https://your-instance.example.com/health` returns OK
- [ ] At least one analyst can SSH in
- [ ] `ls /data/src_data/parquet/` shows data
- [ ] `ls /home/` shows user directories
- [ ] `systemctl status webapp` is active
- [ ] `systemctl status notify-bot` is active
- [ ] `sudo crontab -u deploy -l` shows data sync cron
+- [ ] `docker compose ps` — all services `Up`
+- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
+- [ ] Login works (Google OAuth or email magic link)
+- [ ] At least one table appears in the data catalog
+- [ ] `docker compose logs app` — no ERROR lines at startup

 ## Preventive Measures

- **GCP snapshots**: Daily automatic snapshots of `data-disk` and `home-disk` (14-day retention)
- **Setup script**: `server/setup-snapshot-schedule.sh` configures snapshot policy
- **Limits in git**: `server/limits-users.conf` is version-controlled and deployed automatically
- **All configs in git**: sudoers, nginx, systemd services, management scripts
- **Secrets in GitHub**: `.env` is recreated by deploy.sh from GitHub Actions secrets
+- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
+  (14-day retention). Configure via:
+  ```bash
+  gcloud compute resource-policies create snapshot-schedule daily-backup \
+    --project=your-gcp-project \
+    --region=europe-north1 \
+    --max-retention-days=14 \
+    --on-source-disk-delete=keep-auto-snapshots \
+    --daily-schedule \
+    --start-time=03:00
+  gcloud compute disks add-resource-policies data-disk \
+    --project=your-gcp-project --zone=europe-north1-a \
+    --resource-policies=daily-backup
+  ```
+- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
+- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy
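
Review note (not part of the patch): the `/health` item in the verification checklist lends itself to a scripted check. The following is a minimal Python sketch, assuming the endpoint returns the JSON body `{"status": "ok"}` shown in the checklist. The helper names `is_healthy` and `fetch_health` are hypothetical and do not come from the repository.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True when the body is a JSON object whose status is "ok"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def fetch_health(url: str, timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; network, HTTP, or URL errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8", errors="replace"))
    except (OSError, ValueError):
        return False
```

After a recovery, `fetch_health("https://your-instance.example.com/health")` could be called in a retry loop until it returns `True`. Keeping the body check separate from the network call makes the check testable offline.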
--- a/dev_docs/server.md
+++ b/dev_docs/server.md
--- a/docs/CONFIGURATION.md
+++ b/docs/CONFIGURATION.md
@@ -3,6 +3,7 @@
 ## instance.yaml

 The main configuration file for your AI Data Analyst instance. Located at `config/instance.yaml`.
+See `config/instance.yaml.example` for the full annotated template.

 ### Instance Branding

@@ -17,10 +18,11 @@ instance:

 ```yaml
 auth:
-  allowed_domain: "acme.com"     # Google OAuth domain restriction
+  allowed_domain: "acme.com"     # Email domain restriction for login
 ```

-Only emails from this domain can log in via Google OAuth. External users can be added via password auth (requires SendGrid).
+Only emails from this domain can log in via Google OAuth or email magic link.
+Google OAuth is optional — if not configured, only email magic link auth is available.

 ### Email

@@ -28,9 +30,15 @@ Only emails from this domain can log in via Google OAuth. External users can be
 email:
  from_address: "noreply@acme.com"
  from_name: "Acme Data Analyst"
+  smtp_host: "${SMTP_HOST}"
+  smtp_port: 587
+  smtp_user: "${SMTP_USER}"
+  smtp_password: "${SMTP_PASSWORD}"
 ```

-Used for password auth setup and reset emails. Requires `SENDGRID_API_KEY` in `.env`.
+Used for magic link authentication. Without SMTP configured, magic links are shown
+directly in the browser (development mode). Compatible with any SMTP relay (Gmail,
+Mailgun, SendGrid SMTP, etc.).
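
Review note (not part of the patch): the `${SMTP_HOST}`-style values above imply that placeholders are filled from environment variables. The diff does not show how the config loader does this, so the snippet below is only a sketch of one possible approach using Python's `string.Template`. The function name `expand_placeholders` is hypothetical.

```python
from string import Template


def expand_placeholders(text: str, env: dict[str, str]) -> str:
    """Replace ${NAME} placeholders with values from env.

    safe_substitute leaves unknown placeholders untouched instead of raising,
    so a missing variable stays visible in the resulting config text.
    """
    return Template(text).safe_substitute(env)
```

In practice `env` would typically be `dict(os.environ)`. Leaving unresolved placeholders in place is a design choice here: it makes a missing `SMTP_*` variable easy to spot rather than silently producing an empty string.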

 ### Server

@@ -45,6 +53,7 @@ server:
 ```yaml
 desktop:
  jwt_issuer: "acme-analyst"
+  jwt_secret: "${DESKTOP_JWT_SECRET}"
  url_scheme: "acme-analyst"
 ```

@@ -52,22 +61,18 @@ desktop:

 ```yaml
 data_source:
-  type: "keboola"               # keboola, csv, bigquery
+  type: "keboola"               # keboola, bigquery, local
 ```

 ### Users

 ```yaml
 users:
-  john.doe:
-    name: "John Doe"
-    initials: "JD"
-  jane.smith:
-    name: "Jane Smith"
-    initials: "JS"
+  admin@acme.com:
+    display_name: "John Doe"
+    km_admin: true              # Corporate Memory admin (optional)

-username_mapping:
-  john.doe: john                 # Only if webapp and server names differ
+username_mapping: {}            # Map webapp email -> server username if different
 ```

 ### Datasets
@@ -102,11 +107,15 @@ catalog:

 ## Environment Variables (.env)

+Copy `config/.env.template` to `.env` and fill in values. The template contains
+the full variable list with comments. Never commit `.env`.
+
 ### Required

 | Variable | Description |
 |----------|-------------|
-| `WEBAPP_SECRET_KEY` | Flask session secret |
+| `JWT_SECRET_KEY` | FastAPI JWT secret (generate with Python's `secrets.token_hex(32)`) |
+| `SESSION_SECRET` | Session cookie secret (generate with Python's `secrets.token_hex(32)`) |
 | `GOOGLE_CLIENT_ID` | Google OAuth client ID |
 | `GOOGLE_CLIENT_SECRET` | Google OAuth client secret |
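
Review note (not part of the patch): the table suggests `secrets.token_hex(32)` for both secrets. A minimal Python sketch of generating them is shown here. The helper names `generate_secret` and `env_lines` are illustrative only, and the output format assumes plain `KEY=VALUE` lines in `.env`.

```python
import secrets


def generate_secret(num_bytes: int = 32) -> str:
    """Return a random hex string; 32 bytes yields 64 hex characters."""
    return secrets.token_hex(num_bytes)


def env_lines() -> list[str]:
    """Build .env lines with an independent random value for each secret."""
    names = ["JWT_SECRET_KEY", "SESSION_SECRET"]
    return [f"{name}={generate_secret()}" for name in names]


if __name__ == "__main__":
    print("\n".join(env_lines()))
```

Each variable gets its own value so that leaking one secret does not expose the other.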

@@ -116,16 +125,29 @@ catalog:
 |----------|-------------|
 | `KEBOOLA_STORAGE_TOKEN` | Keboola Storage API token |
 | `KEBOOLA_STACK_URL` | Keboola stack URL |
-| `KEBOOLA_PROJECT_ID` | Keboola project ID |
-| `DATA_DIR` | Data directory path |
+| `DATA_DIR` | Data directory path (default: `/data` in Docker, `./data` locally) |
+
+### Data Source (BigQuery)
+
+| Variable | Description |
+|----------|-------------|
+| `BIGQUERY_PROJECT` | GCP project for job execution/billing |
+| `BIGQUERY_LOCATION` | BigQuery location (e.g., `US`, `us-central1`) |

 ### Optional

 | Variable | Description |
 |----------|-------------|
-| `SENDGRID_API_KEY` | For password auth emails |
+| `SMTP_HOST` | SMTP relay host for magic link emails |
+| `SMTP_PORT` | SMTP port (587 for STARTTLS, 465 for SSL) |
+| `SMTP_USER` | SMTP username |
+| `SMTP_PASSWORD` | SMTP password |
 | `TELEGRAM_BOT_TOKEN` | For Telegram notifications |
-| `ANTHROPIC_API_KEY` | For Corporate Memory AI |
+| `ANTHROPIC_API_KEY` | For Corporate Memory AI (direct Anthropic) |
 | `LLM_API_KEY` | API key for LLM proxy (LiteLLM, OpenRouter, etc.) |
-| `JIRA_WEBHOOK_SECRET` | For Jira integration |
+| `JIRA_WEBHOOK_SECRET` | For Jira webhook integration |
+| `JIRA_API_TOKEN` | For Jira REST API access |
+| `DESKTOP_JWT_SECRET` | Separate secret for desktop app tokens |
 | `CONFIG_DIR` | Override config directory path |
+| `LOG_LEVEL` | Logging level: `debug`, `info`, `warning`, `error` |
+| `DOMAIN` | Public hostname for Caddy TLS (production profile) |
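
Review note (not part of the patch): since `.env` is recreated by hand after a rebuild, a quick check for missing variables may help. The sketch below uses a deliberately simplified parser (plain `KEY=VALUE` lines only, no `export` prefix or multi-line values). The function names and the default `required` tuple are assumptions. The Google OAuth variables are left out of the default because the auth section describes Google OAuth as optional, so extend the tuple as needed.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="value" and KEY=value are treated alike.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(text: str, required=("JWT_SECRET_KEY", "SESSION_SECRET")) -> list[str]:
    """Return the required names that are absent or empty in the .env text."""
    env = parse_env(text)
    return [name for name in required if not env.get(name)]
```

For example, `missing_vars(open(".env").read())` returns an empty list when both secrets are set.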