agnes-the-ai-analyst/dev_docs/disaster-recovery.md

Disaster Recovery

Recovery procedures for the Data Broker Server (data-broker-for-claude).

Overview

Disk Layout:
  sda (10 GB) /         System disk (instance) - EXPENDABLE
  sdb (30 GB) /data     Data disk - SNAPSHOTTED daily
  sdc (30 GB) /home     Home disk - SNAPSHOTTED daily

Key principle: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.

What Lives Where

| Location | Content | Recovery Method |
|----------|---------|-----------------|
| sda: /opt/data-analyst/repo/ | Application code | git clone from GitHub |
| sda: /opt/data-analyst/.venv/ | Python packages | pip install -r requirements.txt |
| sda: /opt/data-analyst/.env | Application secrets | deploy.sh creates from GitHub secrets |
| sda: /etc/sudoers.d/ | Permissions | deploy.sh copies from repo |
| sda: /etc/security/limits.d/ | Resource limits | deploy.sh copies from repo |
| sda: /etc/nginx/ | Nginx config | deploy.sh or manual copy from repo |
| sda: /etc/letsencrypt/ | SSL certificate | Re-issue with certbot (Scenario A, step 8); renewals are automatic afterwards |
| sdb: /data/src_data/parquet/ | Parquet data | Regenerate from Keboola (update.sh) or restore snapshot |
| sdb: /data/notifications/ | Notification state | Restore from snapshot |
| sdb: /data/docs/, /data/scripts/ | Docs & scripts | deploy.sh copies from repo |
| sdc: /home/*/ | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
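
The table reduces to a prefix rule: anything under /data is on sdb, anything under /home is on sdc, and everything else sits on the expendable system disk. A minimal sketch of that rule as a shell helper, assuming the mount points listed above; the function name `disk_for_path` is hypothetical and not part of the repo:

```shell
# disk_for_path PATH
# Prints the disk a path lives on and whether that disk is snapshotted,
# following the "What Lives Where" table (assumed mount points: /data, /home).
disk_for_path() {
  case "$1" in
    /data|/data/*) echo "sdb snapshotted" ;;
    /home|/home/*) echo "sdc snapshotted" ;;
    *)             echo "sda expendable" ;;
  esac
}
```

For example, `disk_for_path /etc/nginx/nginx.conf` reports the system disk, which tells you the file has to come back from git or a reinstall rather than from a snapshot.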

Scenario A: System Disk Failure (sda dies)

Impact: Server is down, but all user data is safe on sdb/sdc.

Recovery time: ~30 minutes

Steps

  1. Create new VM (same zone, attach existing disks):

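    # Create the new instance. The old instance must be gone first; if it still
    # exists, delete it with --keep-disks=data so data-disk and home-disk survive.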
    gcloud compute instances create data-broker-for-claude \
      --project=kids-ai-data-analysis \
      --zone=europe-north1-a \
      --machine-type=e2-medium \
      --image-family=debian-12 \
      --image-project=debian-cloud \
      --boot-disk-size=10GB \
      --tags=http-server,https-server
    
    # Attach existing data disks
    gcloud compute instances attach-disk data-broker-for-claude \
      --project=kids-ai-data-analysis \
      --zone=europe-north1-a \
      --disk=data-disk
    
    gcloud compute instances attach-disk data-broker-for-claude \
      --project=kids-ai-data-analysis \
      --zone=europe-north1-a \
      --disk=home-disk
    
  2. SSH in and mount disks:

    # Confirm device names first (sdb/sdc follow the attach order in step 1)
    lsblk
    
    # Mount data disk
    mkdir -p /data
    mount /dev/sdb /data
    
    # Mount home disk
    mount /dev/sdc /home
    
    # Add to fstab (get UUIDs with blkid)
    echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
    echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
    
  3. Install prerequisites:

    apt-get update
    apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
    
  4. Recreate deploy user and groups:

    # Create groups
    groupadd dataread
    groupadd data-private
    groupadd data-ops
    
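    # Create deploy user (useradd only warns if /home/deploy already exists on home-disk)
    useradd -m -s /bin/bash deploy
    usermod -aG data-ops deploy
    # Files on home-disk keep their old numeric owner; realign them with the new account
    chown -R deploy: /home/deploy
    
    # Reuse the deploy SSH key if home-disk preserved it; otherwise generate a new one
    [ -f /home/deploy/.ssh/id_ed25519 ] || sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
    sudo -u deploy bash -c 'echo -e "Host github.com\n  IdentityFile ~/.ssh/id_ed25519\n  StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
    chmod 600 /home/deploy/.ssh/config
    
    # If a new key was generated, add its public key to GitHub as a Deploy Key
    cat /home/deploy/.ssh/id_ed25519.pub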
    
  5. Clone repo and run setup:

    mkdir -p /opt/data-analyst
    chown deploy:data-ops /opt/data-analyst
    sudo -u deploy git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst/repo
    git config --global --add safe.directory /opt/data-analyst/repo
    /opt/data-analyst/repo/server/setup.sh
    
  6. Restore user accounts from /home:

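    # Home directories survive on home-disk; only the /etc/passwd entries are gone.
    # Recreate each user with the UID that owns its directory so file ownership
    # still matches (assumes each directory is owned by its original user).
    for dir in /home/*/; do
      username=$(basename "$dir")
      # Skip the deploy user (step 4) and the ext4 lost+found directory
      [[ "$username" == "deploy" || "$username" == "lost+found" ]] && continue
      # Create user if not exists; a UID collision makes useradd fail and needs manual resolution
      if ! id "$username" &>/dev/null; then
        useradd -M -u "$(stat -c %u "$dir")" -d "/home/$username" -s /bin/bash "$username" \
          && usermod -aG dataread "$username"
      fi
    done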
    

    Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in server/limits-users.conf for admin users.

  7. Trigger deploy via GitHub Actions (or manually):

    sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
    
  8. Set up SSL certificate:

    certbot --nginx -d your-instance.example.com
    
  9. Restore crontab:

    sudo -u deploy crontab -e
    # Add:
    # MAILTO=admin@your-domain.com
    # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
    
  10. Update external IP if it changed:

    • DNS: your-instance.example.com A record
    • GitHub secrets: SERVER_HOST
    • SSH configs of all users
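
Step 2 appends to /etc/fstab unconditionally, so repeating it during a retried recovery would leave duplicate entries. A minimal sketch of an idempotent variant, assuming the same ext4 mount options as step 2; `ensure_fstab_entry` is a hypothetical helper, not something shipped in the repo:

```shell
# ensure_fstab_entry FSTAB UUID MOUNTPOINT
# Appends an ext4 entry for MOUNTPOINT unless a non-comment line in FSTAB
# already uses that mount point, so the step can be re-run safely.
ensure_fstab_entry() {
  fstab=$1 uuid=$2 mountpoint=$3
  # The second whitespace-separated field of an fstab line is the mount point.
  if awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
    echo "fstab: $mountpoint already present, skipping"
  else
    echo "UUID=$uuid $mountpoint ext4 discard,defaults,nofail 0 2" >> "$fstab"
    echo "fstab: added $mountpoint"
  fi
}

# Example (as root, after step 2 has mounted the disks):
#   ensure_fstab_entry /etc/fstab "$(blkid -s UUID -o value /dev/sdb)" /data
#   ensure_fstab_entry /etc/fstab "$(blkid -s UUID -o value /dev/sdc)" /home
```

Taking the fstab path as an argument keeps the helper easy to try against a scratch file before pointing it at /etc/fstab.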

Scenario B: Data Disk Failure (sdb/data-disk dies)

Impact: Parquet data lost, users unaffected.

Recovery time: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)

Option 1: Restore from snapshot (faster)

# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

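# Create new disk from snapshot (disk names are unique per zone: detach and
# delete the failed data-disk first, or this command fails)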
gcloud compute disks create data-disk \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM (may need to stop VM first)
gcloud compute instances attach-disk data-broker-for-claude \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --disk=data-disk

# Mount
ssh kids "sudo mount /dev/sdb /data"
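
To avoid copying SNAPSHOT_NAME by hand, the newest snapshot can be picked in a script. A sketch, assuming the input is two whitespace-separated columns (name, creation timestamp) and that all timestamps share one UTC offset so they compare correctly as strings; `latest_snapshot` is a hypothetical helper:

```shell
# latest_snapshot
# Reads "NAME TIMESTAMP" lines on stdin and prints the NAME with the greatest
# TIMESTAMP. Assumes RFC 3339 timestamps with a common UTC offset, which sort
# correctly as plain strings.
latest_snapshot() {
  LC_ALL=C sort -k2,2r | head -n 1 | awk '{ print $1 }'
}

# Example (requires gcloud; not run here):
#   gcloud compute snapshots list --project=kids-ai-data-analysis \
#     --filter="sourceDisk:data-disk" \
#     --format="value(name,creationTimestamp)" | latest_snapshot
```

The listing command above already sorts newest-first, so reading its first row works just as well; the helper is mainly useful when the list comes from a saved file or another tool.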

Option 2: Regenerate from Keboola

# Create fresh disk
gcloud compute disks create data-disk \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --size=30GB \
  --type=pd-balanced

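# Attach to VM
gcloud compute instances attach-disk data-broker-for-claude \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --disk=data-disk

# Format and mount (mkfs assigns a new filesystem UUID, so update the /data line in /etc/fstab)
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"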

# Run deploy to recreate directory structure
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"

# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
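
Option 2 formats /dev/sdb, and device letters depend on attach order, so formatting the wrong device would destroy good data. A minimal guard, assuming blkid is run as root on the server; `has_filesystem`, `format_if_empty`, and the `BLKID` override are hypothetical names introduced for this sketch:

```shell
# BLKID can be overridden (for example in tests); it defaults to the real blkid.
BLKID=${BLKID:-blkid}

# has_filesystem DEVICE
# Succeeds if blkid reports a filesystem type on DEVICE.
has_filesystem() {
  [ -n "$("$BLKID" -s TYPE -o value "$1" 2>/dev/null)" ]
}

# format_if_empty DEVICE
# Prints the mkfs command for an empty device and refuses otherwise.
# It only prints; run the printed command as root once it looks right.
format_if_empty() {
  if has_filesystem "$1"; then
    echo "refusing: $1 already contains a filesystem" >&2
    return 1
  fi
  echo "mkfs.ext4 $1"
}
```

On a freshly created blank disk, `format_if_empty /dev/sdb` prints the mkfs command; on a disk restored from a snapshot it refuses, which is the behaviour you want if the device letters turned out differently than expected.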

Scenario C: Home Disk Failure (sdc/home-disk dies)

Impact: All user accounts, SSH keys, and personal workspaces lost.

Recovery time: ~10 minutes (from snapshot)

Restore from snapshot

# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
  --filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create home-disk \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM
gcloud compute instances attach-disk data-broker-for-claude \
  --project=kids-ai-data-analysis \
  --zone=europe-north1-a \
  --disk=home-disk

# Mount
ssh kids "sudo mount /dev/sdc /home"

If no snapshot exists, users must re-register via https://your-instance.example.com.
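
The snapshot restores in Scenarios B and C differ only in the disk name. A sketch that prints both commands for either disk, using the project, zone, and instance names from this document; `restore_disk_cmds` is a hypothetical helper and deliberately a dry run:

```shell
# restore_disk_cmds DISK SNAPSHOT
# Prints (does not run) the create and attach commands for DISK,
# restored from SNAPSHOT.
restore_disk_cmds() {
  disk=$1 snapshot=$2
  project=kids-ai-data-analysis
  zone=europe-north1-a
  instance=data-broker-for-claude
  echo "gcloud compute disks create $disk --project=$project --zone=$zone --source-snapshot=$snapshot --type=pd-balanced"
  echo "gcloud compute instances attach-disk $instance --project=$project --zone=$zone --disk=$disk"
}

# Example: review the output, then run it.
#   restore_disk_cmds home-disk SNAPSHOT_NAME
```

Printing instead of executing keeps the snapshot choice visible for review before anything is created.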

Scenario D: Complete Server Loss (VM + all disks)

Recovery time: ~45 minutes

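  1. Follow Scenario A step 1, but create the VM only; skip the attach-disk commands because the original disks no longer exist
  2. Restore data-disk from snapshot (Scenario B, Option 1: create and attach; mounting happens in step 4)
  3. Restore home-disk from snapshot (Scenario C: create and attach; mounting happens in step 4)
  4. Follow Scenario A steps 2-10 (mount, prerequisites, deploy user, user accounts, deploy, SSL, cron, IP)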

Verification Checklist

After any recovery, verify:

  • ssh kids works (admin access)
  • https://your-instance.example.com loads (webapp)
  • https://your-instance.example.com/health returns OK
  • At least one analyst can SSH in
  • ls /data/src_data/parquet/ shows data
  • ls /home/ shows user directories
  • systemctl status webapp is active
  • systemctl status notify-bot is active
  • sudo crontab -u deploy -l shows data sync cron
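
The server-side items in this checklist can be scripted so a recovery ends with one pass/fail summary. A minimal sketch, assuming it runs on the server as an admin with sudo; `check` and `run_recovery_checks` are hypothetical names, and the health check only tests for a successful HTTP status rather than the response body:

```shell
# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly, prints OK or FAIL with the label, and counts failures.
failures=0
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK    $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# run_recovery_checks
# Server-side items from the checklist. Defined but not invoked here;
# call it on the server after a recovery. Succeeds only if nothing failed.
run_recovery_checks() {
  check "health endpoint"  curl -fsS https://your-instance.example.com/health
  check "parquet data"     sh -c 'ls /data/src_data/parquet/ | grep -q .'
  check "home directories" sh -c 'ls /home/ | grep -q .'
  check "webapp service"   systemctl is-active --quiet webapp
  check "notify-bot"       systemctl is-active --quiet notify-bot
  check "deploy crontab"   sudo crontab -u deploy -l
  [ "$failures" -eq 0 ]
}
```

The SSH checks (admin access and at least one analyst login) still need to be done by hand from a client machine.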

Preventive Measures

  • GCP snapshots: Daily automatic snapshots of data-disk and home-disk (14-day retention)
  • Setup script: server/setup-snapshot-schedule.sh configures snapshot policy
  • Limits in git: server/limits-users.conf is version-controlled and deployed automatically
  • All configs in git: sudoers, nginx, systemd services, management scripts
  • Secrets in GitHub: .env is recreated by deploy.sh from GitHub Actions secrets
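
The contents of server/setup-snapshot-schedule.sh are not shown in this document. As a rough sketch of what a daily, 14-day-retention policy can look like with gcloud, the helper below prints the commands rather than running them; the policy name and the 03:00 UTC start time are illustrative assumptions, and `snapshot_schedule_cmds` is a hypothetical name:

```shell
# snapshot_schedule_cmds POLICY DISK...
# Prints (does not run) gcloud commands that create a daily snapshot schedule
# with 14-day retention and attach it to each DISK.
# POLICY and the start time are assumptions, not values from the repo.
snapshot_schedule_cmds() {
  policy=$1
  shift
  project=kids-ai-data-analysis
  region=europe-north1
  zone=europe-north1-a
  echo "gcloud compute resource-policies create snapshot-schedule $policy --project=$project --region=$region --daily-schedule --start-time=03:00 --max-retention-days=14"
  for disk in "$@"; do
    echo "gcloud compute disks add-resource-policies $disk --project=$project --zone=$zone --resource-policies=$policy"
  done
}

# Example:
#   snapshot_schedule_cmds daily-14d data-disk home-disk
```

Compare the printed commands with the real script before relying on either; the script in the repo is the source of truth for the deployed policy.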