agnes-the-ai-analyst/dev_docs/disaster-recovery.md
Petr 485ac0a742 Security fixes: sanitize dev_docs, harden sudoers and config validation
H1 - Sanitize dev_docs/ for public release:
  - Replace all real employee names with generic placeholders
    (padak->admin1, matejkys->admin2, dasa->admin3, petr->john, etc.)
  - Replace GCP project ID (kids-ai-data-analysis -> your-gcp-project)
  - Replace server hostname (data-broker-for-claude -> your-server)
  - Replace real IP address (34.88.8.46 -> YOUR_SERVER_IP)
  - Replace internal FQDN with placeholder
  - Covers: security.md, server.md, disaster-recovery.md, desktop-app.md,
    session_explore.md, plan-rsync-fix.md, draft/*.md

H3 - webapp-setup.sh: validate sudoers syntax BEFORE copying to /etc/sudoers.d
  - Prevents broken sudo if syntax is invalid
  - Uses install -m 440 for atomic copy with correct permissions

M1 - setup.sh: deploy user created with /usr/sbin/nologin instead of /bin/bash
  - CI/CD service account does not need interactive shell

M2 - config/loader.py: warn on missing env vars, validate webapp_secret_key
  - _resolve_env_refs now logs warnings for unset ${ENV_VAR} references
  - _validate_config checks auth.webapp_secret_key is non-empty
  - Prevents Flask signing sessions with empty secret key

All 118 tests pass.
2026-03-09 08:06:45 +01:00

8.5 KiB

Disaster Recovery

Recovery procedures for the Data Broker Server (your-server).

Overview

Disk Layout:
  sda (10 GB) /         System disk (instance) - EXPENDABLE
  sdb (30 GB) /data     Data disk - SNAPSHOTTED daily
  sdc (30 GB) /home     Home disk - SNAPSHOTTED daily

Key principle: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.

What Lives Where

Location Content Recovery Method
sda: /opt/data-analyst/repo/ Application code git clone from GitHub
sda: /opt/data-analyst/.venv/ Python packages pip install -r requirements.txt
sda: /opt/data-analyst/.env Application secrets deploy.sh creates from GitHub secrets
sda: /etc/sudoers.d/ Permissions deploy.sh copies from repo
sda: /etc/security/limits.d/ Resource limits deploy.sh copies from repo
sda: /etc/nginx/ Nginx config deploy.sh or manual copy from repo
sda: /etc/letsencrypt/ SSL certificate certbot renews automatically
sdb: /data/src_data/parquet/ Parquet data Regenerate from Keboola (update.sh) or restore snapshot
sdb: /data/notifications/ Notification state Restore from snapshot
sdb: /data/docs/, /data/scripts/ Docs & scripts deploy.sh copies from repo
sdc: /home/*/ User accounts, SSH keys, workspaces, scripts Restore from snapshot

Scenario A: System Disk Failure (sda dies)

Impact: Server is down, but all user data is safe on sdb/sdc.

Recovery time: ~30 minutes

Steps

  1. Create new VM (same zone, attach existing disks):

    # Create new instance with existing disks
    gcloud compute instances create your-server \
      --project=your-gcp-project \
      --zone=europe-north1-a \
      --machine-type=e2-medium \
      --image-family=debian-12 \
      --image-project=debian-cloud \
      --boot-disk-size=10GB \
      --tags=http-server,https-server
    
    # Attach existing data disks
    gcloud compute instances attach-disk your-server \
      --project=your-gcp-project \
      --zone=europe-north1-a \
      --disk=data-disk
    
    gcloud compute instances attach-disk your-server \
      --project=your-gcp-project \
      --zone=europe-north1-a \
      --disk=home-disk
    
  2. SSH in and mount disks:

    # Mount data disk
    mkdir -p /data
    mount /dev/sdb /data
    
    # Mount home disk
    mount /dev/sdc /home
    
    # Add to fstab (get UUIDs with blkid)
    echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
    echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
    
  3. Install prerequisites:

    apt-get update
    apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
    
  4. Recreate deploy user and groups:

    # Create groups
    groupadd dataread
    groupadd data-private
    groupadd data-ops
    
    # Create deploy user
    useradd -m -s /bin/bash deploy
    usermod -aG data-ops deploy
    
    # Restore deploy SSH key (generate new one)
    sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
    sudo -u deploy bash -c 'echo -e "Host github.com\n  IdentityFile ~/.ssh/id_ed25519\n  StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
    chmod 600 /home/deploy/.ssh/config
    
    # Add new public key to GitHub as Deploy Key
    cat /home/deploy/.ssh/id_ed25519.pub
    
  5. Clone repo and run setup:

    mkdir -p /opt/data-analyst
    chown deploy:data-ops /opt/data-analyst
    sudo -u deploy git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst/repo
    git config --global --add safe.directory /opt/data-analyst/repo
    /opt/data-analyst/repo/server/setup.sh
    
  6. Restore user accounts from /home:

    # Users already exist on home-disk, just recreate /etc/passwd entries
    # For each directory in /home (except deploy):
    for dir in /home/*/; do
      username=$(basename "$dir")
      [[ "$username" == "deploy" ]] && continue
      # Create user if not exists
      if ! id "$username" &>/dev/null; then
        useradd -M -d "/home/$username" -s /bin/bash "$username"
        usermod -aG dataread "$username"
      fi
    done
    

    Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in server/limits-users.conf for admin users.

  7. Trigger deploy via GitHub Actions (or manually):

    sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
    
  8. Set up SSL certificate:

    certbot --nginx -d your-instance.example.com
    
  9. Restore crontab:

    sudo -u deploy crontab -e
    # Add:
    # MAILTO=admin@your-domain.com
    # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
    
  10. Update external IP if it changed:

    • DNS: your-instance.example.com A record
    • GitHub secrets: SERVER_HOST
    • SSH configs of all users

Scenario B: Data Disk Failure (sdb/data-disk dies)

Impact: Parquet data lost, users unaffected.

Recovery time: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)

Option 1: Restore from snapshot (faster)

# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM (may need to stop VM first)
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

# Mount
ssh kids "sudo mount /dev/sdb /data"

Option 2: Regenerate from Keboola

# Create fresh disk
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --size=30GB \
  --type=pd-balanced

# Attach, format, mount
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"

# Run deploy to recreate directory structure
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"

# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"

Scenario C: Home Disk Failure (sdc/home-disk dies)

Impact: All user accounts, SSH keys, and personal workspaces lost.

Recovery time: ~10 minutes (from snapshot)

Restore from snapshot

# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create home-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=home-disk

# Mount
ssh kids "sudo mount /dev/sdc /home"

If no snapshot exists, users must re-register via https://your-instance.example.com.

Scenario D: Complete Server Loss (VM + all disks)

Recovery time: ~45 minutes

  1. Follow Scenario A steps 1-5 (new VM, prerequisites, deploy user)
  2. Restore data-disk from snapshot (Scenario B, Option 1)
  3. Restore home-disk from snapshot (Scenario C)
  4. Follow Scenario A steps 6-10 (user accounts, deploy, SSL, cron, IP)

Verification Checklist

After any recovery, verify:

  • ssh kids works (admin access)
  • https://your-instance.example.com loads (webapp)
  • https://your-instance.example.com/health returns OK
  • At least one analyst can SSH in
  • ls /data/src_data/parquet/ shows data
  • ls /home/ shows user directories
  • systemctl status webapp is active
  • systemctl status notify-bot is active
  • sudo crontab -u deploy -l shows data sync cron

Preventive Measures

  • GCP snapshots: Daily automatic snapshots of data-disk and home-disk (14-day retention)
  • Setup script: server/setup-snapshot-schedule.sh configures snapshot policy
  • Limits in git: server/limits-users.conf is version-controlled and deployed automatically
  • All configs in git: sudoers, nginx, systemd services, management scripts
  • Secrets in GitHub: .env is recreated by deploy.sh from GitHub Actions secrets