Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.
Disaster Recovery
Recovery procedures for the Data Broker Server (data-broker-for-claude).
Overview
Disk Layout:
sda (10 GB) / System disk (instance) - EXPENDABLE
sdb (30 GB) /data Data disk - SNAPSHOTTED daily
sdc (30 GB) /home Home disk - SNAPSHOTTED daily
Key principle: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.
What Lives Where
| Location | Content | Recovery Method |
|---|---|---|
| sda: /opt/data-analyst/repo/ | Application code | git clone from GitHub |
| sda: /opt/data-analyst/.venv/ | Python packages | pip install -r requirements.txt |
| sda: /opt/data-analyst/.env | Application secrets | deploy.sh creates from GitHub secrets |
| sda: /etc/sudoers.d/ | Permissions | deploy.sh copies from repo |
| sda: /etc/security/limits.d/ | Resource limits | deploy.sh copies from repo |
| sda: /etc/nginx/ | Nginx config | deploy.sh or manual copy from repo |
| sda: /etc/letsencrypt/ | SSL certificate | Reissue with certbot --nginx; certbot renews automatically afterwards |
| sdb: /data/src_data/parquet/ | Parquet data | Regenerate from Keboola (update.sh) or restore snapshot |
| sdb: /data/notifications/ | Notification state | Restore from snapshot |
| sdb: /data/docs/, /data/scripts/ | Docs & scripts | deploy.sh copies from repo |
| sdc: /home/*/ | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
Scenario A: System Disk Failure (sda dies)
Impact: Server is down, but all user data is safe on sdb/sdc.
Recovery time: ~30 minutes
Steps
1. Create new VM (same zone, attach existing disks):
   # Create new instance with existing disks
   gcloud compute instances create data-broker-for-claude \
     --project=kids-ai-data-analysis \
     --zone=europe-north1-a \
     --machine-type=e2-medium \
     --image-family=debian-12 \
     --image-project=debian-cloud \
     --boot-disk-size=10GB \
     --tags=http-server,https-server
   # Attach existing data disks
   gcloud compute instances attach-disk data-broker-for-claude \
     --project=kids-ai-data-analysis \
     --zone=europe-north1-a \
     --disk=data-disk
   gcloud compute instances attach-disk data-broker-for-claude \
     --project=kids-ai-data-analysis \
     --zone=europe-north1-a \
     --disk=home-disk
2. SSH in and mount disks:
   # Mount data disk
   mkdir -p /data
   mount /dev/sdb /data
   # Mount home disk
   mount /dev/sdc /home
   # Add to fstab (get UUIDs with blkid)
   echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
   echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab
3. Install prerequisites:
   apt-get update
   apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx
4. Recreate deploy user and groups:
   # Create groups
   groupadd dataread
   groupadd data-private
   groupadd data-ops
   # Create deploy user
   useradd -m -s /bin/bash deploy
   usermod -aG data-ops deploy
   # Restore deploy SSH key (generate new one)
   sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
   sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
   chmod 600 /home/deploy/.ssh/config
   # Add new public key to GitHub as Deploy Key
   cat /home/deploy/.ssh/id_ed25519.pub
5. Clone repo and run setup:
   mkdir -p /opt/data-analyst
   chown deploy:data-ops /opt/data-analyst
   sudo -u deploy git clone git@github.com:keboola/internal_ai_data_analyst.git /opt/data-analyst/repo
   git config --global --add safe.directory /opt/data-analyst/repo
   /opt/data-analyst/repo/server/setup.sh
6. Restore user accounts from /home:
   # Users already exist on home-disk, just recreate /etc/passwd entries
   # For each directory in /home (except deploy):
   for dir in /home/*/; do
     username=$(basename "$dir")
     [[ "$username" == "deploy" ]] && continue
     # Create user if not exists
     if ! id "$username" &>/dev/null; then
       useradd -M -d "/home/$username" -s /bin/bash "$username"
       usermod -aG dataread "$username"
     fi
   done
   Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in server/limits-users.conf for admin users.
7. Trigger deploy via GitHub Actions (or manually):
   sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'
8. Set up SSL certificate:
   certbot --nginx -d your-instance.example.com
9. Restore crontab:
   sudo -u deploy crontab -e
   # Add:
   # MAILTO=admin@your-domain.com
   # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log
10. Update external IP if it changed:
   - DNS: your-instance.example.com A record
   - GitHub secrets: SERVER_HOST
   - SSH configs of all users
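The loop that recreates /etc/passwd entries lets useradd pick fresh UIDs, which may not match the numeric owners already recorded on the restored home-disk. The following is a minimal sketch of one way to keep them aligned. It assumes GNU stat and that each home directory is still owned by its user. The function name plan_user_restore is hypothetical and not part of the repo. It only prints commands so they can be reviewed before running them as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): print the useradd commands that
# would recreate each user with the UID that already owns their home directory.
# Nothing is executed here; review the output first.
plan_user_restore() {
  local home_root="${1:-/home}" dir username uid
  home_root="${home_root%/}"
  for dir in "$home_root"/*/; do
    [ -d "$dir" ] || continue                       # glob matched nothing
    username=$(basename "$dir")
    [ "$username" = "deploy" ] && continue          # recreated separately
    [ "$username" = "lost+found" ] && continue      # ext4 housekeeping directory
    id "$username" >/dev/null 2>&1 && continue      # account already exists
    uid=$(stat -c '%u' "$dir")                      # numeric owner stored on disk
    printf 'useradd -M -u %s -d %s -s /bin/bash -G dataread %s\n' \
      "$uid" "$home_root/$username" "$username"
  done
}

# Example (as root, after reviewing the printed commands):
#   plan_user_restore /home
#   plan_user_restore /home | bash
```

Only the UID is preserved in this sketch. Group ownership and the memberships mentioned in the note above still need manual review.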
Scenario B: Data Disk Failure (sdb/data-disk dies)
Impact: Parquet data lost, users unaffected.
Recovery time: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)
Option 1: Restore from snapshot (faster)
# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
--filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5
# Create new disk from snapshot
gcloud compute disks create data-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM (may need to stop VM first)
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=data-disk
# Mount
ssh kids "sudo mount /dev/sdb /data"
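To avoid copying SNAPSHOT_NAME by hand, the listing command can be narrowed to a single name. This is a sketch only: latest_snapshot is a hypothetical helper, and the PROJECT variable is an assumption that defaults to the project ID used in this document.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the newest snapshot of one disk.
# PROJECT is an assumed override; it defaults to the project used in this runbook.
latest_snapshot() {
  local disk="$1" project="${PROJECT:-kids-ai-data-analysis}"
  gcloud compute snapshots list \
    --project="$project" \
    --filter="sourceDisk:$disk" \
    --sort-by="~creationTimestamp" \
    --limit=1 \
    --format="value(name)"
}

# Example:
#   SNAPSHOT_NAME=$(latest_snapshot data-disk)
#   [ -n "$SNAPSHOT_NAME" ] || echo "no snapshot found for data-disk" >&2
```

The same helper works for home-disk. An empty result means no snapshot matched the filter, so check for that before creating a disk from it.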
Option 2: Regenerate from Keboola
# Create fresh disk
gcloud compute disks create data-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--size=30GB \
--type=pd-balanced
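# Attach, format, mount
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=data-disk
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"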
# Run deploy to recreate directory structure
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
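A disk restored from a snapshot keeps its filesystem UUID, but mkfs.ext4 on a fresh disk generates a new one, so a UUID-based /data entry in /etc/fstab no longer matches after Option 2. Below is a sketch of one way to update that entry. The helper name set_fstab_uuid is hypothetical, and it assumes the entry uses the UUID= form with the mount point in the second field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace the first field of the fstab entry for one mount
# point with UUID=<new uuid>. Comment lines and other entries are left alone.
# Usage: set_fstab_uuid FSTAB_FILE MOUNT_POINT NEW_UUID
set_fstab_uuid() {
  local fstab="$1" mount_point="$2" new_uuid="$3" tmp status
  tmp=$(mktemp) || return 1
  awk -v mp="$mount_point" -v uuid="$new_uuid" '
    $0 !~ /^[[:space:]]*#/ && $2 == mp { $1 = "UUID=" uuid }
    { print }
  ' "$fstab" > "$tmp" && cat "$tmp" > "$fstab"   # cat keeps the file's owner and mode
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Example (as root on the server, after formatting and mounting /dev/sdb):
#   cp /etc/fstab /etc/fstab.bak
#   set_fstab_uuid /etc/fstab /data "$(blkid -s UUID -o value /dev/sdb)"
```

Rewriting the first field makes awk rejoin the line with single spaces, which fstab accepts.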
Scenario C: Home Disk Failure (sdc/home-disk dies)
Impact: All user accounts, SSH keys, and personal workspaces lost.
Recovery time: ~10 minutes (from snapshot)
Restore from snapshot
# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
--filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5
# Create new disk from snapshot
gcloud compute disks create home-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=home-disk
# Mount
ssh kids "sudo mount /dev/sdc /home"
If no snapshot exists, users must re-register via https://your-instance.example.com.
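Because snapshots are daily, an account created after the last snapshot still has its /etc/passwd entry on sda but no home directory on the restored disk. The sketch below lists such accounts. The function name missing_homes is hypothetical, and it assumes analyst homes live directly under /home.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "user<TAB>home" for every passwd entry whose home
# directory is under HOME_ROOT but does not exist on disk.
# Usage: missing_homes [PASSWD_FILE] [HOME_ROOT]
missing_homes() {
  local passwd_file="${1:-/etc/passwd}" home_root="${2:-/home}" user home
  home_root="${home_root%/}"
  while IFS=: read -r user _ _ _ _ home _; do
    case "$home" in
      "$home_root"/*)
        [ -d "$home" ] || printf '%s\t%s\n' "$user" "$home"
        ;;
    esac
  done < "$passwd_file"
}

# Example (on the server, after mounting the restored disk):
#   missing_homes /etc/passwd /home
```

Any user it reports lost their workspace and SSH keys with the failed disk and would need their home directory recreated.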
Scenario D: Complete Server Loss (VM + all disks)
Recovery time: ~45 minutes
- Follow Scenario A steps 1-5 (new VM, prerequisites, deploy user)
- Restore
data-diskfrom snapshot (Scenario B, Option 1) - Restore
home-diskfrom snapshot (Scenario C) - Follow Scenario A steps 6-10 (user accounts, deploy, SSL, cron, IP)
Verification Checklist
After any recovery, verify:
- ssh kids works (admin access)
- https://your-instance.example.com loads (webapp)
- https://your-instance.example.com/health returns OK
- At least one analyst can SSH in
- ls /data/src_data/parquet/ shows data
- ls /home/ shows user directories
- systemctl status webapp is active
- systemctl status notify-bot is active
- sudo crontab -u deploy -l shows data sync cron
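Most of the checklist can be scripted. The following is a rough sketch rather than an existing repo script. The names run_check and verify_recovery are hypothetical, HOST is an assumed variable, and the kids SSH alias is the one used elsewhere in this document. The health check only tests for a successful HTTP status and does not inspect the response body. Whether an analyst can log in still has to be confirmed by hand.

```shell
#!/usr/bin/env bash
# Hypothetical checklist runner; HOST is an assumed variable for the public hostname.
HOST="${HOST:-your-instance.example.com}"

# run_check LABEL COMMAND...: run COMMAND quietly and report PASS or FAIL.
run_check() {
  local label="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    return 1
  fi
}

# verify_recovery: run every automatable item; returns non-zero if any failed.
verify_recovery() {
  local failed=0
  run_check "admin SSH"        ssh kids true || failed=1
  run_check "webapp loads"     curl -fsS "https://$HOST/" || failed=1
  run_check "health endpoint"  curl -fsS "https://$HOST/health" || failed=1
  run_check "parquet data"     ssh kids 'ls -A /data/src_data/parquet/ | grep -q .' || failed=1
  run_check "home directories" ssh kids 'ls -A /home/ | grep -q .' || failed=1
  run_check "webapp service"   ssh kids systemctl is-active --quiet webapp || failed=1
  run_check "notify-bot"       ssh kids systemctl is-active --quiet notify-bot || failed=1
  run_check "sync cron"        ssh kids 'sudo crontab -u deploy -l | grep -q update.sh' || failed=1
  return "$failed"
}

# Example:
#   verify_recovery || echo "recovery verification failed" >&2
```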
Preventive Measures
- GCP snapshots: Daily automatic snapshots of data-disk and home-disk (14-day retention)
- Setup script: server/setup-snapshot-schedule.sh configures snapshot policy
- Limits in git: server/limits-users.conf is version-controlled and deployed automatically
- All configs in git: sudoers, nginx, systemd services, management scripts
- Secrets in GitHub: .env is recreated by deploy.sh from GitHub Actions secrets
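Daily snapshots only help if the schedule is actually running, so it can be worth checking the age of the newest snapshot now and then. This sketch assumes GNU date. The function name is_fresh and the 26-hour threshold (one day plus some slack) are illustrative choices, not something the repo defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if TIMESTAMP (ISO 8601) is at most MAX_HOURS old.
# Requires GNU date for "date -d". An empty or unparsable timestamp returns 2.
# Usage: is_fresh TIMESTAMP MAX_HOURS
is_fresh() {
  local created now
  [ -n "$1" ] || return 2
  created=$(date -d "$1" +%s 2>/dev/null) || return 2
  now=$(date +%s)
  [ $(( now - created )) -le $(( $2 * 3600 )) ]
}

# Example: warn if the newest data-disk snapshot is older than 26 hours.
#   ts=$(gcloud compute snapshots list --project=kids-ai-data-analysis \
#          --filter="sourceDisk:data-disk" --sort-by="~creationTimestamp" \
#          --limit=1 --format="value(creationTimestamp)")
#   is_fresh "$ts" 26 || echo "data-disk snapshot is stale or missing" >&2
```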