Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs
Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}
Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config
All 118 tests pass.
Disaster Recovery
Recovery procedures for the Data Broker Server (data-broker-for-claude).
Overview
Disk Layout:
sda (10 GB) / System disk (instance) - EXPENDABLE
sdb (30 GB) /data Data disk - SNAPSHOTTED daily
sdc (30 GB) /home Home disk - SNAPSHOTTED daily
Key principle: sda is disposable. Everything on it is either in git or can be reinstalled. All unique data lives on sdb and sdc, which are independently snapshotted.
What Lives Where
| Location | Content | Recovery Method |
|---|---|---|
| sda: /opt/data-analyst/repo/ | Application code | git clone from GitHub |
| sda: /opt/data-analyst/.venv/ | Python packages | pip install -r requirements.txt |
| sda: /opt/data-analyst/.env | Application secrets | deploy.sh creates from GitHub secrets |
| sda: /etc/sudoers.d/ | Permissions | deploy.sh copies from repo |
| sda: /etc/security/limits.d/ | Resource limits | deploy.sh copies from repo |
| sda: /etc/nginx/ | Nginx config | deploy.sh or manual copy from repo |
| sda: /etc/letsencrypt/ | SSL certificate | certbot renews automatically |
| sdb: /data/src_data/parquet/ | Parquet data | Regenerate from Keboola (update.sh) or restore snapshot |
| sdb: /data/notifications/ | Notification state | Restore from snapshot |
| sdb: /data/docs/, /data/scripts/ | Docs & scripts | deploy.sh copies from repo |
| sdc: /home/*/ | User accounts, SSH keys, workspaces, scripts | Restore from snapshot |
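When a file goes missing, the first question is which disk its path lives on, since that decides between a snapshot restore and a plain redeploy. A minimal sketch of that lookup, assuming the mount points from the table above; `disk_for_path` is a hypothetical helper and not part of the repo.

```shell
# Hypothetical helper: map a path to the disk it lives on, following the
# layout in the table (/data on sdb, /home on sdc, everything else on sda).
disk_for_path() {
  case "$1" in
    /data|/data/*) echo "sdb snapshotted" ;;
    /home|/home/*) echo "sdc snapshotted" ;;
    *)             echo "sda expendable" ;;
  esac
}
```

For example, `disk_for_path /etc/nginx/nginx.conf` reports the expendable system disk, so the file should come back from the repo rather than from a snapshot.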
Scenario A: System Disk Failure (sda dies)
Impact: Server is down, but all user data is safe on sdb/sdc.
Recovery time: ~30 minutes
Steps
1. Create new VM (same zone, attach existing disks):

       # Create new instance with existing disks
       gcloud compute instances create data-broker-for-claude \
         --project=kids-ai-data-analysis \
         --zone=europe-north1-a \
         --machine-type=e2-medium \
         --image-family=debian-12 \
         --image-project=debian-cloud \
         --boot-disk-size=10GB \
         --tags=http-server,https-server
       # Attach existing data disks
       gcloud compute instances attach-disk data-broker-for-claude \
         --project=kids-ai-data-analysis \
         --zone=europe-north1-a \
         --disk=data-disk
       gcloud compute instances attach-disk data-broker-for-claude \
         --project=kids-ai-data-analysis \
         --zone=europe-north1-a \
         --disk=home-disk

2. SSH in and mount disks:

       # Mount data disk
       mkdir -p /data
       mount /dev/sdb /data
       # Mount home disk
       mount /dev/sdc /home
       # Add to fstab (get UUIDs with blkid)
       echo "UUID=$(blkid -s UUID -o value /dev/sdb) /data ext4 discard,defaults,nofail 0 2" >> /etc/fstab
       echo "UUID=$(blkid -s UUID -o value /dev/sdc) /home ext4 discard,defaults,nofail 0 2" >> /etc/fstab

3. Install prerequisites:

       apt-get update
       apt-get install -y git python3.11-venv python3-pip nginx certbot python3-certbot-nginx

4. Recreate deploy user and groups:

       # Create groups
       groupadd dataread
       groupadd data-private
       groupadd data-ops
       # Create deploy user
       useradd -m -s /bin/bash deploy
       usermod -aG data-ops deploy
       # Restore deploy SSH key (generate new one)
       sudo -u deploy ssh-keygen -t ed25519 -f /home/deploy/.ssh/id_ed25519 -N '' -C 'deploy@data-broker'
       sudo -u deploy bash -c 'echo -e "Host github.com\n IdentityFile ~/.ssh/id_ed25519\n StrictHostKeyChecking accept-new" > /home/deploy/.ssh/config'
       chmod 600 /home/deploy/.ssh/config
       # Add new public key to GitHub as Deploy Key
       cat /home/deploy/.ssh/id_ed25519.pub

5. Clone repo and run setup:

       mkdir -p /opt/data-analyst
       chown deploy:data-ops /opt/data-analyst
       sudo -u deploy git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst/repo
       git config --global --add safe.directory /opt/data-analyst/repo
       /opt/data-analyst/repo/server/setup.sh

6. Restore user accounts from /home:

       # Users already exist on home-disk, just recreate /etc/passwd entries
       # For each directory in /home (except deploy):
       for dir in /home/*/; do
         username=$(basename "$dir")
         [[ "$username" == "deploy" ]] && continue
         # Create user if not exists
         if ! id "$username" &>/dev/null; then
           useradd -M -d "/home/$username" -s /bin/bash "$username"
           usermod -aG dataread "$username"
         fi
       done

   Note: Group memberships (data-private, sudo, data-ops) need manual review. Check the admin list in server/limits-users.conf for admin users.

7. Trigger deploy via GitHub Actions (or manually):

       sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'

8. Set up SSL certificate:

       certbot --nginx -d your-instance.example.com

9. Restore crontab:

       sudo -u deploy crontab -e
       # Add:
       # MAILTO=admin@your-domain.com
       # 0 6,14,19 * * * cd /opt/data-analyst/repo && ./scripts/update.sh > /var/log/update.log 2>&1 || cat /var/log/update.log

10. Update external IP if it changed:
    - DNS: your-instance.example.com A record
    - GitHub secrets: SERVER_HOST
    - SSH configs of all users
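One gap in the user restore step is that a plain useradd hands out fresh UIDs, which may not match the numeric owners of the files already on home-disk. The sketch below prints (but does not run) useradd commands that reuse each directory's existing owner UID. It assumes GNU stat, as on Debian, and skips the lost+found directory that ext4 keeps at the filesystem root. `plan_user_restore` is a hypothetical name, and group IDs are not handled.

```shell
# Sketch: print one useradd command per home directory, reusing the numeric
# owner of the directory as the UID. Nothing is created; review the output first.
plan_user_restore() {
  local home_root="${1:-/home}" dir username uid
  for dir in "$home_root"/*/; do
    [ -d "$dir" ] || continue            # glob matched nothing
    username=$(basename "$dir")
    case "$username" in
      deploy|lost+found) continue ;;     # deploy is created separately
    esac
    uid=$(stat -c %u "$dir")             # GNU stat: numeric owner of the directory
    printf 'useradd -M -u %s -d %s -s /bin/bash -G dataread %s\n' \
      "$uid" "$home_root/$username" "$username"
  done
}
```

Running `plan_user_restore /home` on the recovered server lists the commands for review; piping the output to `sh` as root would apply them.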
Scenario B: Data Disk Failure (sdb/data-disk dies)
Impact: Parquet data lost, users unaffected.
Recovery time: ~10 minutes (from snapshot) or ~30 minutes (from Keboola)
Option 1: Restore from snapshot (faster)
# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
--filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5
# Create new disk from snapshot
gcloud compute disks create data-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM (may need to stop VM first)
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=data-disk
# Mount
ssh kids "sudo mount /dev/sdb /data"
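To avoid copying SNAPSHOT_NAME by hand, the listing command can be narrowed to the newest snapshot and its name captured in a variable. A minimal sketch using the same filter and sort order as above; `latest_snapshot` is a hypothetical helper, and the PROJECT override is an assumption added for convenience.

```shell
# Sketch: print the name of the newest snapshot of a given disk.
# Uses the same --filter and --sort-by as the listing command above,
# limited to one result and formatted as a bare name.
latest_snapshot() {
  local disk="$1" project="${PROJECT:-kids-ai-data-analysis}"
  gcloud compute snapshots list \
    --project="$project" \
    --filter="sourceDisk:$disk" \
    --sort-by=~creationTimestamp \
    --limit=1 \
    --format="value(name)"
}

# Usage (not run here):
#   SNAPSHOT_NAME=$(latest_snapshot data-disk)
#   [ -n "$SNAPSHOT_NAME" ] || { echo "no snapshot found" >&2; exit 1; }
```

The empty-name check matters because the disk would otherwise be created with an empty --source-snapshot value.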
Option 2: Regenerate from Keboola
# Create fresh disk
gcloud compute disks create data-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--size=30GB \
--type=pd-balanced
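# Attach, format, mount
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=data-disk
ssh kids "sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data"
# A freshly formatted disk has a new UUID: update the /data entry in /etc/fstab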
# Run deploy to recreate directory structure
ssh kids "sudo -u deploy bash -c 'cd /opt/data-analyst/repo && ./server/deploy.sh'"
# Regenerate parquet data from Keboola
ssh kids "cd /opt/data-analyst/repo && ./scripts/update.sh"
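If the mount fails silently, deploy.sh and update.sh would write into a plain /data directory on the 10 GB system disk. A small guard can catch that before anything is written. This is a sketch: `is_mounted` is a hypothetical helper, and the optional second argument exists only so the check can be pointed at a file other than /proc/mounts.

```shell
# Sketch: succeed only if the given mount point appears in the mount table.
# The second field of each /proc/mounts line is the mount point.
is_mounted() {
  local mountpoint="$1" table="${2:-/proc/mounts}"
  awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$table"
}

# Usage on the server (not run here):
#   is_mounted /data || { echo "/data is not mounted, aborting" >&2; exit 1; }
```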
Scenario C: Home Disk Failure (sdc/home-disk dies)
Impact: All user accounts, SSH keys, and personal workspaces lost.
Recovery time: ~10 minutes (from snapshot)
Restore from snapshot
# Find latest snapshot
gcloud compute snapshots list --project=kids-ai-data-analysis \
--filter="sourceDisk:home-disk" --sort-by=~creationTimestamp --limit=5
# Create new disk from snapshot
gcloud compute disks create home-disk \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM
gcloud compute instances attach-disk data-broker-for-claude \
--project=kids-ai-data-analysis \
--zone=europe-north1-a \
--disk=home-disk
# Mount
ssh kids "sudo mount /dev/sdc /home"
If no snapshot exists, users must re-register via https://your-instance.example.com.
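Scenario B (Option 1) and Scenario C differ only in disk name, device, and mount point, so the three commands can be generated from one template. The sketch below only prints the commands for review; `restore_plan` is a hypothetical helper, and the project, zone, and VM name are the ones used throughout this document.

```shell
# Sketch: print the create / attach / mount commands for restoring one disk
# from a snapshot. Nothing is executed.
restore_plan() {
  local disk="$1" snapshot="$2" device="$3" mountpoint="$4"
  local project="kids-ai-data-analysis" zone="europe-north1-a"
  local vm="data-broker-for-claude"
  cat <<EOF
gcloud compute disks create $disk --project=$project --zone=$zone --source-snapshot=$snapshot --type=pd-balanced
gcloud compute instances attach-disk $vm --project=$project --zone=$zone --disk=$disk
ssh kids "sudo mount $device $mountpoint"
EOF
}
```

For example, `restore_plan home-disk SNAPSHOT_NAME /dev/sdc /home` reproduces the Scenario C commands on three lines.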
Scenario D: Complete Server Loss (VM + all disks)
Recovery time: ~45 minutes
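- Follow Scenario A steps 1-5 (new VM, prerequisites, deploy user). The disks no longer exist at this point, so skip the attach-disk commands in step 1 and the mounts and fstab entries in step 2 for now.
- Restore data-disk from snapshot (Scenario B, Option 1)
- Restore home-disk from snapshot (Scenario C)
- Add the fstab entries from Scenario A step 2, then follow Scenario A steps 6-10 (user accounts, deploy, SSL, cron, IP)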
Verification Checklist
After any recovery, verify:
- ssh kids works (admin access)
- https://your-instance.example.com loads (webapp)
- https://your-instance.example.com/health returns OK
- At least one analyst can SSH in
- ls /data/src_data/parquet/ shows data
- ls /home/ shows user directories
- systemctl status webapp is active
- systemctl status notify-bot is active
- sudo crontab -u deploy -l shows data sync cron
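Most of the checklist can be scripted so that a recovery ends with a single pass/fail summary. A minimal sketch of such a runner follows; `run_check` and the `FAILED` counter are hypothetical names, and the commands in the usage comment are one possible translation of the checklist items, not an existing script in the repo.

```shell
# Sketch: run a command quietly, print PASS or FAIL with a description,
# and count failures in FAILED.
FAILED=0
run_check() {
  local description="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $description"
  else
    echo "FAIL  $description"
    FAILED=$((FAILED + 1))
  fi
}

# Usage on the server (not run here):
#   run_check "parquet data present" sh -c 'ls -A /data/src_data/parquet/ | grep -q .'
#   run_check "webapp active"        systemctl is-active --quiet webapp
#   run_check "notify-bot active"    systemctl is-active --quiet notify-bot
#   run_check "health endpoint"      curl -fsS https://your-instance.example.com/health
#   run_check "data sync cron"       sh -c 'sudo crontab -u deploy -l | grep -q update.sh'
#   [ "$FAILED" -eq 0 ] || echo "$FAILED check(s) failed"
```

The SSH checks (admin and analyst access) still need to be done by hand from a client machine.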
Preventive Measures
- GCP snapshots: Daily automatic snapshots of data-disk and home-disk (14-day retention)
- Setup script: server/setup-snapshot-schedule.sh configures snapshot policy
- Limits in git: server/limits-users.conf is version-controlled and deployed automatically
- All configs in git: sudoers, nginx, systemd services, management scripts
- Secrets in GitHub: .env is recreated by deploy.sh from GitHub Actions secrets
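The contents of server/setup-snapshot-schedule.sh are not shown here. As a rough illustration only, a daily schedule with 14-day retention could be created and attached to both disks along these lines. The policy name, the 03:00 start time, and the keep-auto-snapshots choice are assumptions, and the region is inferred from the zone used above.

```shell
# Sketch: create a daily snapshot schedule with 14-day retention and attach
# it to data-disk and home-disk. Policy name and start time are assumptions.
setup_snapshot_schedule() {
  local policy="${1:-daily-snapshots-14d}" disk
  local project="kids-ai-data-analysis"
  local region="europe-north1" zone="europe-north1-a"
  gcloud compute resource-policies create snapshot-schedule "$policy" \
    --project="$project" \
    --region="$region" \
    --daily-schedule \
    --start-time=03:00 \
    --max-retention-days=14 \
    --on-source-disk-delete=keep-auto-snapshots
  for disk in data-disk home-disk; do
    gcloud compute disks add-resource-policies "$disk" \
      --project="$project" \
      --zone="$zone" \
      --resource-policies="$policy"
  done
}
```

Calling `setup_snapshot_schedule` once is enough; rerunning it with the same policy name would fail at the create step because the policy already exists.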