agnes-the-ai-analyst/dev_docs/disaster-recovery.md
ZdenekSrotyr c8e232e43e docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture
- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references,
  update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to
  config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk
  snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo
  sections; keep Docker Compose operations, log viewing, health checks,
  sync/admin CLI, and Jira webhook procedures
2026-04-09 18:44:25 +02:00

4.9 KiB

Disaster Recovery

Recovery procedures for the AI Data Analyst Docker deployment.

Overview

What lives where:
  Docker volumes  /data        DuckDB files, parquet extracts, state
  Git             repo/        Application code — rebuild from GitHub
  .env            secrets      Recreate from GitHub Secrets / 1Password

Key principle: the container is disposable. All unique data lives in the /data Docker volume (or a GCP persistent disk mounted at /data). Re-pulling the image and restoring /data brings the service back to full operation.

Data Layout

Path Content Backup
/data/state/system.duckdb Table registry, users, sync state Daily snapshot
/data/analytics/server.duckdb Master analytics DB (views) Regenerated on start
/data/extracts/*/extract.duckdb Per-source extract DBs Daily snapshot
/data/extracts/*/data/*.parquet Parquet files (local sources) Daily snapshot

analytics/server.duckdb is rebuilt automatically by SyncOrchestrator.rebuild() on every startup, so it does not need to be backed up separately.

Scenario A: Container Crash / Bad Deploy

Impact: Service down, data intact.

Recovery time: ~2 minutes

# Pull latest image and restart
docker compose pull
docker compose up -d

# Check health
curl https://your-instance.example.com/health

If a bad image was pushed, roll back to the previous tag:

docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d

Scenario B: /data Volume Corruption or Loss

Impact: All DuckDB state and parquet data lost.

Recovery time: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)

Option 1: Restore from GCP disk snapshot (faster)

# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM and mount
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

# Restart containers
docker compose up -d

Option 2: Regenerate from source

# Start with empty /data volume
docker compose up -d

# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
# Or via CLI:
docker compose exec app da sync

DuckDB extract files and parquet will be repopulated from Keboola / BigQuery. system.duckdb (table registry, users) must be restored from snapshot if not regenerated — user accounts and table definitions are not recreated by sync.

Scenario C: Complete VM Loss

Recovery time: ~20 minutes

  1. Create new VM (or use managed instance group):

    gcloud compute instances create your-server \
      --project=your-gcp-project \
      --zone=europe-north1-a \
      --machine-type=e2-medium \
      --image-family=debian-12 \
      --image-project=debian-cloud
    
  2. Install Docker:

    curl -fsSL https://get.docker.com | sh
    
  3. Attach and mount the data disk (or restore from snapshot per Scenario B):

    gcloud compute instances attach-disk your-server \
      --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
    # Add mount to /etc/fstab and mount /data
    
  4. Clone repo and create .env:

    git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
    cd /opt/data-analyst
    cp config/.env.template .env
    # Fill in secrets from GitHub Secrets / 1Password
    
  5. Start the stack:

    docker compose up -d
    
  6. Update DNS if the external IP changed:

    • A record for your-instance.example.com

Verification Checklist

After any recovery, verify:

  • docker compose ps — all services Up
  • https://your-instance.example.com/health returns {"status": "ok"}
  • Login works (Google OAuth or email magic link)
  • At least one table appears in the data catalog
  • docker compose logs app — no ERROR lines at startup

Preventive Measures

  • GCP snapshots: Daily automatic snapshots of the /data persistent disk (14-day retention). Configure via:
    gcloud compute resource-policies create snapshot-schedule daily-backup \
      --project=your-gcp-project \
      --region=europe-north1 \
      --max-retention-days=14 \
      --on-source-disk-delete=keep-auto-snapshots \
      --daily-schedule \
      --start-time=03:00
    gcloud compute disks add-resource-policies data-disk \
      --project=your-gcp-project --zone=europe-north1-a \
      --resource-policies=daily-backup
    
  • Secrets in GitHub / 1Password: .env is never committed; recreate from stored secrets
  • Image tags: Pin a known-good image tag in docker-compose.yml before each deploy