Disaster Recovery
Recovery procedures for the AI Data Analyst Docker deployment.
Overview
What lives where:
- Docker volume /data: DuckDB files, parquet extracts, state
- Git repo: application code; rebuild from GitHub
- .env secrets: recreate from GitHub Secrets / 1Password
Key principle: the container is disposable. All unique data lives in the /data
Docker volume (or a GCP persistent disk mounted at /data). Re-pulling the image
and restoring /data brings the service back to full operation.
Data Layout
| Path | Content | Backup |
|---|---|---|
| /data/state/system.duckdb | Table registry, users, sync state | Daily snapshot |
| /data/analytics/server.duckdb | Master analytics DB (views) | Regenerated on start |
| /data/extracts/*/extract.duckdb | Per-source extract DBs | Daily snapshot |
| /data/extracts/*/data/*.parquet | Parquet files (local sources) | Daily snapshot |
analytics/server.duckdb is rebuilt automatically by SyncOrchestrator.rebuild()
on every startup, so it does not need to be backed up separately.
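To see at a glance whether a mounted volume still holds the files that need a backup, a small check along these lines could be run on the host. This is a minimal sketch: the function name check_data_layout is hypothetical, the paths are taken from the table above, and analytics/server.duckdb is skipped on purpose because it is regenerated on start.

```shell
# Hypothetical helper: report whether the backed-up files exist under a data root.
# Usage: check_data_layout [DATA_ROOT]   (DATA_ROOT defaults to /data)
check_data_layout() {
  cdl_root="${1:-/data}"
  cdl_missing=0

  if [ -f "$cdl_root/state/system.duckdb" ]; then
    echo "ok      state/system.duckdb"
  else
    echo "MISSING state/system.duckdb"
    cdl_missing=1
  fi

  # Count per-source extract databases; the glob may match nothing.
  cdl_count=0
  for cdl_f in "$cdl_root"/extracts/*/extract.duckdb; do
    if [ -f "$cdl_f" ]; then
      cdl_count=$((cdl_count + 1))
    fi
  done
  echo "extracts: $cdl_count"
  if [ "$cdl_count" -eq 0 ]; then
    cdl_missing=1
  fi

  # Non-zero exit status means something from the table is absent.
  return "$cdl_missing"
}
```

A non-zero exit status suggests going to Scenario B rather than simply restarting the container.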
Scenario A: Container Crash / Bad Deploy
Impact: Service down, data intact.
Recovery time: ~2 minutes
# Pull latest image and restart
docker compose pull
docker compose up -d
# Check health
curl https://your-instance.example.com/health
If a bad image was pushed, roll back to the previous tag:
docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
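After a restart or rollback the service may take a few seconds to answer, so a single curl can fail spuriously. The sketch below polls the health endpoint until it reports ok. The function name wait_for_health and its defaults are hypothetical; the expected response body {"status": "ok"} is the one listed in the verification checklist.

```shell
# Poll a health URL until it reports ok, or give up after a number of attempts.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_health() {
  wfh_url="$1"
  wfh_attempts="${2:-30}"
  wfh_delay="${3:-2}"
  wfh_i=1
  while [ "$wfh_i" -le "$wfh_attempts" ]; do
    # -f: treat HTTP errors as failures; -sS: quiet; --max-time: cap each try.
    if curl -fsS --max-time 5 "$wfh_url" 2>/dev/null | grep -q '"status": *"ok"'; then
      echo "healthy after $wfh_i attempt(s)"
      return 0
    fi
    wfh_i=$((wfh_i + 1))
    sleep "$wfh_delay"
  done
  echo "still unhealthy after $wfh_attempts attempt(s)" >&2
  return 1
}
```

For example, `docker compose up -d && wait_for_health https://your-instance.example.com/health` returns non-zero if the service never becomes healthy, which is a reasonable cue to roll back.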
Scenario B: /data Volume Corruption or Loss
Impact: All DuckDB state and parquet data lost.
Recovery time: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)
Option 1: Restore from GCP disk snapshot (faster)
# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
--filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5
# Create new disk from snapshot
gcloud compute disks create data-disk \
--project=your-gcp-project \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM (then mount the disk at /data, as in Scenario C)
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=data-disk
# Restart containers
docker compose up -d
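The SNAPSHOT_NAME placeholder above can be filled in by hand from the list output, or captured in a variable. A minimal sketch, assuming the same filter and sort order as the list command; the function name latest_snapshot is hypothetical, and `--format="value(name)"` asks gcloud to print only the name column.

```shell
# Print the name of the newest snapshot of a given source disk.
# Usage: latest_snapshot PROJECT [DISK_NAME]   (DISK_NAME defaults to data-disk)
latest_snapshot() {
  gcloud compute snapshots list \
    --project="$1" \
    --filter="sourceDisk:${2:-data-disk}" \
    --sort-by=~creationTimestamp \
    --limit=1 \
    --format="value(name)"
}
```

Usage would look like `SNAPSHOT_NAME=$(latest_snapshot your-gcp-project)`; it is worth echoing the value before passing it to `--source-snapshot`, since an empty result means no snapshot matched.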
Option 2: Regenerate from source
# Start with empty /data volume
docker compose up -d
# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
# Or via CLI:
docker compose exec app da sync
DuckDB extract files and parquet will be repopulated from Keboola / BigQuery.
system.duckdb (table registry, users) is the exception: sync does not recreate
user accounts or table definitions, so restore this file from a snapshot unless
you intend to re-create them by hand.
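The two ways of starting a sync can be combined so that the CLI is used only when the API call fails. This is a sketch under the assumption that both entry points shown above behave the same way; the function name trigger_sync is hypothetical.

```shell
# Trigger a full sync via the HTTP API, falling back to the CLI in the container.
# Usage: trigger_sync [BASE_URL]   (BASE_URL defaults to http://localhost:8000)
trigger_sync() {
  ts_base="${1:-http://localhost:8000}"
  # -f makes curl return non-zero on HTTP errors so the fallback can kick in.
  if curl -fsS -X POST "$ts_base/api/sync/trigger"; then
    echo "sync triggered via API"
  else
    echo "API call failed, falling back to CLI" >&2
    docker compose exec app da sync
  fi
}
```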
Scenario C: Complete VM Loss
Recovery time: ~20 minutes
- Create new VM (or use managed instance group):

      gcloud compute instances create your-server \
        --project=your-gcp-project \
        --zone=europe-north1-a \
        --machine-type=e2-medium \
        --image-family=debian-12 \
        --image-project=debian-cloud

- Install Docker:

      curl -fsSL https://get.docker.com | sh

- Attach and mount the data disk (or restore from snapshot per Scenario B):

      gcloud compute instances attach-disk your-server \
        --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
      # Add mount to /etc/fstab and mount /data

- Clone repo and create .env:

      git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
      cd /opt/data-analyst
      cp config/.env.template .env
      # Fill in secrets from GitHub Secrets / 1Password

- Start the stack:

      docker compose up -d

- Update DNS if the external IP changed:
  - A record for your-instance.example.com
  - A record for
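The "add mount to /etc/fstab" step is not spelled out above. One possible way to do it is to mount by filesystem UUID so the entry survives device renames. This is a sketch with stated assumptions: the function name fstab_line is hypothetical, the filesystem is assumed to be ext4, and the mount options are a common choice for GCP persistent disks rather than something this document prescribes.

```shell
# Print an /etc/fstab line that mounts DEVICE at MOUNTPOINT by filesystem UUID.
# Usage: fstab_line DEVICE [MOUNTPOINT]   (MOUNTPOINT defaults to /data)
fstab_line() {
  # blkid prints only the UUID value with these flags; it may require root.
  fl_uuid=$(blkid -s UUID -o value "$1") || return 1
  [ -n "$fl_uuid" ] || return 1
  # ext4 and the option list are assumptions; adjust to the actual filesystem.
  printf 'UUID=%s %s ext4 discard,defaults,nofail 0 2\n' "$fl_uuid" "${2:-/data}"
}

# Example (the device path is an assumption; check `ls /dev/disk/by-id/` first):
#   sudo mkdir -p /data
#   fstab_line /dev/disk/by-id/google-data-disk | sudo tee -a /etc/fstab
#   sudo mount /data
```

The nofail option keeps the VM bootable if the data disk is detached, which matters during a rebuild.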
Verification Checklist
After any recovery, verify:
- docker compose ps: all services Up
- https://your-instance.example.com/health returns {"status": "ok"}
- Login works (Google OAuth or email magic link)
- At least one table appears in the data catalog
- docker compose logs app: no ERROR lines at startup
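Two of the checks above, the health endpoint and the startup logs, lend themselves to scripting; login and the data catalog still need a manual look. A minimal sketch, assuming the service is named app as in the commands above; the function name verify_recovery is hypothetical.

```shell
# Run the scriptable parts of the checklist and report PASS/FAIL per check.
# Usage: verify_recovery HEALTH_URL
verify_recovery() {
  vr_url="$1"
  vr_fail=0

  # Check: health endpoint returns {"status": "ok"}.
  if curl -fsS --max-time 10 "$vr_url" 2>/dev/null | grep -q '"status": *"ok"'; then
    echo "PASS health"
  else
    echo "FAIL health"
    vr_fail=1
  fi

  # Check: no ERROR lines in the app logs.
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  vr_errors=$(docker compose logs --no-color app 2>&1 | grep -c 'ERROR' || true)
  if [ "$vr_errors" -eq 0 ]; then
    echo "PASS logs"
  else
    echo "FAIL logs ($vr_errors ERROR lines)"
    vr_fail=1
  fi

  return "$vr_fail"
}
```

Note that this counts ERROR lines across the whole retained log, not only at startup, so an old error from before the recovery can produce a false FAIL.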
Preventive Measures
- GCP snapshots: Daily automatic snapshots of the /data persistent disk (14-day retention). Configure via:

      gcloud compute resource-policies create snapshot-schedule daily-backup \
        --project=your-gcp-project \
        --region=europe-north1 \
        --max-retention-days=14 \
        --on-source-disk-delete=keep-auto-snapshots \
        --daily-schedule \
        --start-time=03:00
      gcloud compute disks add-resource-policies data-disk \
        --project=your-gcp-project --zone=europe-north1-a \
        --resource-policies=daily-backup

- Secrets in GitHub / 1Password: .env is never committed; recreate from stored secrets
- Image tags: Pin a known-good image tag in docker-compose.yml before each deploy
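Whether every image is actually pinned can be checked before a deploy. The sketch below lists image references in a compose file that carry no tag or the floating latest tag. The function name unpinned_images is hypothetical, and it only understands plain `image:` lines without quotes or variable substitution.

```shell
# Print image references that have no tag or use ":latest".
# Usage: unpinned_images [COMPOSE_FILE]   (defaults to docker-compose.yml)
unpinned_images() {
  # Keep only "image:" lines, strip the key and any trailing comment,
  # then let awk decide based on the last path component of the reference.
  grep -E '^[[:space:]]*image:' "${1:-docker-compose.yml}" \
    | sed -E 's/^[[:space:]]*image:[[:space:]]*//; s/[[:space:]]*(#.*)?$//' \
    | awk '
        /@sha256:/ { next }                      # digest-pinned: fine
        { n = split($0, parts, "/"); last = parts[n] }
        last !~ /:/ || last ~ /:latest$/ { print }
      '
}
```

Empty output means every image carries an explicit tag or digest; anything printed is a candidate for pinning before the next deploy.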