agnes-the-ai-analyst/dev_docs/disaster-recovery.md
ZdenekSrotyr c8e232e43e docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture
- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references,
  update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to
  config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk
  snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo
  sections; keep Docker Compose operations, log viewing, health checks,
  sync/admin CLI, and Jira webhook procedures
2026-04-09 18:44:25 +02:00

# Disaster Recovery
Recovery procedures for the AI Data Analyst Docker deployment.
## Overview
```
What lives where:
  Docker volume /data    DuckDB files, parquet extracts, state
  Git repo               Application code (rebuild from GitHub)
  .env secrets           Recreate from GitHub Secrets / 1Password
```
**Key principle**: the container is disposable. All unique data lives in the `/data`
Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image,
recreating `.env`, and restoring `/data` brings the service back to full operation.
## Data Layout
| Path | Content | Backup |
|------|---------|--------|
| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |
`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
on every startup, so it does not need to be backed up separately.
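For an ad-hoc file-level copy alongside the disk snapshots, the table above suggests archiving everything except the regenerated analytics database. A minimal sketch, assuming GNU tar and that the stack is stopped first so the DuckDB files are not being written; the function name `backup_data` and the archive path are illustrative and not part of the project:

```bash
# Hypothetical helper: archive the data directory, skipping the regenerated analytics DB.
# Usage: backup_data /data /tmp/data-backup.tar.gz
backup_data() (
  data_dir="$1"
  out="$2"
  # analytics/server.duckdb is rebuilt on startup, so it is left out of the archive.
  tar -czf "$out" -C "$data_dir" --exclude='analytics/server.duckdb' .
)
```

Write the archive outside the data directory so it is not picked up by its own run.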
## Scenario A: Container Crash / Bad Deploy
**Impact**: Service down, data intact.
**Recovery time**: ~2 minutes
```bash
# Pull latest image and restart
docker compose pull
docker compose up -d
# Check health
curl https://your-instance.example.com/health
```
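The container may need a few seconds before `/health` responds, so a single `curl` right after `up -d` can fail spuriously. The following is a sketch of a polling loop; the function name, attempt count, and delay are assumptions chosen for illustration:

```bash
# Hypothetical helper: poll a health URL until it answers or the attempts run out.
# Usage: wait_for_health https://your-instance.example.com/health 30 2
wait_for_health() (
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors (4xx/5xx).
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
)
```

The non-zero return value makes it usable as a gate in a deploy script.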
If a bad image was pushed, roll back to the previous tag:
```bash
docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
```
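Rolling back is easier when the previously deployed tag is written down somewhere. One possible approach, sketched here under the assumption that the installed Compose version supports `docker compose config --images`, is to log the image references before each deploy; the function name and log file are hypothetical:

```bash
# Hypothetical helper: append the image references from the compose file to a log
# so the previously deployed tag is easy to find during a rollback.
# Usage: record_images deploy-history.log
record_images() (
  log_file="$1"
  stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  docker compose config --images | while IFS= read -r image; do
    printf '%s %s\n' "$stamp" "$image"
  done >> "$log_file"
)
```

Run it from the directory that contains `docker-compose.yml`, before `docker compose pull`.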
## Scenario B: /data Volume Corruption or Loss
**Impact**: All DuckDB state and parquet data lost.
**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)
### Option 1: Restore from GCP disk snapshot (faster)
```bash
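# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
--filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5
# If the damaged disk still exists, stop the stack, unmount and detach it first,
# and pick a different name below (disk names must be unique within a zone)
# Create new disk from snapshot
gcloud compute disks create data-disk \
--project=your-gcp-project \
--zone=europe-north1-a \
--source-snapshot=SNAPSHOT_NAME \
--type=pd-balanced
# Attach to VM (--device-name exposes it as /dev/disk/by-id/google-data-disk)
gcloud compute instances attach-disk your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--disk=data-disk \
--device-name=data-disk
# On the VM: mount the restored disk at /data
sudo mount /dev/disk/by-id/google-data-disk /data
# Restart containers
docker compose up -d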
```
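To avoid copying `SNAPSHOT_NAME` by hand, the list command can be narrowed to a single name. This is a sketch only; the function name is hypothetical and it assumes snapshots were taken from a disk whose name matches the filter:

```bash
# Hypothetical helper: print the name of the most recent snapshot of a disk.
# Usage: SNAPSHOT_NAME=$(latest_snapshot your-gcp-project data-disk)
latest_snapshot() (
  project="$1"
  disk="$2"
  gcloud compute snapshots list --project="$project" \
    --filter="sourceDisk:$disk" --sort-by="~creationTimestamp" \
    --limit=1 --format="value(name)"
)
```

An empty result means no matching snapshot exists, in which case Option 2 is the only path.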
### Option 2: Regenerate from source
```bash
# Start with empty /data volume
docker compose up -d
# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
# Or via CLI:
docker compose exec app da sync
```
A sync repopulates the DuckDB extract files and parquet data from Keboola / BigQuery.
It does not recreate user accounts or table definitions, so `system.duckdb`
(table registry, users) must still be restored from a snapshot.
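Whether a sync alone is enough therefore depends on whether the registry file survived. A small check, written as a sketch with a hypothetical function name and the path from the Data Layout table:

```bash
# Hypothetical helper: succeed only if the registry database exists and is not empty.
# Usage: registry_present /data || echo "restore system.duckdb from a snapshot"
registry_present() (
  data_dir="${1:-/data}"
  # -s: file exists and has a size greater than zero
  [ -s "$data_dir/state/system.duckdb" ]
)
```

A non-empty file is not proof that the database is intact; it only rules out the obvious case of a missing registry.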
## Scenario C: Complete VM Loss
**Impact**: VM and containers gone; data survives only on the persistent disk or its snapshots.
**Recovery time**: ~20 minutes
1. **Create new VM** (or use managed instance group):
```bash
gcloud compute instances create your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud
```
2. **Install Docker**:
```bash
curl -fsSL https://get.docker.com | sh
```
3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
```bash
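gcloud compute instances attach-disk your-server \
--project=your-gcp-project --zone=europe-north1-a \
--disk=data-disk --device-name=data-disk
# On the VM: persist the mount in /etc/fstab (assumes ext4), then mount /data
echo '/dev/disk/by-id/google-data-disk /data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
sudo mkdir -p /data && sudo mount /data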
```
4. **Clone repo and create .env**:
```bash
git clone git@github.com:your-org/ai-data-analyst.git /opt/data-analyst
cd /opt/data-analyst
cp config/.env.template .env
# Fill in secrets from GitHub Secrets / 1Password
```
5. **Start the stack**:
```bash
docker compose up -d
```
6. **Update DNS** if the external IP changed:
- A record for `your-instance.example.com`
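Before starting the stack in step 5, it can help to confirm that the copied `.env` template was actually filled in. A minimal sketch: the function name is hypothetical, and the two variable names in the usage line are taken from the commit note at the top of this file; `config/.env.template` remains the authoritative list.

```bash
# Hypothetical helper: fail if any required variable is missing or empty in an env file.
# Usage: check_env .env JWT_SECRET_KEY SESSION_SECRET
check_env() (
  env_file="$1"
  shift
  missing=0
  for name in "$@"; do
    # Match NAME=<at least one character>; commented-out lines do not count.
    if ! grep -Eq "^${name}=.+" "$env_file"; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
)
```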
## Verification Checklist
After any recovery, verify:
- [ ] `docker compose ps` — all services `Up`
- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
- [ ] Login works (Google OAuth or email magic link)
- [ ] At least one table appears in the data catalog
- [ ] `docker compose logs app` — no ERROR lines at startup
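The last checklist item can be scripted rather than eyeballed. The sketch below counts `ERROR` lines on standard input; the function name is hypothetical, and matching the literal string `ERROR` is an assumption about the application's log format:

```bash
# Hypothetical helper: count ERROR lines on stdin; succeeds only when there are none.
# Usage: docker compose logs app | no_error_lines
no_error_lines() (
  # grep -c prints the count but exits 1 when the count is 0, hence "|| true".
  count=$(grep -c 'ERROR' || true)
  echo "ERROR lines: $count"
  [ "$count" -eq 0 ]
)
```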
## Preventive Measures
- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
(14-day retention). Configure via:
```bash
gcloud compute resource-policies create snapshot-schedule daily-backup \
--project=your-gcp-project \
--region=europe-north1 \
--max-retention-days=14 \
--on-source-disk-delete=keep-auto-snapshots \
--daily-schedule \
--start-time=03:00
gcloud compute disks add-resource-policies data-disk \
--project=your-gcp-project --zone=europe-north1-a \
--resource-policies=daily-backup
```
- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy
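A snapshot schedule only protects the disk if it is actually attached, which is easy to lose when a disk is recreated during recovery. The following sketch checks the attachment; the function name is hypothetical, and it assumes `gcloud` prints multiple attached policies separated by semicolons:

```bash
# Hypothetical helper: succeed if the named snapshot schedule is attached to the disk.
# Usage: schedule_attached your-gcp-project europe-north1-a data-disk daily-backup
schedule_attached() (
  project="$1"; zone="$2"; disk="$3"; policy="$4"
  # resourcePolicies holds the full URLs of the attached policies.
  gcloud compute disks describe "$disk" --project="$project" --zone="$zone" \
    --format="value(resourcePolicies)" | tr ';' '\n' \
    | grep -q "/resourcePolicies/${policy}\$"
)
```

Running it after Scenario B or C confirms that a disk restored from a snapshot is itself being backed up again.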