agnes-the-ai-analyst/dev_docs/disaster-recovery.md

# Disaster Recovery

Recovery procedures for the AI Data Analyst Docker deployment.

## Overview

```
What lives where:
  Docker volumes  /data        DuckDB files, parquet extracts, state
  Git             repo/        Application code — rebuild from GitHub
  .env            secrets      Recreate from GitHub Secrets / 1Password
```

**Key principle**: the container is disposable. All unique data lives in the `/data`
Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image
and restoring `/data` brings the service back to full operation.

## Data Layout

| Path | Content | Backup |
|------|---------|--------|
| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |

`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
on every startup, so it does not need to be backed up separately.

## Scenario A: Container Crash / Bad Deploy

**Impact**: Service down, data intact.

**Recovery time**: ~2 minutes

```bash
# Pull latest image and restart
docker compose pull
docker compose up -d

# Check health
curl https://your-instance.example.com/health
```

If a bad image was pushed, roll back to the previous tag:
```bash
docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
```

## Scenario B: /data Volume Corruption or Loss

**Impact**: All DuckDB state and parquet data lost.

**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)

### Option 1: Restore from GCP disk snapshot (faster)

```bash
# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM and mount
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

# Restart containers
docker compose up -d
```

### Option 2: Regenerate from source

```bash
# Start with empty /data volume
docker compose up -d

# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
```

DuckDB extract files and parquet will be repopulated from Keboola / BigQuery.
`system.duckdb` (table registry, users) must be restored from snapshot if
not regenerated — user accounts and table definitions are not recreated by sync.

## Scenario C: Complete VM Loss

**Recovery time**: ~20 minutes

1. **Create new VM** (or use managed instance group):
   ```bash
   gcloud compute instances create your-server \
     --project=your-gcp-project \
     --zone=europe-north1-a \
     --machine-type=e2-medium \
     --image-family=debian-12 \
     --image-project=debian-cloud
   ```

2. **Install Docker**:
   ```bash
   curl -fsSL https://get.docker.com | sh
   ```

3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
   ```bash
   gcloud compute instances attach-disk your-server \
     --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
   # Add mount to /etc/fstab and mount /data
   ```

4. **Clone repo and create .env**:
   ```bash
   git clone git@github.com:keboola/agnes-the-ai-analyst.git <install-dir>
   cd <install-dir>
   cp config/.env.template .env
   # Fill in secrets from GitHub Secrets / 1Password
   ```

5. **Start the stack**:
   ```bash
   docker compose up -d
   ```

6. **Update DNS** if the external IP changed:
   - A record for `your-instance.example.com`

## Verification Checklist

After any recovery, verify:

- [ ] `docker compose ps` — all services `Up`
- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
- [ ] Login works (Google OAuth or email magic link)
- [ ] At least one table appears in the data catalog
- [ ] `docker compose logs app` — no ERROR lines at startup

## Preventive Measures

- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
  (14-day retention). Configure via:
  ```bash
  gcloud compute resource-policies create snapshot-schedule daily-backup \
    --project=your-gcp-project \
    --region=europe-north1 \
    --max-retention-days=14 \
    --on-source-disk-delete=keep-auto-snapshots \
    --daily-schedule \
    --start-time=03:00
  gcloud compute disks add-resource-policies data-disk \
    --project=your-gcp-project --zone=europe-north1-a \
    --resource-policies=daily-backup
  ```
- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy