agnes-the-ai-analyst/dev_docs/disaster-recovery.md
ZdenekSrotyr 8233c3e3f9 chore(docs): replace stale da verbs and vendor-specific install paths
Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.

Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.

CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.
2026-05-04 21:22:19 +02:00


# Disaster Recovery
Recovery procedures for the AI Data Analyst Docker deployment.
## Overview
```
What lives where:
  Docker volumes   /data     DuckDB files, parquet extracts, state
  Git              repo/     Application code (rebuild from GitHub)
  .env             secrets   Recreate from GitHub Secrets / 1Password
```
**Key principle**: the container is disposable. All unique data lives in the `/data`
Docker volume (or a GCP persistent disk mounted at `/data`). Re-pulling the image
and restoring `/data` brings the service back to full operation.
## Data Layout
| Path | Content | Backup |
|------|---------|--------|
| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |
`analytics/server.duckdb` is rebuilt automatically by `SyncOrchestrator.rebuild()`
on every startup, so it does not need to be backed up separately.
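Before choosing between a restore and a regenerate, it can help to see which of these files are actually present. The following is a minimal sketch, not part of the project: `check_data_layout` is a hypothetical helper name, and the paths are taken from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which files from the Data Layout table exist.
# Usage: check_data_layout /data
check_data_layout() {
    local root="${1:-/data}"
    local f count=0

    # server.duckdb is rebuilt on startup, so its absence is only informational.
    for f in state/system.duckdb analytics/server.duckdb; do
        if [ -f "$root/$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
        fi
    done

    # Count per-source extract DBs; an unmatched glob stays literal and is skipped.
    for f in "$root"/extracts/*/extract.duckdb; do
        if [ -f "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "extracts $count"

    # Succeed only if the file that sync cannot recreate is present.
    [ -f "$root/state/system.duckdb" ]
}
```

A non-zero exit status here suggests that `system.duckdb` has to come from a snapshot rather than from a fresh sync.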
## Scenario A: Container Crash / Bad Deploy
**Impact**: Service down, data intact.
**Recovery time**: ~2 minutes
```bash
# Pull latest image and restart
docker compose pull
docker compose up -d
# Check health
curl https://your-instance.example.com/health
```
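The container may need a few seconds before `/health` responds, so a single `curl` right after `docker compose up -d` can fail spuriously. A minimal sketch of a retry loop follows; `retry_until_ok` is a hypothetical helper, and the attempt count and delay in the example are arbitrary assumptions.

```bash
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until_ok() {
    local attempts="$1" delay="$2"
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (the hostname is the placeholder used in this document):
#   retry_until_ok 12 5 curl -fsS https://your-instance.example.com/health
```

With `curl -f`, an HTTP error status counts as a failed attempt, so the loop keeps polling until the endpoint answers successfully.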
If a bad image was pushed, roll back to the previous tag:
```bash
docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
```
## Scenario B: /data Volume Corruption or Loss
**Impact**: All DuckDB state and parquet data lost.
**Recovery time**: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)
### Option 1: Restore from GCP disk snapshot (faster)
```bash
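# Run the gcloud commands from a machine with access to the project.
# Find the latest snapshots
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5
# Create a new disk from the chosen snapshot.
# Disk names are unique per project and zone: if the damaged data-disk still
# exists, detach and delete it first, or give the new disk a different name.
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced
# Attach the disk to the VM
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk
# On the VM: mount the disk at /data, then restart the containers
docker compose up -d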
```
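To avoid copying `SNAPSHOT_NAME` by hand, the listing can be narrowed to a single name. This is a sketch under assumptions: `latest_snapshot` is a hypothetical helper, and the `GCLOUD` override exists only so the function can be dry-run with a stub.

```bash
#!/usr/bin/env bash
# Print the name of the most recent snapshot of a disk.
# GCLOUD can be overridden (for example with a stub) for a dry run.
# Usage: latest_snapshot PROJECT DISK
latest_snapshot() {
    local project="$1" disk="$2"
    "${GCLOUD:-gcloud}" compute snapshots list \
        --project="$project" \
        --filter="sourceDisk:$disk" \
        --sort-by=~creationTimestamp \
        --limit=1 \
        --format="value(name)"
}

# Example:
#   SNAPSHOT_NAME="$(latest_snapshot your-gcp-project data-disk)"
```

Check the printed name against the full listing before creating a disk from it, since the newest snapshot may postdate the corruption.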
### Option 2: Regenerate from source
```bash
# Start with empty /data volume
docker compose up -d
# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger
```
The DuckDB extract files and parquet data are repopulated from Keboola / BigQuery.
Sync does not recreate user accounts or table definitions, so `system.duckdb`
(table registry, users) must still be restored from a snapshot; otherwise users and table registrations have to be set up again.
## Scenario C: Complete VM Loss
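**Impact**: VM and service lost; `/data` survives only if it lives on a separate persistent disk or in a snapshot.
**Recovery time**: ~20 minutes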
1. **Create a new VM** (or use a managed instance group):
```bash
gcloud compute instances create your-server \
--project=your-gcp-project \
--zone=europe-north1-a \
--machine-type=e2-medium \
--image-family=debian-12 \
--image-project=debian-cloud
```
2. **Install Docker**:
```bash
curl -fsSL https://get.docker.com | sh
```
3. **Attach and mount the data disk** (or restore from snapshot per Scenario B):
```bash
gcloud compute instances attach-disk your-server \
--project=your-gcp-project --zone=europe-north1-a --disk=data-disk
# Add mount to /etc/fstab and mount /data
```
4. **Clone repo and create .env**:
```bash
# <install-dir> is a placeholder: substitute your real install path before running
git clone git@github.com:keboola/agnes-the-ai-analyst.git <install-dir>
cd <install-dir>
cp config/.env.template .env
# Fill in secrets from GitHub Secrets / 1Password
```
5. **Start the stack**:
```bash
docker compose up -d
```
6. **Update DNS** if the external IP changed:
- A record for `your-instance.example.com`
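Step 3 leaves the mount itself as a comment. The sketch below shows one way to produce the `/etc/fstab` entry, assuming an ext4 filesystem and the `discard,defaults,nofail` options commonly used for persistent disks; `fstab_entry` is a hypothetical helper and the device path in the example is an assumption to verify with `lsblk`.

```bash
#!/usr/bin/env bash
# Build an /etc/fstab entry that mounts a filesystem by UUID.
# Assumptions: ext4 filesystem; "nofail" so the VM still boots if the
# disk is absent.
# Usage: fstab_entry UUID [MOUNT_POINT]
fstab_entry() {
    local uuid="$1" mount_point="${2:-/data}"
    printf 'UUID=%s %s ext4 discard,defaults,nofail 0 2\n' "$uuid" "$mount_point"
}

# Example on the VM (check the device path with `lsblk` first):
#   uuid="$(sudo blkid -s UUID -o value /dev/sdb)"
#   fstab_entry "$uuid" | sudo tee -a /etc/fstab
#   sudo mkdir -p /data && sudo mount /data
```

Mounting by UUID instead of by device path keeps the entry valid if the disk is enumerated under a different name after a reboot.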
## Verification Checklist
After any recovery, verify:
- [ ] `docker compose ps` — all services `Up`
- [ ] `https://your-instance.example.com/health` returns `{"status": "ok"}`
- [ ] Login works (Google OAuth or email magic link)
- [ ] At least one table appears in the data catalog
- [ ] `docker compose logs app` — no ERROR lines at startup
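The health and log items in the checklist can be partly scripted. The following is a minimal sketch, assuming the response body shown above and that error lines contain the literal string `ERROR`; both helper names are hypothetical.

```bash
#!/usr/bin/env bash
# Succeeds if a /health response body reports status "ok".
# Whitespace around the colon is tolerated.
health_body_ok() {
    printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Reads log text on stdin and prints the number of lines containing ERROR.
# grep -c exits non-zero on zero matches, so the status is normalized.
count_error_lines() {
    grep -c 'ERROR' || true
}

# Example:
#   health_body_ok "$(curl -fsS https://your-instance.example.com/health)" && echo "health ok"
#   docker compose logs app | count_error_lines
```

Login and the data catalog still need a manual check in the browser.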
## Preventive Measures
- **GCP snapshots**: Daily automatic snapshots of the `/data` persistent disk
(14-day retention). Configure via:
```bash
gcloud compute resource-policies create snapshot-schedule daily-backup \
--project=your-gcp-project \
--region=europe-north1 \
--max-retention-days=14 \
--on-source-disk-delete=keep-auto-snapshots \
--daily-schedule \
--start-time=03:00
gcloud compute disks add-resource-policies data-disk \
--project=your-gcp-project --zone=europe-north1-a \
--resource-policies=daily-backup
```
- **Secrets in GitHub / 1Password**: `.env` is never committed; recreate from stored secrets
- **Image tags**: Pin a known-good image tag in `docker-compose.yml` before each deploy