agnes-the-ai-analyst/dev_docs/disaster-recovery.md
ZdenekSrotyr 8233c3e3f9 chore(docs): replace stale da verbs and vendor-specific install paths
2026-05-04 21:22:19 +02:00

Disaster Recovery

Recovery procedures for the AI Data Analyst Docker deployment.

Overview

What lives where:
  Docker volumes  /data        DuckDB files, parquet extracts, state
  Git             repo/        Application code — rebuild from GitHub
  .env            secrets      Recreate from GitHub Secrets / 1Password

Key principle: the container is disposable. All unique data lives in the /data Docker volume (or a GCP persistent disk mounted at /data). Re-pulling the image and restoring /data brings the service back to full operation.

Data Layout

  Path                               Content                             Backup
  /data/state/system.duckdb          Table registry, users, sync state   Daily snapshot
  /data/analytics/server.duckdb      Master analytics DB (views)         Regenerated on start
  /data/extracts/*/extract.duckdb    Per-source extract DBs              Daily snapshot
  /data/extracts/*/data/*.parquet    Parquet files (local sources)       Daily snapshot

analytics/server.duckdb is rebuilt automatically by SyncOrchestrator.rebuild() on every startup, so it does not need to be backed up separately.
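
Before choosing a recovery path, it can help to see which parts of this layout are actually present. The following is a minimal sketch, not part of the project: the function name check_data_layout is hypothetical, and it only tests that files exist, not that the DuckDB files are readable.

```shell
# Report which paths from the Data Layout table are present under a data root.
# Usage: check_data_layout [DATA_DIR]   (DATA_DIR defaults to /data)
# Example: check_data_layout /mnt/restored-data
check_data_layout() {
  local data_dir="${1:-/data}"
  local missing=0 extracts=0 f

  # system.duckdb holds the table registry and users.
  if [ -f "$data_dir/state/system.duckdb" ]; then
    echo "ok       state/system.duckdb"
  else
    echo "MISSING  state/system.duckdb"
    missing=1
  fi

  # Count per-source extract databases (one directory per source).
  for f in "$data_dir"/extracts/*/extract.duckdb; do
    if [ -f "$f" ]; then
      extracts=$((extracts + 1))
    fi
  done
  echo "found    $extracts extract database(s)"
  if [ "$extracts" -eq 0 ]; then
    missing=1
  fi

  # server.duckdb is rebuilt on startup, so its absence is only informational.
  if [ -f "$data_dir/analytics/server.duckdb" ]; then
    echo "ok       analytics/server.duckdb"
  else
    echo "absent   analytics/server.duckdb (rebuilt on startup)"
  fi

  return "$missing"
}
```

A non-zero return status means something that needs a backup is missing; a missing analytics/server.duckdb alone does not count as a failure.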

Scenario A: Container Crash / Bad Deploy

Impact: Service down, data intact.

Recovery time: ~2 minutes

# Pull latest image and restart
docker compose pull
docker compose up -d

# Check health
curl https://your-instance.example.com/health
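
A single curl right after the restart can fail simply because the app is still starting. As a rough sketch, the health check can be wrapped in a retry loop. The function name wait_for_health and the defaults (12 attempts, 10 seconds apart, roughly the 2-minute recovery window above) are assumptions, not project settings.

```shell
# Poll the health endpoint until it answers successfully or attempts run out.
# Usage: wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Example: wait_for_health https://your-instance.example.com/health
wait_for_health() {
  local url="$1" attempts="${2:-12}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP errors as failures; --max-time: cap each request at 5 s.
    if curl -fsS --max-time 5 "$url" >/dev/null; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}
```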

If a bad image was pushed, roll back to the previous tag:

docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
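
The manual edit can also be scripted. The sketch below assumes the compose file declares the image on a single line of the form "image: NAME:TAG"; the function name pin_image_tag and the image name in the example are hypothetical, so check them against the real docker-compose.yml before use.

```shell
# Rewrite the tag on the matching "image:" line of a compose file.
# Usage: pin_image_tag COMPOSE_FILE IMAGE TAG
# Example (hypothetical image name):
#   pin_image_tag docker-compose.yml ghcr.io/example/agnes v1.2.3
pin_image_tag() {
  local file="$1" image="$2" tag="$3" tmp
  tmp=$(mktemp) || return 1
  # Match "image: IMAGE" with or without an existing ":tag" suffix.
  if sed -E "s|^([[:space:]]*image:[[:space:]]*${image})(:[^[:space:]]*)?[[:space:]]*\$|\1:${tag}|" \
      "$file" > "$tmp"; then
    # Write back in place so the file keeps its permissions.
    cat "$tmp" > "$file"
  else
    rm -f "$tmp"
    return 1
  fi
  rm -f "$tmp"
}
```

Review the result with git diff before running docker compose up -d, since a line that does not match the assumed layout is left unchanged without warning.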

Scenario B: /data Volume Corruption or Loss

Impact: All DuckDB state and parquet data lost.

Recovery time: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)

Option 1: Restore from GCP disk snapshot (faster)

# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

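# Create new disk from snapshot (if the damaged data-disk still exists,
# delete it first, or use a new name here and in the attach step)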
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to the VM, then mount the disk at /data on the VM (mount step not shown)
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

# Restart containers
docker compose up -d
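
To avoid copying SNAPSHOT_NAME by hand, the lookup can be narrowed to a single name. This is a sketch under the same placeholder project and disk names as above; the function name latest_snapshot is hypothetical.

```shell
# Print the name of the newest snapshot taken from the given disk.
# Usage: latest_snapshot PROJECT DISK_NAME
# Example: SNAPSHOT_NAME=$(latest_snapshot your-gcp-project data-disk)
latest_snapshot() {
  local project="$1" disk="$2" name
  name=$(gcloud compute snapshots list \
    --project="$project" \
    --filter="sourceDisk:$disk" \
    --sort-by=~creationTimestamp \
    --limit=1 \
    --format="value(name)") || return 1
  # An empty result means no snapshot matched the filter.
  if [ -z "$name" ]; then
    echo "no snapshot found for disk $disk" >&2
    return 1
  fi
  echo "$name"
}
```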

Option 2: Regenerate from source

# Start with empty /data volume
docker compose up -d

# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger

A sync repopulates the DuckDB extract files and parquet data from Keboola / BigQuery. It does not recreate user accounts or table definitions, so system.duckdb (table registry, users) must still be restored from a snapshot unless its contents are re-created by other means.

Scenario C: Complete VM Loss

Recovery time: ~20 minutes

  1. Create new VM (or use managed instance group):

    gcloud compute instances create your-server \
      --project=your-gcp-project \
      --zone=europe-north1-a \
      --machine-type=e2-medium \
      --image-family=debian-12 \
      --image-project=debian-cloud
    
  2. Install Docker:

    curl -fsSL https://get.docker.com | sh
    
  3. Attach and mount the data disk (or restore from snapshot per Scenario B):

    gcloud compute instances attach-disk your-server \
      --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
    # Add mount to /etc/fstab and mount /data
    
  4. Clone repo and create .env:

    git clone git@github.com:keboola/agnes-the-ai-analyst.git <install-dir>
    cd <install-dir>
    cp config/.env.template .env
    # Fill in secrets from GitHub Secrets / 1Password
    
  5. Start the stack:

    docker compose up -d
    
  6. Update DNS if the external IP changed:

    • A record for your-instance.example.com
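
Step 3 leaves the fstab entry as a comment. One way it could look is sketched below, assuming an ext4 filesystem and a device path such as /dev/disk/by-id/google-data-disk (this path exists only if the disk was attached with that device name; confirm with lsblk on the VM). The function name ensure_fstab_entry is hypothetical.

```shell
# Add an fstab line for the data disk unless /data is already listed.
# Usage: ensure_fstab_entry DEVICE [FSTAB_FILE]
# Example (run as root on the VM):
#   ensure_fstab_entry /dev/disk/by-id/google-data-disk
#   mkdir -p /data && mount /data
ensure_fstab_entry() {
  local device="$1" fstab="${2:-/etc/fstab}"
  # Skip if some entry already uses /data as its mount point.
  if grep -qs '[[:space:]]/data[[:space:]]' "$fstab"; then
    return 0
  fi
  # nofail keeps the VM bootable if the disk is not attached at boot.
  printf '%s /data ext4 defaults,nofail 0 2\n' "$device" >> "$fstab"
}
```

Passing a copy of /etc/fstab as the second argument gives a dry run without touching the real file.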

Verification Checklist

After any recovery, verify:

  • docker compose ps — all services Up
  • https://your-instance.example.com/health returns {"status": "ok"}
  • Login works (Google OAuth or email magic link)
  • At least one table appears in the data catalog
  • docker compose logs app — no ERROR lines at startup
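
Two of these checks (health response and startup logs) are easy to script; login and the data catalog still need a manual look. The following is a sketch only: the function name verify_recovery is hypothetical, and it assumes the service is named app, as in the checklist.

```shell
# Run the scriptable checks from the verification checklist.
# Usage: verify_recovery HEALTH_URL
# Example: verify_recovery https://your-instance.example.com/health
verify_recovery() {
  local url="$1" failed=0 body errors

  # The health endpoint should answer with "status": "ok".
  body=$(curl -fsS --max-time 5 "$url" 2>/dev/null || true)
  case "$body" in
    *'"status"'*'"ok"'*) echo "PASS health" ;;
    *) echo "FAIL health"; failed=1 ;;
  esac

  # Count ERROR lines in the app service logs (grep -c prints 0 when none match).
  errors=$(docker compose logs --no-color app 2>&1 | grep -c 'ERROR' || true)
  if [ "$errors" -eq 0 ]; then
    echo "PASS logs"
  else
    echo "FAIL logs ($errors ERROR line(s))"
    failed=1
  fi

  return "$failed"
}
```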

Preventive Measures

  • GCP snapshots: Daily automatic snapshots of the /data persistent disk (14-day retention; --start-time is in UTC). Configure via:
    gcloud compute resource-policies create snapshot-schedule daily-backup \
      --project=your-gcp-project \
      --region=europe-north1 \
      --max-retention-days=14 \
      --on-source-disk-delete=keep-auto-snapshots \
      --daily-schedule \
      --start-time=03:00
    gcloud compute disks add-resource-policies data-disk \
      --project=your-gcp-project --zone=europe-north1-a \
      --resource-policies=daily-backup
    
  • Secrets in GitHub / 1Password: .env is never committed; recreate from stored secrets
  • Image tags: Pin a known-good image tag in docker-compose.yml before each deploy
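
A snapshot schedule only helps if it is still attached to the disk. As a hedged sketch, the attachment can be checked from the disk description. The function name disk_has_policy is hypothetical, and the match assumes gcloud reports policies as resource URLs ending in /resourcePolicies/NAME.

```shell
# Succeed if the named snapshot schedule is attached to the disk.
# Usage: disk_has_policy PROJECT ZONE DISK POLICY
# Example: disk_has_policy your-gcp-project europe-north1-a data-disk daily-backup
disk_has_policy() {
  local project="$1" zone="$2" disk="$3" policy="$4" attached
  attached=$(gcloud compute disks describe "$disk" \
    --project="$project" --zone="$zone" \
    --format="value(resourcePolicies)") || return 1
  # Look for a policy URL whose last path segment is exactly POLICY.
  printf '%s\n' "$attached" |
    grep -Eq "/resourcePolicies/${policy}(\$|[;,[:space:]])"
}
```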