ZdenekSrotyr 8233c3e3f9 chore(docs): replace stale da verbs and vendor-specific install paths

Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.

Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.

CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.

2026-05-04 21:22:19 +02:00


Disaster Recovery

Recovery procedures for the AI Data Analyst Docker deployment.

Overview

What lives where:
  Docker volumes  /data        DuckDB files, parquet extracts, state
  Git             repo/        Application code — rebuild from GitHub
  .env            secrets      Recreate from GitHub Secrets / 1Password

Key principle: the container is disposable. All unique data lives in the /data Docker volume (or a GCP persistent disk mounted at /data). Re-pulling the image and restoring /data brings the service back to full operation.

Data Layout

| Path | Content | Backup |
|------|---------|--------|
| `/data/state/system.duckdb` | Table registry, users, sync state | Daily snapshot |
| `/data/analytics/server.duckdb` | Master analytics DB (views) | Regenerated on start |
| `/data/extracts/*/extract.duckdb` | Per-source extract DBs | Daily snapshot |
| `/data/extracts/*/data/*.parquet` | Parquet files (local sources) | Daily snapshot |

analytics/server.duckdb is rebuilt automatically by SyncOrchestrator.rebuild() on every startup, so it does not need to be backed up separately.
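After a restore, the layout above can be checked with a small script. This is a minimal sketch: `check_data_layout` is a hypothetical helper, not part of the project, and it only tests that the backed-up paths from the table exist under a given root directory.

```shell
# Hypothetical helper: report which backed-up paths from the table are missing.
# Usage: check_data_layout /data
check_data_layout() {
  root="${1:-/data}"
  missing=0

  # The table registry and users live here; it is covered by the daily snapshot.
  if [ ! -f "$root/state/system.duckdb" ]; then
    echo "missing: $root/state/system.duckdb"
    missing=1
  fi

  # At least one per-source extract DB is expected after a restore.
  found=0
  for f in "$root"/extracts/*/extract.duckdb; do
    if [ -f "$f" ]; then
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "missing: $root/extracts/*/extract.duckdb"
    missing=1
  fi

  # analytics/server.duckdb is rebuilt on startup, so it is not checked here.
  return "$missing"
}
```

The function prints one line per missing path and returns non-zero if anything is absent, so it can gate a later `docker compose up -d`.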

Scenario A: Container Crash / Bad Deploy

Impact: Service down, data intact.

Recovery time: ~2 minutes

# Pull latest image and restart
docker compose pull
docker compose up -d

# Check health
curl https://your-instance.example.com/health
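A single `curl` right after `docker compose up -d` can fail simply because the app is still starting. The sketch below polls the endpoint until it reports ok. `wait_for_health` is a hypothetical helper, and the retry count and delay are arbitrary defaults, not values taken from the project.

```shell
# Hypothetical helper: poll the health endpoint until it reports ok.
# Usage: wait_for_health https://your-instance.example.com/health 30 2
wait_for_health() {
  url="$1"
  attempts="${2:-30}"   # number of polls before giving up
  delay="${3:-2}"       # seconds between polls

  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -s silences progress output.
    if body=$(curl -fs "$url") &&
       printf '%s' "$body" | grep -q '"status"[[:space:]]*:[[:space:]]*"ok"'; then
      echo "healthy after $((i + 1)) attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done

  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

The non-zero return on timeout makes it usable as the last step of a deploy script.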

If a bad image was pushed, roll back to the previous tag:

docker compose down
# Edit docker-compose.yml to pin the previous image tag
docker compose up -d
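The manual edit in the rollback can be scripted. The sketch below assumes the compose file contains a line of the form `image: <repo>:<tag>`; `pin_image_tag` is a hypothetical helper, and the repository name and tag in the usage line are placeholders.

```shell
# Hypothetical helper: rewrite "image: <repo>:<old-tag>" to a pinned tag.
# Usage: pin_image_tag docker-compose.yml ghcr.io/example/app v1.2.3
pin_image_tag() {
  file="$1"
  repo="$2"
  tag="$3"
  tmp="$file.tmp.$$"

  # Keep a copy so the edit itself can be undone.
  cp "$file" "$file.bak" || return 1

  # Use | as the sed delimiter because image names contain slashes.
  sed "s|\(image:[[:space:]]*\)$repo:[^[:space:]]*|\1$repo:$tag|" "$file" > "$tmp" &&
    mv "$tmp" "$file"
}
```

Writing to a temporary file and moving it into place avoids the differences between GNU and BSD `sed -i`.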

Scenario B: /data Volume Corruption or Loss

Impact: All DuckDB state and parquet data lost.

Recovery time: ~10 minutes (from snapshot) or ~30 minutes (regenerate from source)

Option 1: Restore from GCP disk snapshot (faster)

# Find latest snapshot
gcloud compute snapshots list --project=your-gcp-project \
  --filter="sourceDisk:data-disk" --sort-by=~creationTimestamp --limit=5

# Create new disk from snapshot
gcloud compute disks create data-disk \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --source-snapshot=SNAPSHOT_NAME \
  --type=pd-balanced

# Attach to VM and mount
gcloud compute instances attach-disk your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --disk=data-disk

# Restart containers
docker compose up -d
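The `SNAPSHOT_NAME` placeholder can be filled from the list command instead of being copied by hand. A minimal sketch, assuming the newest snapshot is the one to restore; `latest_snapshot` is a hypothetical helper that reuses the same filter and sort order as above.

```shell
# Hypothetical helper: print the name of the newest snapshot of a disk.
# Usage: SNAPSHOT_NAME=$(latest_snapshot your-gcp-project data-disk)
latest_snapshot() {
  project="$1"
  disk="$2"
  # --format="value(name)" prints only the snapshot name, one per line.
  gcloud compute snapshots list \
    --project="$project" \
    --filter="sourceDisk:$disk" \
    --sort-by="~creationTimestamp" \
    --limit=1 \
    --format="value(name)"
}
```

Disk names are unique within a zone, so if the damaged disk still exists as `data-disk`, the `gcloud compute disks create` step will fail until that disk is detached and deleted or the new disk is given a different name.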

Option 2: Regenerate from source

# Start with empty /data volume
docker compose up -d

# Trigger a full sync from the data source
curl -X POST http://localhost:8000/api/sync/trigger

A full sync repopulates the DuckDB extract files and parquet data from Keboola / BigQuery. It does not recreate user accounts or table definitions, so system.duckdb (table registry, users) must still be restored from a snapshot.

Scenario C: Complete VM Loss

Recovery time: ~20 minutes

Create new VM (or use managed instance group):

gcloud compute instances create your-server \
  --project=your-gcp-project \
  --zone=europe-north1-a \
  --machine-type=e2-medium \
  --image-family=debian-12 \
  --image-project=debian-cloud

Install Docker:
```
curl -fsSL https://get.docker.com | sh
```

Attach and mount the data disk (or restore from snapshot per Scenario B):

gcloud compute instances attach-disk your-server \
  --project=your-gcp-project --zone=europe-north1-a --disk=data-disk
# Add mount to /etc/fstab and mount /data
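The fstab step left as a comment above can be sketched as follows. The filesystem type (`ext4`) and the device path in the usage line (`/dev/sdb`) are assumptions and should be confirmed with `lsblk` on the new VM; `fstab_line` is a hypothetical helper.

```shell
# Hypothetical helper: print an /etc/fstab line for the data disk.
# Usage: fstab_line "$(sudo blkid -s UUID -o value /dev/sdb)" /data ext4
fstab_line() {
  uuid="$1"
  mountpoint="${2:-/data}"
  fstype="${3:-ext4}"
  # nofail lets the VM boot even if the disk is not attached.
  printf 'UUID=%s %s %s discard,defaults,nofail 0 2\n' "$uuid" "$mountpoint" "$fstype"
}
```

The output can be appended with `sudo tee -a /etc/fstab`, followed by `sudo mkdir -p /data` and `sudo mount /data`. Mounting by UUID rather than device path keeps the entry valid if the device letter changes after a reboot.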

Clone repo and create .env:

git clone git@github.com:keboola/agnes-the-ai-analyst.git <install-dir>
cd <install-dir>
cp config/.env.template .env
# Fill in secrets from GitHub Secrets / 1Password
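Before starting the stack, it can help to confirm that no secret was left blank. The sketch below assumes the template uses plain `KEY=value` lines; `missing_env_keys` is a hypothetical helper, not a project command.

```shell
# Hypothetical helper: list template keys that are absent or empty in .env.
# Usage: missing_env_keys config/.env.template .env
missing_env_keys() {
  template="$1"
  envfile="$2"
  status=0
  # Take KEY from each KEY=... line in the template, skipping comments.
  for key in $(sed -n 's/^\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$template"); do
    # A key counts as set only if .env has KEY=<non-empty value>.
    if ! grep -q "^$key=." "$envfile"; then
      echo "$key"
      status=1
    fi
  done
  return "$status"
}
```

It prints each unset key and returns non-zero if any are missing.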

Start the stack:
```
docker compose up -d
```
Update DNS if the external IP changed:
- A record for your-instance.example.com

Verification Checklist

After any recovery, verify:

- docker compose ps shows all services Up
- https://your-instance.example.com/health returns {"status": "ok"}
- Login works (Google OAuth or email magic link)
- At least one table appears in the data catalog
- docker compose logs app shows no ERROR lines at startup
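The log check in the list above is easy to miss by eye on a long startup log. A minimal sketch of an automated version, assuming error lines contain the literal string `ERROR`; `no_error_lines` is a hypothetical helper that reads log text from standard input.

```shell
# Hypothetical helper: succeed only if the log text on stdin has no ERROR lines.
# Usage: docker compose logs app 2>&1 | no_error_lines
no_error_lines() {
  # grep -c prints the number of matching lines; it exits 1 when the count is 0.
  count=$(grep -c 'ERROR' || true)
  if [ "$count" -gt 0 ]; then
    echo "$count ERROR line(s) found" >&2
    return 1
  fi
  echo "no ERROR lines"
}
```

Reading from standard input keeps the helper independent of Docker, so the same check works on a saved log file.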

Preventive Measures

GCP snapshots: Daily automatic snapshots of the /data persistent disk (14-day retention). Configure via:

gcloud compute resource-policies create snapshot-schedule daily-backup \
  --project=your-gcp-project \
  --region=europe-north1 \
  --max-retention-days=14 \
  --on-source-disk-delete=keep-auto-snapshots \
  --daily-schedule \
  --start-time=03:00
gcloud compute disks add-resource-policies data-disk \
  --project=your-gcp-project --zone=europe-north1-a \
  --resource-policies=daily-backup

Secrets in GitHub / 1Password: .env is never committed; recreate from stored secrets
Image tags: Pin a known-good image tag in docker-compose.yml before each deploy
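A snapshot schedule only protects a disk it is attached to, so it is worth confirming the attachment after running the commands above. This sketch assumes `gcloud compute disks describe` lists attached policies as URLs ending in `/resourcePolicies/<name>`; `disk_has_policy` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the named resource policy is attached to the disk.
# Usage: disk_has_policy data-disk europe-north1-a your-gcp-project daily-backup
disk_has_policy() {
  disk="$1"
  zone="$2"
  project="$3"
  policy="$4"
  # Match the policy name as the final path segment of a policy URL.
  gcloud compute disks describe "$disk" \
    --project="$project" --zone="$zone" \
    --format="value(resourcePolicies)" |
    grep -Eq "/resourcePolicies/$policy([^A-Za-z0-9-]|\$)"
}
```

A non-zero return means the schedule exists but is not protecting the disk, or that the disk could not be described.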
