agnes-the-ai-analyst/scripts
Vojtech ddffdfeafd
fix(ops): fail-fast guard in agnes-auto-upgrade — refuse start if config disk not mounted (#146)
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted

Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.

## Why an app-side guard

agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.

The boot-time fixes in infra are preventive. This is the runtime backstop.

## Behavior

Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:

1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
   - Mount the config disk if /data/state is not a mountpoint.
   - Detect mismatch: if /data/state is mounted from the wrong source,
     umount and retry.
2. After the loop, assert findmnt source matches the config disk.
   - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
     marks the service failed; no docker compose action runs; existing
     containers (if any) keep running on stale state, but no new write
     lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
   on every run. Idempotent. Guards against propagation regressions
   sneaking back in via future docker / kernel changes.

VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.

## Tested

Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.

## Rollout

- This file is fetched by infra startup.sh from
  raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
  boot. Once merged to main, all VMs pick up the new script on their
  next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
  `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
   ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
   /usr/local/bin/agnes-auto-upgrade.sh` (already done on
  foundryai-development).

* chore: vendor-agnostic comment + changelog text

Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.

* fix(ops): suppress mount stderr in retry loop

Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.

Devin BUG_0001 on PR #146.

* fix(changelog): move auto-upgrade entry to [Unreleased]

Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.

Devin BUG_0001 on PR #146.

* fix(infra): single-source agnes-auto-upgrade.sh via curl from main

Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).

Devin ANALYSIS_0001 on PR #146.

* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling

The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.

Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.

Devin BUG_0001 + ANALYSIS_0001 on PR #146.

* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing

Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.

Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).

Devin BUG_0001 on PR #146.

* chore(release): cut 0.25.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 20:07:22 +02:00
..
debug chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94) 2026-04-27 20:24:34 +02:00
dev feat(setup): cross-platform TLS bootstrap + marketplace plugin install (#137) 2026-04-30 08:56:45 +02:00
ops fix(ops): fail-fast guard in agnes-auto-upgrade — refuse start if config disk not mounted (#146) 2026-04-30 20:07:22 +02:00
bootstrap-gcp.sh fix(bootstrap): grant monitoring.editor + enable monitoring API 2026-04-21 20:32:50 +02:00
duckdb_manager.py chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94) 2026-04-27 20:24:34 +02:00
fetch-env-from-secrets.sh chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94) 2026-04-27 20:24:34 +02:00
generate_openapi.py feat: multi-instance deployment — all 14 must-have items from spec 2026-04-10 11:57:42 +02:00
generate_sample_data.py feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) 2026-04-29 22:54:21 +02:00
init.sh refactor: final cleanup — delete legacy auth, clean deps, fix hash, migrate to uv 2026-03-31 19:18:30 +02:00
migrate_json_to_duckdb.py feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) 2026-04-29 22:54:21 +02:00
migrate_metrics_to_duckdb.py feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) 2026-04-29 22:54:21 +02:00
migrate_parquets_to_extracts.py feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) 2026-04-29 22:54:21 +02:00
migrate_registry_to_duckdb.py feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) 2026-04-29 22:54:21 +02:00
README.md fix: rewrite Makefile and scripts/README.md 2026-04-09 17:16:04 +02:00
run-local-dev.ps1 feat(dev): add Windows PowerShell wrapper for local development (#80) 2026-04-28 23:59:11 +02:00
run-local-dev.sh fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104) 2026-04-28 19:57:30 +02:00
seed_corporate_memory.py feat(memory): corporate memory v1+v1.5 + 0.15.0 (#72) 2026-04-29 07:16:22 +02:00
seed_dummy_tables.py feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening 2026-04-28 14:25:04 +02:00
smoke-test.sh fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140) 2026-04-30 09:42:27 +02:00
tls-fetch.sh feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51) 2026-04-25 19:51:25 +00:00

Scripts

Utility and migration scripts for Agnes AI Data Analyst.

Active Scripts

Script Purpose
generate_sample_data.py Generate sample data for development/demo
duckdb_manager.py DuckDB database management utilities
init.sh Initial server setup (install deps, create dirs)

Migration Scripts (one-time use)

Script Purpose
migrate_json_to_duckdb.py Migrate v1 JSON state files to DuckDB
migrate_parquets_to_extracts.py Migrate v1 parquet layout to extract.duckdb
migrate_registry_to_duckdb.py Migrate v1 table registry to DuckDB