* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88) Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. Splitting them per concern. Moved (still in OSS, just under a vendor-neutral name): - scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh - scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh Removed (belongs in private consumer infra repos, not upstream OSS): - scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email) - scripts/grpn/README.md (GRPN hackathon deploy walkthrough) - docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log) Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers. This is the first wave of #88 — the remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR. Refs #88. * chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes) PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass — easier to review than two PRs. Wave-1 review-fix corrections: - Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile). - CHANGELOG bullet rewritten — original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only. - infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders. Wave-2 anonymization: - Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel. - Test fixtures (4 files): same set of replacements — 157 tests still pass. - .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs". - docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder. - 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*: hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-…. - scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern. Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries). CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected. Refs review of #94, closes #88. * fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG) Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed: 1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution. 2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix. 3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos). 4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated. 5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets. Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai| prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass. * fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo) Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes — the audit was right, the previous grep was not paranoid enough): 1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'. 2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh' in operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path. 3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. Trivial wording fix. Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/ Caddyfile/.toml/.template/.example/.env* with the full token set (groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61| kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding CHANGELOG.md historical entries. * fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md Devin Review caught two findings on the latest round-3 commit: 1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents. 2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename. * fix(oss): redact remaining hardcoded IPs from planning docs + remove default email Devin Review caught two more leaks: 1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set — safer than carrying any specific identity. 2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with <redacted-ip>, preserving private/loopback/IAP ranges.
848 lines
34 KiB
Markdown
848 lines
34 KiB
Markdown
# Hackathon E2E Dry-Run Plan
|
||
|
||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||
|
||
**Goal:** Validate the full developer→dev-VM→merge→prod flow end-to-end the day before a multi-developer hackathon, so any broken link is found and fixed before participants arrive.
|
||
|
||
**Architecture:** This is an operational dry-run, not a code feature. The executing agent pushes a throwaway feature branch to the public repo, verifies that CI produces a per-branch Docker image tag on GHCR, switches the shared `agnes-dev` VM onto that tag via the existing auto-upgrade cron, verifies that the CI test gate blocks a deliberately-broken PR from reaching `:stable`, and produces a helper script + report. The plan is **strictly non-destructive for prod** — prod-pinning (point 6 of the original outline) is explicitly out of scope and left to the user.
|
||
|
||
**Tech Stack:** Bash / `gcloud` / `gh` / `git` / `docker` / `curl` / Python (`pytest`) / Terraform (plan only, no apply). No app code changes.
|
||
|
||
---
|
||
|
||
## Out of Scope (do NOT do)
|
||
|
||
- Any `terraform apply` against real infrastructure. TF `plan` is allowed; TF `apply` is forbidden.
|
||
- Pinning `prod_instance.image_tag` in `agnes-infra-keboola`. User will do this themselves after the dry-run succeeds.
|
||
- Rotating admin passwords, Keboola tokens, or JWT secrets.
|
||
- Modifying `main` branch of any repo. All changes happen on throwaway branches, which are deleted at the end.
|
||
- Creating new GCP resources (VMs, disks, IPs, secrets, SAs).
|
||
|
||
If any step would require doing one of the above, **STOP and ask the user**.
|
||
|
||
---
|
||
|
||
## Prerequisites
|
||
|
||
Before starting, the executing agent MUST verify all of the following. If any fails, abort and report which prerequisite is missing — do NOT try to fix it.
|
||
|
||
- [ ] **Working directory** is the `tmp_oss` checkout at `/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss`. Current branch can be anything; the plan will create a new branch.
|
||
|
||
- [ ] **`gh auth status`** shows authenticated, with `workflow` scope. Run:
|
||
|
||
```bash
|
||
gh auth status 2>&1 | grep -E "(Logged in|Token scopes)"
|
||
```
|
||
|
||
Expected: line containing `Logged in to github.com` and a line listing scopes that include `workflow`. If `workflow` scope is missing, abort with message: `Run: gh auth refresh -h github.com -s workflow`.
|
||
|
||
- [ ] **`gcloud` authenticated** to project `internal-prod`. Run:
|
||
|
||
```bash
|
||
gcloud config get-value project
|
||
gcloud auth list --filter=status:ACTIVE --format="value(account)"
|
||
```
|
||
|
||
Expected: project is `internal-prod`, at least one active account. If not, abort with message: `Run: gcloud config set project internal-prod && gcloud auth login`.
|
||
|
||
- [ ] **SSH to `agnes-dev` works** (OS Login). Run:
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --command="echo ok" --quiet
|
||
```
|
||
|
||
Expected: output contains `ok`. First connection may take ~20s while OS Login provisions. If fails with permission error, abort with message: `User needs compute.osLogin role on agnes-dev VM`.
|
||
|
||
- [ ] **`docker` CLI available** locally (for `docker manifest inspect`). Run: `docker --version`. Expected: version output. If missing, abort.
|
||
|
||
- [ ] **Public GHCR pull works**. Run:
|
||
|
||
```bash
|
||
docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:stable > /dev/null && echo ok
|
||
```
|
||
|
||
Expected: `ok`. If fails, abort — something is wrong with public image visibility.
|
||
|
||
- [ ] **Clone of `agnes-infra-keboola` exists or can be cloned** at `/tmp/agnes-infra-keboola`. Run:
|
||
|
||
```bash
|
||
if [ ! -d /tmp/agnes-infra-keboola ]; then
|
||
gh repo clone keboola/agnes-infra-keboola /tmp/agnes-infra-keboola
|
||
fi
|
||
cd /tmp/agnes-infra-keboola && git status --short
|
||
```
|
||
|
||
Expected: clone succeeds, `git status` is clean. If clone fails, skip Task 4 (TF plan verification) and note it in the final report.
|
||
|
||
**Gate:** All 7 prerequisite checks pass, OR the agent has clearly reported which ones failed and reduced scope accordingly. Only then proceed to Task 1.
|
||
|
||
---
|
||
|
||
## Task 1: Baseline Snapshot
|
||
|
||
**Purpose:** Record the current state of both VMs and the TF outputs so the agent can detect drift at the end and prove it left everything as it found it.
|
||
|
||
**Files:**
|
||
- Create: `/tmp/dryrun-baseline/prod-health.json`
|
||
- Create: `/tmp/dryrun-baseline/dev-health.json`
|
||
- Create: `/tmp/dryrun-baseline/prod-image.txt`
|
||
- Create: `/tmp/dryrun-baseline/dev-image.txt`
|
||
- Create: `/tmp/dryrun-baseline/dev-env.txt`
|
||
|
||
- [ ] **Step 1.1: Create baseline directory**
|
||
|
||
```bash
|
||
mkdir -p /tmp/dryrun-baseline
|
||
```
|
||
|
||
- [ ] **Step 1.2: Capture prod health**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health.json
|
||
cat /tmp/dryrun-baseline/prod-health.json | python3 -m json.tool
|
||
```
|
||
|
||
Expected: JSON with `"status"` field equal to `"healthy"` or `"degraded"`. If `"unhealthy"` or curl times out, abort with message: `Prod is not in acceptable baseline state — investigate before dry-run`.
|
||
|
||
- [ ] **Step 1.3: Capture dev health**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<dev-vm-ip>:8000/api/health > /tmp/dryrun-baseline/dev-health.json
|
||
cat /tmp/dryrun-baseline/dev-health.json | python3 -m json.tool
|
||
```
|
||
|
||
Expected: JSON with `"status"` in `{healthy, degraded}`. Same abort condition as 1.2.
|
||
|
||
- [ ] **Step 1.4: Capture current image tags on both VMs**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-prod --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
|
||
> /tmp/dryrun-baseline/prod-image.txt
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
|
||
> /tmp/dryrun-baseline/dev-image.txt
|
||
cat /tmp/dryrun-baseline/prod-image.txt /tmp/dryrun-baseline/dev-image.txt
|
||
```
|
||
|
||
Expected: each file contains exactly one line like `ghcr.io/keboola/agnes-the-ai-analyst:stable` or `:stable-2026.04.XX`. Non-empty.
|
||
|
||
- [ ] **Step 1.5: Capture `agnes-dev` `.env` AGNES_TAG line**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"sudo grep -E '^AGNES_TAG=' /data/.env || echo 'AGNES_TAG_NOT_SET'" \
|
||
> /tmp/dryrun-baseline/dev-env.txt
|
||
cat /tmp/dryrun-baseline/dev-env.txt
|
||
```
|
||
|
||
Expected: output is `AGNES_TAG=dev` or similar. Record exact value for restoration in Task 6. If `AGNES_TAG_NOT_SET`, abort — the VM is in an unknown config state.
|
||
|
||
- [ ] **Step 1.6: Record baseline to report buffer**
|
||
|
||
Append to a running report at `/tmp/dryrun-report.md` (create if not exists):
|
||
|
||
```bash
|
||
cat > /tmp/dryrun-report.md <<EOF
|
||
# Hackathon Dry-Run Report
|
||
|
||
**Run at:** $(date -u +"%Y-%m-%dT%H:%M:%SZ")
|
||
|
||
## Baseline (Task 1)
|
||
|
||
- Prod health status: $(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
|
||
- Dev health status: $(jq -r '.status' /tmp/dryrun-baseline/dev-health.json)
|
||
- Prod image: $(cat /tmp/dryrun-baseline/prod-image.txt)
|
||
- Dev image: $(cat /tmp/dryrun-baseline/dev-image.txt)
|
||
- Dev AGNES_TAG: $(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
|
||
EOF
|
||
cat /tmp/dryrun-report.md
|
||
```
|
||
|
||
Expected: report file exists, all fields populated (no empty values).
|
||
|
||
**Task 1 gate:** baseline directory has 5 non-empty files, report has 5 non-empty bullet lines. Proceed.
|
||
|
||
---
|
||
|
||
## Task 2: Verify Per-Branch GHCR Build
|
||
|
||
**Purpose:** Push a throwaway feature branch to the public repo, wait for the release workflow, and confirm that the per-branch `:dev-<slug>` tag appears on GHCR.
|
||
|
||
**Files:**
|
||
- Create (throwaway): branch `feature/hack-dryrun-<timestamp>` in `tmp_oss` + one trivial commit touching `docs/QUICKSTART.md`
|
||
|
||
**Branch naming:** the agent MUST use `feature/hack-dryrun-<epoch>` (e.g. `feature/hack-dryrun-1745254321`) so the slug is unique per run and cleanup is deterministic.
|
||
|
||
- [ ] **Step 2.1: Compute branch name and expected slug**
|
||
|
||
Per `.github/workflows/release.yml:92-98` logic: strip `feature/` prefix, sanitise `[^a-zA-Z0-9-]` to `-`, lowercase, cut 50 chars.
|
||
|
||
```bash
|
||
EPOCH=$(date +%s)
|
||
BRANCH="feature/hack-dryrun-${EPOCH}"
|
||
SLUG=$(echo "$BRANCH" | sed 's|^feature/||' | sed 's|[^a-zA-Z0-9-]|-|g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)
|
||
echo "BRANCH=$BRANCH"
|
||
echo "SLUG=$SLUG"
|
||
echo "EXPECTED_TAG=ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
|
||
# Persist for later steps
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/branch-name.txt
|
||
echo "$SLUG" > /tmp/dryrun-baseline/slug.txt
|
||
```
|
||
|
||
Expected: BRANCH like `feature/hack-dryrun-1745254321`, SLUG like `hack-dryrun-1745254321`. Persisted.
|
||
|
||
- [ ] **Step 2.2: Create branch with trivial commit**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
# Save current branch so we can return
|
||
git rev-parse --abbrev-ref HEAD > /tmp/dryrun-baseline/starting-branch.txt
|
||
BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
git checkout -b "$BRANCH"
|
||
echo "<!-- dryrun $(date -u +%FT%TZ) -->" >> docs/QUICKSTART.md
|
||
git add docs/QUICKSTART.md
|
||
git commit -m "dryrun: verify per-branch GHCR tag"
|
||
git push -u origin "$BRANCH"
|
||
```
|
||
|
||
Expected: branch created, one commit, push succeeds with upstream tracking. If push is rejected (e.g. protection), abort.
|
||
|
||
- [ ] **Step 2.3: Wait for release workflow to complete**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
# Get the most recent run id for this branch + workflow
|
||
sleep 10 # give GH a moment to register the run
|
||
RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
|
||
echo "Watching run $RUN_ID"
|
||
gh run watch "$RUN_ID" --exit-status --interval 15
|
||
echo "Workflow exit: $?"
|
||
```
|
||
|
||
Expected: exit status 0 after ~3-5 min. If exit != 0, print the logs:
|
||
|
||
```bash
|
||
gh run view "$RUN_ID" --log-failed | tail -100
|
||
```
|
||
|
||
and abort with message: `Release workflow failed for throwaway branch — investigate before hackathon`.
|
||
|
||
- [ ] **Step 2.4: Verify per-branch tag exists on GHCR**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
EXPECTED="ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
|
||
docker manifest inspect "$EXPECTED" > /tmp/dryrun-baseline/ghcr-manifest.json
|
||
DIGEST=$(jq -r '.config.digest // .manifests[0].digest' /tmp/dryrun-baseline/ghcr-manifest.json)
|
||
echo "Tag exists: $EXPECTED"
|
||
echo "Digest: $DIGEST"
|
||
echo "$DIGEST" > /tmp/dryrun-baseline/expected-digest.txt
|
||
```
|
||
|
||
Expected: `docker manifest inspect` returns JSON (exit 0), a non-empty digest is extracted. If the tag is missing, abort with message: `release.yml did not produce :dev-<slug> tag — check build-and-push step logs`.
|
||
|
||
- [ ] **Step 2.5: Record Task 2 result**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 2: Per-Branch GHCR Build — PASS
|
||
|
||
- Branch: $(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
- Slug: $SLUG
|
||
- Tag: ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG
|
||
- Digest: $(cat /tmp/dryrun-baseline/expected-digest.txt)
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 2 gate:** `:dev-<slug>` manifest exists. Proceed.
|
||
|
||
---
|
||
|
||
## Task 3: Dev VM Switch Flow
|
||
|
||
**Purpose:** Simulate the hackathon developer path — have the shared `agnes-dev` VM pick up the per-branch image via the existing auto-upgrade cron, verify the new image is running, then (in Task 6) roll back.
|
||
|
||
**Files touched (reversibly):**
|
||
- `/data/.env` on `agnes-dev` VM — one-line `AGNES_TAG=` change (rollback is captured in baseline from Step 1.5)
|
||
|
||
- [ ] **Step 3.1: Switch `agnes-dev` `.env` AGNES_TAG to the per-branch tag**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
NEW_TAG="dev-$SLUG"
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\
|
||
sudo cp /data/.env /data/.env.dryrun-bak && \
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$NEW_TAG|' /data/.env && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env"
|
||
```
|
||
|
||
Expected: final line is `AGNES_TAG=dev-<slug>`. If sed didn't match (no `AGNES_TAG=` line existed), abort and manually investigate.
|
||
|
||
- [ ] **Step 3.2: Trigger auto-upgrade cron script immediately**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -30"
|
||
```
|
||
|
||
Expected: output shows `docker compose pull` + `docker compose up -d` activity. If the script doesn't exist or errors, abort with message: `auto-upgrade script missing or broken on agnes-dev`.
|
||
|
||
- [ ] **Step 3.3: Wait for app container to become healthy**
|
||
|
||
```bash
|
||
# Poll /api/health for up to 90s
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo "down")
|
||
echo "[$i/30] status=$STATUS"
|
||
if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
|
||
break
|
||
fi
|
||
sleep 3
|
||
done
|
||
[ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] || { echo "FAIL: dev never healthy"; exit 1; }
|
||
```
|
||
|
||
Expected: reaches `healthy`/`degraded` within 90s.
|
||
|
||
- [ ] **Step 3.4: Verify the running image is the per-branch one**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
EXPECTED_DIGEST=$(cat /tmp/dryrun-baseline/expected-digest.txt)
|
||
RUNNING_IMAGE=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Image}}'")
|
||
echo "Running image digest: $RUNNING_IMAGE"
|
||
# The running image line will be sha256:xxxxx. Compare to the manifest digest we recorded.
|
||
# They should match (or differ only by multi-arch manifest indirection — compare via docker inspect on remote)
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}' && \
|
||
docker image inspect \$(docker ps -qf name=app --format '{{.Image}}' | head -1) --format '{{.RepoTags}}{{.RepoDigests}}'"
|
||
```
|
||
|
||
Expected: `RepoTags` or `RepoDigests` output includes either `:dev-$SLUG` or the digest from Step 2.4. If neither matches, the cron didn't pull the new tag — record as FAIL and continue (cleanup is still required).
|
||
|
||
- [ ] **Step 3.5: Record Task 3 result**
|
||
|
||
The agent must judge PASS/FAIL based on Step 3.4 output: PASS iff `RepoTags` or `RepoDigests` contained `:dev-$SLUG` or the digest captured in Step 2.4.
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
# Replace <RESULT> with PASS or FAIL based on the Step 3.4 output the agent observed.
|
||
# Replace <IMAGE_OUTPUT> with the RepoTags/RepoDigests line from Step 3.4.
|
||
# Replace <SECONDS> with the loop iteration count from Step 3.3 × 3.
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 3: Dev VM Switch — <RESULT>
|
||
|
||
- Switched agnes-dev to AGNES_TAG=dev-$SLUG
|
||
- Health after switch: reached healthy/degraded within 90s
|
||
- Running image: <IMAGE_OUTPUT>
|
||
- Time from cron trigger to healthy: <SECONDS>s
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 3 gate:** health reached OK state; running image verified. Proceed even if image verification was inconclusive — rollback still required.
|
||
|
||
---
|
||
|
||
## Task 4: Terraform Plan Verification (Private Repo)
|
||
|
||
**Purpose:** Validate that adding a new entry to `dev_instances` produces a clean `terraform plan` (not apply) in `agnes-infra-keboola`. This proves the TF module accepts the variable shape the hackathon docs will recommend.
|
||
|
||
**Skip condition:** If prerequisites check found that `/tmp/agnes-infra-keboola` clone failed, skip this entire task and record `SKIPPED — repo unavailable` in the report.
|
||
|
||
**Files touched (throwaway branch only):**
|
||
- `/tmp/agnes-infra-keboola/terraform/terraform.tfvars` (throwaway edit)
|
||
|
||
- [ ] **Step 4.1: Create throwaway branch in private repo**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git checkout main
|
||
git pull
|
||
EPOCH=$(date +%s)
|
||
BRANCH="dryrun-tfplan-${EPOCH}"
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/tf-branch.txt
|
||
git checkout -b "$BRANCH"
|
||
```
|
||
|
||
Expected: clean checkout of main, new branch created.
|
||
|
||
- [ ] **Step 4.2: Add throwaway dev_instance entry**
|
||
|
||
Read `terraform/terraform.tfvars` first to understand the current `dev_instances` shape. Then append a new entry.
|
||
|
||
The `dev_instances` variable schema (from `infra/modules/customer-instance/variables.tf:41-49`) is:
|
||
```hcl
|
||
list(object({
|
||
name = string
|
||
machine_type = optional(string, "e2-small")
|
||
image_tag = optional(string, "dev")
|
||
}))
|
||
```
|
||
|
||
Modify the `dev_instances` list to append:
|
||
```hcl
|
||
{ name = "agnes-hack-dryrun", image_tag = "dev-<slug-from-task-2>" }
|
||
```
|
||
|
||
The agent should detect the current tfvars format and insert accordingly. If the file does not already contain `dev_instances`, abort and report format-mismatch.
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
# Show current tfvars for context
|
||
cat /tmp/agnes-infra-keboola/terraform/terraform.tfvars | grep -A 20 "dev_instances"
|
||
# Agent must edit the file to add the new entry — use the Edit tool rather than sed to be safe.
|
||
```
|
||
|
||
After editing, show the diff:
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git diff terraform/terraform.tfvars
|
||
```
|
||
|
||
Expected: diff adds exactly one new entry to `dev_instances` list with `name = "agnes-hack-dryrun"` and `image_tag = "dev-<slug>"`.
|
||
|
||
- [ ] **Step 4.3: Run `terraform plan` locally (no apply)**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola/terraform
|
||
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.agnes-keys/agnes-deploy-internal-prod-key.json"
|
||
[ -f "$GOOGLE_APPLICATION_CREDENTIALS" ] || { echo "SA key not found — skipping plan"; exit 2; }
|
||
terraform init -input=false -upgrade=false
|
||
terraform plan -input=false -no-color -out=/tmp/dryrun-tfplan.bin > /tmp/dryrun-tfplan.txt 2>&1
|
||
RC=$?
|
||
echo "terraform plan exit: $RC"
|
||
tail -40 /tmp/dryrun-tfplan.txt
|
||
```
|
||
|
||
Expected:
|
||
- exit 0 or 2 (2 = changes detected, which is what we want)
|
||
- output ends with `Plan: N to add, M to change, K to destroy.` where `N >= 1` (at least the new VM + disk + IP) and `K == 0` (we must NOT be destroying anything)
|
||
|
||
If `K > 0` or `terraform plan` errors, abort and DO NOT proceed to Step 4.4. Report the plan output verbatim in the final report.
|
||
|
||
- [ ] **Step 4.4: Discard throwaway branch (no push, no apply)**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git checkout main
|
||
BRANCH=$(cat /tmp/dryrun-baseline/tf-branch.txt)
|
||
git branch -D "$BRANCH"
|
||
# Branch was never pushed, so nothing to clean up remotely.
|
||
```
|
||
|
||
Expected: branch deleted locally, main is current, working tree clean.
|
||
|
||
- [ ] **Step 4.5: Record Task 4 result**
|
||
|
||
```bash
|
||
ADDS=$(grep -E "Plan:" /tmp/dryrun-tfplan.txt | head -1)
|
||
DESTROYS_OK=$(grep -E "Plan:.*0 to destroy" /tmp/dryrun-tfplan.txt && echo yes || echo no)
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 4: TF Plan for New Dev VM — <PASS|SKIPPED|FAIL>
|
||
|
||
- Plan summary: $ADDS
|
||
- Zero destroys: $DESTROYS_OK
|
||
- Full plan output: see /tmp/dryrun-tfplan.txt
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 4 gate:** plan produced with 0 destroys and ≥1 add. Proceed.
|
||
|
||
---
|
||
|
||
## Task 5: Verify Smoke-Test Gate Blocks Broken PR
|
||
|
||
**Purpose:** Confirm that a pull request with a deliberately-failing test does NOT produce a passing CI — which is the safety net that keeps `:stable` from auto-promoting broken images to prod.
|
||
|
||
**Files touched (throwaway branch only):**
|
||
- `tests/test_dryrun_should_fail.py` (new file on throwaway branch)
|
||
|
||
**Important:** This task creates a PR (not a merge). The PR is closed without merging in Step 5.5.
|
||
|
||
- [ ] **Step 5.1: Create throwaway branch with failing test**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
git checkout main
|
||
git pull
|
||
EPOCH=$(date +%s)
|
||
BRANCH="dryrun-break-smoke-${EPOCH}"
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/smoke-branch.txt
|
||
git checkout -b "$BRANCH"
|
||
cat > tests/test_dryrun_should_fail.py <<'PYEOF'
|
||
def test_intentional_fail_for_dryrun():
|
||
"""Intentional failure to verify CI gate blocks broken PRs. Remove after dryrun."""
|
||
assert False, "dryrun: this test is supposed to fail"
|
||
PYEOF
|
||
git add tests/test_dryrun_should_fail.py
|
||
git commit -m "dryrun: intentional failing test (will be reverted)"
|
||
git push -u origin "$BRANCH"
|
||
```
|
||
|
||
Expected: push succeeds.
|
||
|
||
- [ ] **Step 5.2: Open PR**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
PR_URL=$(gh pr create --title "dryrun: verify CI gate (DO NOT MERGE)" \
|
||
--body "Intentionally failing test to verify CI blocks bad merges. Will be closed immediately after CI result." \
|
||
--base main)
|
||
echo "$PR_URL" > /tmp/dryrun-baseline/pr-url.txt
|
||
echo "Opened: $PR_URL"
|
||
```
|
||
|
||
Expected: PR URL returned.
|
||
|
||
- [ ] **Step 5.3: Wait for CI `test` job to complete (expected: FAIL)**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
|
||
sleep 15
|
||
RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
|
||
echo "Watching run $RUN_ID (expected to FAIL)"
|
||
# Use --exit-status WITHOUT `set -e`; we expect non-zero
|
||
set +e
|
||
gh run watch "$RUN_ID" --exit-status --interval 15
|
||
EXIT=$?
|
||
set -e
|
||
echo "Exit code: $EXIT (non-zero is EXPECTED here)"
|
||
```
|
||
|
||
Expected: exit code != 0. If exit code IS 0, that means CI passed despite `assert False` → the test suite is not being run, or the file was excluded → record as **FAIL — CI gate broken**.
|
||
|
||
- [ ] **Step 5.4: Verify PR mergeability check shows failure**
|
||
|
||
```bash
|
||
PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
|
||
PR_NUM=$(basename "$PR_URL")
|
||
STATE=$(gh pr view "$PR_NUM" --json statusCheckRollup --jq '.statusCheckRollup[] | select(.name=="test") | .conclusion')
|
||
echo "test job conclusion: $STATE"
|
||
```
|
||
|
||
Expected: `FAILURE`. If `SUCCESS`, the gate is broken.
|
||
|
||
- [ ] **Step 5.5: Close PR and delete branch**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
|
||
PR_NUM=$(basename "$PR_URL")
|
||
gh pr close "$PR_NUM" --delete-branch --comment "dryrun complete — CI gate verified, closing without merge"
|
||
# Also delete locally
|
||
git checkout main
|
||
BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
|
||
git branch -D "$BRANCH" 2>/dev/null || true
|
||
```
|
||
|
||
Expected: PR closed, local branch gone.
|
||
|
||
- [ ] **Step 5.6: Check whether `main` has required status checks configured**
|
||
|
||
```bash
|
||
gh api repos/keboola/agnes-the-ai-analyst/branches/main/protection 2>/tmp/dryrun-protection-err.txt > /tmp/dryrun-protection.json
|
||
RC=$?
|
||
if [ $RC -ne 0 ]; then
|
||
echo "No branch protection on main (or insufficient permissions to read it)"
|
||
cat /tmp/dryrun-protection-err.txt
|
||
PROTECTION_NOTE="NONE — branch is unprotected; broken PRs can be merged. Recommend adding 'test' as required status check."
|
||
else
|
||
REQUIRED=$(jq -r '.required_status_checks.contexts[]?' /tmp/dryrun-protection.json 2>/dev/null | tr '\n' ',')
|
||
echo "Required checks: $REQUIRED"
|
||
if echo "$REQUIRED" | grep -q "test"; then
|
||
PROTECTION_NOTE="OK — 'test' is required."
|
||
else
|
||
PROTECTION_NOTE="PARTIAL — protection exists but 'test' is not required. Contexts: $REQUIRED"
|
||
fi
|
||
fi
|
||
echo "$PROTECTION_NOTE" > /tmp/dryrun-baseline/protection-note.txt
|
||
```
|
||
|
||
Expected: note written. Does not abort — informational only.
|
||
|
||
- [ ] **Step 5.7: Record Task 5 result**
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 5: CI Gate — <PASS|FAIL>
|
||
|
||
- Throwaway PR: $(cat /tmp/dryrun-baseline/pr-url.txt) (closed)
|
||
- CI 'test' job result on broken code: <FAILURE expected>
|
||
- Branch protection on main: $(cat /tmp/dryrun-baseline/protection-note.txt)
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 5 gate:** broken PR's CI status is FAILURE. Proceed. If `PROTECTION_NOTE` says NONE/PARTIAL, the final report must flag this as a **hackathon-blocking recommendation**.
|
||
|
||
---
|
||
|
||
## Task 6: Cleanup and Baseline Restoration
|
||
|
||
**Purpose:** Leave the system in exactly the state recorded in Task 1. This is the most important task — a dirty dry-run poisons the hackathon.
|
||
|
||
- [ ] **Step 6.1: Restore `agnes-dev` AGNES_TAG**
|
||
|
||
```bash
|
||
ORIG_LINE=$(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
# ORIG_LINE looks like: AGNES_TAG=dev
|
||
ORIG_VALUE=$(echo "$ORIG_LINE" | cut -d= -f2-)
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$ORIG_VALUE|' /data/.env && \
|
||
sudo rm -f /data/.env.dryrun-bak && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env && \
|
||
sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -20"
|
||
```
|
||
|
||
Expected: AGNES_TAG line matches original, auto-upgrade pulls back to the original tag.
|
||
|
||
- [ ] **Step 6.2: Wait for dev VM to return to healthy state on original tag**
|
||
|
||
```bash
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo down)
|
||
echo "[$i/30] status=$STATUS"
|
||
[ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] && break
|
||
sleep 3
|
||
done
|
||
```
|
||
|
||
Expected: reaches healthy/degraded within 90s.
|
||
|
||
- [ ] **Step 6.3: Verify running image matches baseline**
|
||
|
||
```bash
|
||
RESTORED=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'")
|
||
ORIG=$(cat /tmp/dryrun-baseline/dev-image.txt)
|
||
echo "Restored: $RESTORED"
|
||
echo "Original: $ORIG"
|
||
[ "$RESTORED" = "$ORIG" ] && echo MATCH || echo "MISMATCH — investigate"
|
||
```
|
||
|
||
Expected: MATCH. If MISMATCH, the baseline-tag digest may have advanced (auto-upgrade pulled newer `:stable`/`:dev` floating image during the run) — that is acceptable as long as the `.Config.Image` *tag* matches. Record exact difference in report.
|
||
|
||
- [ ] **Step 6.4: Delete throwaway branches in public repo**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
STARTING=$(cat /tmp/dryrun-baseline/starting-branch.txt)
|
||
git checkout "$STARTING"
|
||
FEAT_BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
SMOKE_BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt 2>/dev/null || echo "")
|
||
# Local delete
|
||
git branch -D "$FEAT_BRANCH" 2>/dev/null || true
|
||
[ -n "$SMOKE_BRANCH" ] && git branch -D "$SMOKE_BRANCH" 2>/dev/null || true
|
||
# Remote delete (smoke branch was already deleted via `gh pr close --delete-branch` in Step 5.5)
|
||
git push origin --delete "$FEAT_BRANCH" 2>/dev/null || echo "(feature branch already gone)"
|
||
```
|
||
|
||
Expected: local branches gone, remote feature branch deleted. QUICKSTART.md commit on throwaway branch vanishes from origin.
|
||
|
||
- [ ] **Step 6.5: Final health check on prod (must match baseline)**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health-after.json
|
||
BEFORE=$(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
|
||
AFTER=$(jq -r '.status' /tmp/dryrun-baseline/prod-health-after.json)
|
||
echo "Prod status before: $BEFORE / after: $AFTER"
|
||
[ "$BEFORE" = "$AFTER" ] && echo UNCHANGED || echo DRIFT
|
||
```
|
||
|
||
Expected: UNCHANGED. (Note: prod was never touched, so this is sanity only.)
|
||
|
||
- [ ] **Step 6.6: Record Task 6 result**
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 6: Cleanup — <PASS|FAIL>
|
||
|
||
- agnes-dev AGNES_TAG restored to: $(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
- agnes-dev health after restore: $(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status')
|
||
- agnes-dev image: matches baseline? <MATCH|MISMATCH — paste both>
|
||
- Throwaway branches deleted: feature, smoke
|
||
- Prod status unchanged: <UNCHANGED|DRIFT>
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 6 gate:** dev VM back on its baseline tag, branches gone, prod untouched.
|
||
|
||
---
|
||
|
||
## Task 7: Generate Deliverables
|
||
|
||
**Purpose:** Produce the artefacts the user needs tomorrow: a helper script for the hackathon team and a consolidated report.
|
||
|
||
**Files:**
|
||
- Create: `scripts/switch-dev-vm.sh` (new)
|
||
- Create (already being built): `/tmp/dryrun-report.md`
|
||
|
||
- [ ] **Step 7.1: Write `scripts/switch-dev-vm.sh`**
|
||
|
||
Create file at `scripts/switch-dev-vm.sh`:
|
||
|
||
```bash
|
||
#!/usr/bin/env bash
|
||
# switch-dev-vm.sh — point the shared hackathon dev VM at the caller's branch image.
|
||
#
|
||
# Usage:
|
||
# scripts/switch-dev-vm.sh <branch-slug>
|
||
# scripts/switch-dev-vm.sh hack-zs-metrics
|
||
#
|
||
# Prerequisite: your branch has been pushed and the release.yml workflow has completed,
|
||
# producing ghcr.io/keboola/agnes-the-ai-analyst:dev-<slug>.
|
||
#
|
||
# The slug is derived from your branch name by stripping the leading "feature/" and
|
||
# replacing non-alphanumeric chars with "-". For branch "feature/hack-zs-metrics" the slug
|
||
# is "hack-zs-metrics".
|
||
set -euo pipefail
|
||
|
||
if [ $# -ne 1 ]; then
|
||
echo "Usage: $0 <branch-slug>" >&2
|
||
echo "Example: $0 hack-zs-metrics" >&2
|
||
exit 2
|
||
fi
|
||
|
||
SLUG="$1"
|
||
VM="agnes-dev"
|
||
ZONE="europe-west1-b"
|
||
TAG="dev-$SLUG"
|
||
IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$TAG"
|
||
|
||
echo "[1/4] Verifying $IMAGE exists on GHCR..."
|
||
docker manifest inspect "$IMAGE" > /dev/null || {
|
||
echo "ERROR: $IMAGE not found on GHCR. Did your release.yml run finish?" >&2
|
||
echo "Check: gh run list --branch feature/$SLUG --workflow release.yml" >&2
|
||
exit 1
|
||
}
|
||
|
||
echo "[2/4] Updating AGNES_TAG on $VM to $TAG..."
|
||
gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command "\
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$TAG|' /data/.env && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env"
|
||
|
||
echo "[3/4] Triggering auto-upgrade..."
|
||
gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command \
|
||
"sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -10"
|
||
|
||
echo "[4/4] Waiting for app to become healthy..."
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","down"))' 2>/dev/null || echo down)
|
||
echo " [$i/30] status=$STATUS"
|
||
if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
|
||
echo "OK — agnes-dev now running $TAG. Open http://<dev-vm-ip>:8000"
|
||
exit 0
|
||
fi
|
||
sleep 3
|
||
done
|
||
echo "ERROR: agnes-dev did not become healthy in 90s. SSH in and check: docker compose logs" >&2
|
||
exit 1
|
||
```
|
||
|
||
```bash
|
||
chmod +x scripts/switch-dev-vm.sh
|
||
bash -n scripts/switch-dev-vm.sh # syntax check
|
||
```
|
||
|
||
Expected: syntax-check passes, file executable.
|
||
|
||
- [ ] **Step 7.2: Commit the script on a fresh branch and open PR**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
git checkout -b feature/hackathon-dryrun-deliverables
|
||
git add scripts/switch-dev-vm.sh
|
||
git commit -m "chore: add switch-dev-vm.sh helper for hackathon"
|
||
git push -u origin HEAD
|
||
gh pr create --title "chore: add switch-dev-vm.sh helper for hackathon" \
|
||
--body "Adds scripts/switch-dev-vm.sh. Produced by the 2026-04-21 hackathon dry-run. Reviewed by user before merge." \
|
||
--base main > /tmp/dryrun-baseline/deliverable-pr.txt
|
||
cat /tmp/dryrun-baseline/deliverable-pr.txt
|
||
```
|
||
|
||
Expected: PR URL. **Do not merge** — leave for user review.
|
||
|
||
- [ ] **Step 7.3: Finalise report with overall verdict**
|
||
|
||
Determine overall verdict by inspecting each Task's PASS/FAIL line in `/tmp/dryrun-report.md`. Overall is PASS only if all tasks PASS (SKIPPED Task 4 is acceptable — note it).
|
||
|
||
Append to report:
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
---
|
||
|
||
## Overall Verdict
|
||
|
||
<PASS | PASS WITH GAPS | FAIL>
|
||
|
||
## Recommendations for the User Before Hackathon Starts
|
||
|
||
1. <If protection-note said NONE/PARTIAL:> Configure required status check 'test' on main branch of keboola/agnes-the-ai-analyst.
|
||
2. Pin prod image_tag in agnes-infra-keboola/terraform/terraform.tfvars from "stable" to "stable-2026.04.XX" (current running version). Revert after hackathon.
|
||
3. Rotate admin password '1234' on prod (<prod-vm-ip>:8000/login) and dev (<dev-vm-ip>:8000/login).
|
||
4. Wire notification_channel_ids in tfvars so uptime alerts actually notify someone.
|
||
5. Share the hackathon 1-pager + switch-dev-vm.sh via the team Slack channel.
|
||
6. Review PR $(cat /tmp/dryrun-baseline/deliverable-pr.txt) and merge if switch-dev-vm.sh looks good.
|
||
|
||
## Artefacts
|
||
|
||
- Full report: /tmp/dryrun-report.md (this file)
|
||
- Baseline snapshots: /tmp/dryrun-baseline/*.{json,txt}
|
||
- TF plan output: /tmp/dryrun-tfplan.txt (if Task 4 ran)
|
||
- Deliverable PR: $(cat /tmp/dryrun-baseline/deliverable-pr.txt)
|
||
|
||
EOF
|
||
cat /tmp/dryrun-report.md
|
||
```
|
||
|
||
Expected: full report printed.
|
||
|
||
- [ ] **Step 7.4: Print final summary to chat**
|
||
|
||
Agent should output, in its final message to the user:
|
||
- Overall verdict (one line)
|
||
- Each task's result (one line each)
|
||
- Any unresolved anomalies
|
||
- Link to deliverable PR
|
||
- Path to full report
|
||
|
||
**Task 7 gate:** report complete, PR open, all artefacts listed.
|
||
|
||
---
|
||
|
||
## Abort / Rollback Procedures
|
||
|
||
If any task fails mid-execution, the agent must still perform Task 6 cleanup before reporting failure. Specifically:
|
||
|
||
- If Task 2 push succeeded but Task 3 failed → still run Task 6 Steps 6.1-6.4 to restore dev VM and delete the branch.
|
||
- If Task 5 PR was opened but workflow didn't finish → close the PR with `gh pr close --delete-branch` and log it.
|
||
- If Task 4 TF plan showed destroys → abort immediately, do NOT attempt apply, record in report, continue to Task 6.
|
||
|
||
If Task 6 itself fails (dev VM won't come back healthy on original tag), the agent must:
|
||
1. Print the baseline values (from `/tmp/dryrun-baseline/dev-env.txt`, `/tmp/dryrun-baseline/dev-image.txt`) so the user can manually SSH and fix.
|
||
2. Attempt `gcloud compute ssh agnes-dev --zone=europe-west1-b --command "docker compose -f /opt/agnes/docker-compose.yml logs --tail 100"` and include output in the report.
|
||
3. Mark overall verdict as FAIL and stop.
|
||
|
||
## What a Successful Run Looks Like
|
||
|
||
- Task 1 baseline: captured with prod+dev healthy/degraded
|
||
- Task 2: GHCR manifest exists for `:dev-hack-dryrun-<epoch>`
|
||
- Task 3: agnes-dev briefly running the per-branch image, healthy within 90s
|
||
- Task 4: `terraform plan` showed `1+ to add, 0 to destroy` (or SKIPPED)
|
||
- Task 5: CI `test` job reported FAILURE on the broken PR, PR closed
|
||
- Task 6: agnes-dev back on its baseline AGNES_TAG, healthy, branches gone
|
||
- Task 7: `scripts/switch-dev-vm.sh` committed on PR for user review, full report in `/tmp/dryrun-report.md`
|
||
- Final agent message: verdict + 6 bullet results + deliverable PR link
|
||
|
||
Duration: ~45-75 minutes, bounded primarily by CI workflow runs (~3-5 min each, two runs) and TF init (~30s-2min cold).
|