# Hackathon E2E Dry-Run Plan > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. **Goal:** Validate the full developer→dev-VM→merge→prod flow end-to-end the day before a multi-developer hackathon, so any broken link is found and fixed before participants arrive. **Architecture:** This is an operational dry-run, not a code feature. The executing agent pushes a throwaway feature branch to the public repo, verifies that CI produces a per-branch Docker image tag on GHCR, switches the shared `agnes-dev` VM onto that tag via the existing auto-upgrade cron, verifies that the CI test gate blocks a deliberately-broken PR from reaching `:stable`, and produces a helper script + report. The plan is **strictly non-destructive for prod** — prod-pinning (point 6 of the original outline) is explicitly out of scope and left to the user. **Tech Stack:** Bash / `gcloud` / `gh` / `git` / `docker` / `curl` / Python (`pytest`) / Terraform (plan only, no apply). No app code changes. --- ## Out of Scope (do NOT do) - Any `terraform apply` against real infrastructure. TF `plan` is allowed; TF `apply` is forbidden. - Pinning `prod_instance.image_tag` in `agnes-infra-keboola`. User will do this themselves after the dry-run succeeds. - Rotating admin passwords, Keboola tokens, or JWT secrets. - Modifying `main` branch of any repo. All changes happen on throwaway branches, which are deleted at the end. - Creating new GCP resources (VMs, disks, IPs, secrets, SAs). If any step would require doing one of the above, **STOP and ask the user**. --- ## Prerequisites Before starting, the executing agent MUST verify all of the following. If any fails, abort and report which prerequisite is missing — do NOT try to fix it. - [ ] **Working directory** is the `tmp_oss` checkout at `/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss`. Current branch can be anything; the plan will create a new branch. - [ ] **`gh auth status`** shows authenticated, with `workflow` scope. Run: ```bash gh auth status 2>&1 | grep -E "(Logged in|Token scopes)" ``` Expected: line containing `Logged in to github.com` and a line listing scopes that include `workflow`. If `workflow` scope is missing, abort with message: `Run: gh auth refresh -h github.com -s workflow`. - [ ] **`gcloud` authenticated** to project `kids-ai-data-analysis`. Run: ```bash gcloud config get-value project gcloud auth list --filter=status:ACTIVE --format="value(account)" ``` Expected: project is `kids-ai-data-analysis`, at least one active account. If not, abort with message: `Run: gcloud config set project kids-ai-data-analysis && gcloud auth login`. - [ ] **SSH to `agnes-dev` works** (OS Login). Run: ```bash gcloud compute ssh agnes-dev --zone=europe-west1-b --command="echo ok" --quiet ``` Expected: output contains `ok`. First connection may take ~20s while OS Login provisions. If fails with permission error, abort with message: `User needs compute.osLogin role on agnes-dev VM`. - [ ] **`docker` CLI available** locally (for `docker manifest inspect`). Run: `docker --version`. Expected: version output. If missing, abort. - [ ] **Public GHCR pull works**. Run: ```bash docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:stable > /dev/null && echo ok ``` Expected: `ok`. If fails, abort — something is wrong with public image visibility. - [ ] **Clone of `agnes-infra-keboola` exists or can be cloned** at `/tmp/agnes-infra-keboola`. Run: ```bash if [ ! -d /tmp/agnes-infra-keboola ]; then gh repo clone keboola/agnes-infra-keboola /tmp/agnes-infra-keboola fi cd /tmp/agnes-infra-keboola && git status --short ``` Expected: clone succeeds, `git status` is clean. If clone fails, skip Task 4 (TF plan verification) and note it in the final report. **Gate:** All 7 prerequisite checks pass, OR the agent has clearly reported which ones failed and reduced scope accordingly. Only then proceed to Task 1. --- ## Task 1: Baseline Snapshot **Purpose:** Record the current state of both VMs and the TF outputs so the agent can detect drift at the end and prove it left everything as it found it. **Files:** - Create: `/tmp/dryrun-baseline/prod-health.json` - Create: `/tmp/dryrun-baseline/dev-health.json` - Create: `/tmp/dryrun-baseline/prod-image.txt` - Create: `/tmp/dryrun-baseline/dev-image.txt` - Create: `/tmp/dryrun-baseline/dev-env.txt` - [ ] **Step 1.1: Create baseline directory** ```bash mkdir -p /tmp/dryrun-baseline ``` - [ ] **Step 1.2: Capture prod health** ```bash curl -sf --max-time 10 http://34.77.102.61:8000/api/health > /tmp/dryrun-baseline/prod-health.json cat /tmp/dryrun-baseline/prod-health.json | python3 -m json.tool ``` Expected: JSON with `"status"` field equal to `"healthy"` or `"degraded"`. If `"unhealthy"` or curl times out, abort with message: `Prod is not in acceptable baseline state — investigate before dry-run`. - [ ] **Step 1.3: Capture dev health** ```bash curl -sf --max-time 10 http://34.77.94.14:8000/api/health > /tmp/dryrun-baseline/dev-health.json cat /tmp/dryrun-baseline/dev-health.json | python3 -m json.tool ``` Expected: JSON with `"status"` in `{healthy, degraded}`. Same abort condition as 1.2. - [ ] **Step 1.4: Capture current image tags on both VMs** ```bash gcloud compute ssh agnes-prod --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \ > /tmp/dryrun-baseline/prod-image.txt gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \ > /tmp/dryrun-baseline/dev-image.txt cat /tmp/dryrun-baseline/prod-image.txt /tmp/dryrun-baseline/dev-image.txt ``` Expected: each file contains exactly one line like `ghcr.io/keboola/agnes-the-ai-analyst:stable` or `:stable-2026.04.XX`. Non-empty. - [ ] **Step 1.5: Capture `agnes-dev` `.env` AGNES_TAG line** ```bash gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "sudo grep -E '^AGNES_TAG=' /data/.env || echo 'AGNES_TAG_NOT_SET'" \ > /tmp/dryrun-baseline/dev-env.txt cat /tmp/dryrun-baseline/dev-env.txt ``` Expected: output is `AGNES_TAG=dev` or similar. Record exact value for restoration in Task 6. If `AGNES_TAG_NOT_SET`, abort — the VM is in an unknown config state. - [ ] **Step 1.6: Record baseline to report buffer** Append to a running report at `/tmp/dryrun-report.md` (create if not exists): ```bash cat > /tmp/dryrun-report.md <` tag appears on GHCR. **Files:** - Create (throwaway): branch `feature/hack-dryrun-` in `tmp_oss` + one trivial commit touching `docs/QUICKSTART.md` **Branch naming:** the agent MUST use `feature/hack-dryrun-` (e.g. `feature/hack-dryrun-1745254321`) so the slug is unique per run and cleanup is deterministic. - [ ] **Step 2.1: Compute branch name and expected slug** Per `.github/workflows/release.yml:92-98` logic: strip `feature/` prefix, sanitise `[^a-zA-Z0-9-]` to `-`, lowercase, cut 50 chars. ```bash EPOCH=$(date +%s) BRANCH="feature/hack-dryrun-${EPOCH}" SLUG=$(echo "$BRANCH" | sed 's|^feature/||' | sed 's|[^a-zA-Z0-9-]|-|g' | tr '[:upper:]' '[:lower:]' | cut -c1-50) echo "BRANCH=$BRANCH" echo "SLUG=$SLUG" echo "EXPECTED_TAG=ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG" # Persist for later steps echo "$BRANCH" > /tmp/dryrun-baseline/branch-name.txt echo "$SLUG" > /tmp/dryrun-baseline/slug.txt ``` Expected: BRANCH like `feature/hack-dryrun-1745254321`, SLUG like `hack-dryrun-1745254321`. Persisted. - [ ] **Step 2.2: Create branch with trivial commit** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" # Save current branch so we can return git rev-parse --abbrev-ref HEAD > /tmp/dryrun-baseline/starting-branch.txt BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt) git checkout -b "$BRANCH" echo "" >> docs/QUICKSTART.md git add docs/QUICKSTART.md git commit -m "dryrun: verify per-branch GHCR tag" git push -u origin "$BRANCH" ``` Expected: branch created, one commit, push succeeds with upstream tracking. If push is rejected (e.g. protection), abort. - [ ] **Step 2.3: Wait for release workflow to complete** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt) # Get the most recent run id for this branch + workflow sleep 10 # give GH a moment to register the run RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId') echo "Watching run $RUN_ID" gh run watch "$RUN_ID" --exit-status --interval 15 echo "Workflow exit: $?" ``` Expected: exit status 0 after ~3-5 min. If exit != 0, print the logs: ```bash gh run view "$RUN_ID" --log-failed | tail -100 ``` and abort with message: `Release workflow failed for throwaway branch — investigate before hackathon`. - [ ] **Step 2.4: Verify per-branch tag exists on GHCR** ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) EXPECTED="ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG" docker manifest inspect "$EXPECTED" > /tmp/dryrun-baseline/ghcr-manifest.json DIGEST=$(jq -r '.config.digest // .manifests[0].digest' /tmp/dryrun-baseline/ghcr-manifest.json) echo "Tag exists: $EXPECTED" echo "Digest: $DIGEST" echo "$DIGEST" > /tmp/dryrun-baseline/expected-digest.txt ``` Expected: `docker manifest inspect` returns JSON (exit 0), a non-empty digest is extracted. If the tag is missing, abort with message: `release.yml did not produce :dev- tag — check build-and-push step logs`. - [ ] **Step 2.5: Record Task 2 result** ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) cat >> /tmp/dryrun-report.md <` manifest exists. Proceed. --- ## Task 3: Dev VM Switch Flow **Purpose:** Simulate the hackathon developer path — have the shared `agnes-dev` VM pick up the per-branch image via the existing auto-upgrade cron, verify the new image is running, then (in Task 6) roll back. **Files touched (reversibly):** - `/data/.env` on `agnes-dev` VM — one-line `AGNES_TAG=` change (rollback is captured in baseline from Step 1.5) - [ ] **Step 3.1: Switch `agnes-dev` `.env` AGNES_TAG to the per-branch tag** ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) NEW_TAG="dev-$SLUG" gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\ sudo cp /data/.env /data/.env.dryrun-bak && \ sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$NEW_TAG|' /data/.env && \ sudo grep -E '^AGNES_TAG=' /data/.env" ``` Expected: final line is `AGNES_TAG=dev-`. If sed didn't match (no `AGNES_TAG=` line existed), abort and manually investigate. - [ ] **Step 3.2: Trigger auto-upgrade cron script immediately** ```bash gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -30" ``` Expected: output shows `docker compose pull` + `docker compose up -d` activity. If the script doesn't exist or errors, abort with message: `auto-upgrade script missing or broken on agnes-dev`. - [ ] **Step 3.3: Wait for app container to become healthy** ```bash # Poll /api/health for up to 90s for i in $(seq 1 30); do STATUS=$(curl -s --max-time 5 http://34.77.94.14:8000/api/health | jq -r '.status' 2>/dev/null || echo "down") echo "[$i/30] status=$STATUS" if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then break fi sleep 3 done [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] || { echo "FAIL: dev never healthy"; exit 1; } ``` Expected: reaches `healthy`/`degraded` within 90s. - [ ] **Step 3.4: Verify the running image is the per-branch one** ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) EXPECTED_DIGEST=$(cat /tmp/dryrun-baseline/expected-digest.txt) RUNNING_IMAGE=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Image}}'") echo "Running image digest: $RUNNING_IMAGE" # The running image line will be sha256:xxxxx. Compare to the manifest digest we recorded. # They should match (or differ only by multi-arch manifest indirection — compare via docker inspect on remote) gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}' && \ docker image inspect \$(docker ps -qf name=app --format '{{.Image}}' | head -1) --format '{{.RepoTags}}{{.RepoDigests}}'" ``` Expected: `RepoTags` or `RepoDigests` output includes either `:dev-$SLUG` or the digest from Step 2.4. If neither matches, the cron didn't pull the new tag — record as FAIL and continue (cleanup is still required). - [ ] **Step 3.5: Record Task 3 result** The agent must judge PASS/FAIL based on Step 3.4 output: PASS iff `RepoTags` or `RepoDigests` contained `:dev-$SLUG` or the digest captured in Step 2.4. ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) # Replace with PASS or FAIL based on the Step 3.4 output the agent observed. # Replace with the RepoTags/RepoDigests line from Step 3.4. # Replace with the loop iteration count from Step 3.3 × 3. cat >> /tmp/dryrun-report.md < - Switched agnes-dev to AGNES_TAG=dev-$SLUG - Health after switch: reached healthy/degraded within 90s - Running image: - Time from cron trigger to healthy: s EOF ``` **Task 3 gate:** health reached OK state; running image verified. Proceed even if image verification was inconclusive — rollback still required. --- ## Task 4: Terraform Plan Verification (Private Repo) **Purpose:** Validate that adding a new entry to `dev_instances` produces a clean `terraform plan` (not apply) in `agnes-infra-keboola`. This proves the TF module accepts the variable shape the hackathon docs will recommend. **Skip condition:** If prerequisites check found that `/tmp/agnes-infra-keboola` clone failed, skip this entire task and record `SKIPPED — repo unavailable` in the report. **Files touched (throwaway branch only):** - `/tmp/agnes-infra-keboola/terraform/terraform.tfvars` (throwaway edit) - [ ] **Step 4.1: Create throwaway branch in private repo** ```bash cd /tmp/agnes-infra-keboola git checkout main git pull EPOCH=$(date +%s) BRANCH="dryrun-tfplan-${EPOCH}" echo "$BRANCH" > /tmp/dryrun-baseline/tf-branch.txt git checkout -b "$BRANCH" ``` Expected: clean checkout of main, new branch created. - [ ] **Step 4.2: Add throwaway dev_instance entry** Read `terraform/terraform.tfvars` first to understand the current `dev_instances` shape. Then append a new entry. The `dev_instances` variable schema (from `infra/modules/customer-instance/variables.tf:41-49`) is: ```hcl list(object({ name = string machine_type = optional(string, "e2-small") image_tag = optional(string, "dev") })) ``` Modify the `dev_instances` list to append: ```hcl { name = "agnes-hack-dryrun", image_tag = "dev-" } ``` The agent should detect the current tfvars format and insert accordingly. If the file does not already contain `dev_instances`, abort and report format-mismatch. ```bash SLUG=$(cat /tmp/dryrun-baseline/slug.txt) # Show current tfvars for context cat /tmp/agnes-infra-keboola/terraform/terraform.tfvars | grep -A 20 "dev_instances" # Agent must edit the file to add the new entry — use the Edit tool rather than sed to be safe. ``` After editing, show the diff: ```bash cd /tmp/agnes-infra-keboola git diff terraform/terraform.tfvars ``` Expected: diff adds exactly one new entry to `dev_instances` list with `name = "agnes-hack-dryrun"` and `image_tag = "dev-"`. - [ ] **Step 4.3: Run `terraform plan` locally (no apply)** ```bash cd /tmp/agnes-infra-keboola/terraform export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.agnes-keys/agnes-deploy-kids-ai-data-analysis-key.json" [ -f "$GOOGLE_APPLICATION_CREDENTIALS" ] || { echo "SA key not found — skipping plan"; exit 2; } terraform init -input=false -upgrade=false terraform plan -input=false -no-color -out=/tmp/dryrun-tfplan.bin > /tmp/dryrun-tfplan.txt 2>&1 RC=$? echo "terraform plan exit: $RC" tail -40 /tmp/dryrun-tfplan.txt ``` Expected: - exit 0 or 2 (2 = changes detected, which is what we want) - output ends with `Plan: N to add, M to change, K to destroy.` where `N >= 1` (at least the new VM + disk + IP) and `K == 0` (we must NOT be destroying anything) If `K > 0` or `terraform plan` errors, abort and DO NOT proceed to Step 4.4. Report the plan output verbatim in the final report. - [ ] **Step 4.4: Discard throwaway branch (no push, no apply)** ```bash cd /tmp/agnes-infra-keboola git checkout main BRANCH=$(cat /tmp/dryrun-baseline/tf-branch.txt) git branch -D "$BRANCH" # Branch was never pushed, so nothing to clean up remotely. ``` Expected: branch deleted locally, main is current, working tree clean. - [ ] **Step 4.5: Record Task 4 result** ```bash ADDS=$(grep -E "Plan:" /tmp/dryrun-tfplan.txt | head -1) DESTROYS_OK=$(grep -E "Plan:.*0 to destroy" /tmp/dryrun-tfplan.txt && echo yes || echo no) cat >> /tmp/dryrun-report.md < - Plan summary: $ADDS - Zero destroys: $DESTROYS_OK - Full plan output: see /tmp/dryrun-tfplan.txt EOF ``` **Task 4 gate:** plan produced with 0 destroys and ≥1 add. Proceed. --- ## Task 5: Verify Smoke-Test Gate Blocks Broken PR **Purpose:** Confirm that a pull request with a deliberately-failing test does NOT produce a passing CI — which is the safety net that keeps `:stable` from auto-promoting broken images to prod. **Files touched (throwaway branch only):** - `tests/test_dryrun_should_fail.py` (new file on throwaway branch) **Important:** This task creates a PR (not a merge). The PR is closed without merging in Step 5.5. - [ ] **Step 5.1: Create throwaway branch with failing test** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" git checkout main git pull EPOCH=$(date +%s) BRANCH="dryrun-break-smoke-${EPOCH}" echo "$BRANCH" > /tmp/dryrun-baseline/smoke-branch.txt git checkout -b "$BRANCH" cat > tests/test_dryrun_should_fail.py <<'PYEOF' def test_intentional_fail_for_dryrun(): """Intentional failure to verify CI gate blocks broken PRs. Remove after dryrun.""" assert False, "dryrun: this test is supposed to fail" PYEOF git add tests/test_dryrun_should_fail.py git commit -m "dryrun: intentional failing test (will be reverted)" git push -u origin "$BRANCH" ``` Expected: push succeeds. - [ ] **Step 5.2: Open PR** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" PR_URL=$(gh pr create --title "dryrun: verify CI gate (DO NOT MERGE)" \ --body "Intentionally failing test to verify CI blocks bad merges. Will be closed immediately after CI result." \ --base main) echo "$PR_URL" > /tmp/dryrun-baseline/pr-url.txt echo "Opened: $PR_URL" ``` Expected: PR URL returned. - [ ] **Step 5.3: Wait for CI `test` job to complete (expected: FAIL)** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt) sleep 15 RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId') echo "Watching run $RUN_ID (expected to FAIL)" # Use --exit-status WITHOUT `set -e`; we expect non-zero set +e gh run watch "$RUN_ID" --exit-status --interval 15 EXIT=$? set -e echo "Exit code: $EXIT (non-zero is EXPECTED here)" ``` Expected: exit code != 0. If exit code IS 0, that means CI passed despite `assert False` → the test suite is not being run, or the file was excluded → record as **FAIL — CI gate broken**. - [ ] **Step 5.4: Verify PR mergeability check shows failure** ```bash PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt) PR_NUM=$(basename "$PR_URL") STATE=$(gh pr view "$PR_NUM" --json statusCheckRollup --jq '.statusCheckRollup[] | select(.name=="test") | .conclusion') echo "test job conclusion: $STATE" ``` Expected: `FAILURE`. If `SUCCESS`, the gate is broken. - [ ] **Step 5.5: Close PR and delete branch** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt) PR_NUM=$(basename "$PR_URL") gh pr close "$PR_NUM" --delete-branch --comment "dryrun complete — CI gate verified, closing without merge" # Also delete locally git checkout main BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt) git branch -D "$BRANCH" 2>/dev/null || true ``` Expected: PR closed, local branch gone. - [ ] **Step 5.6: Check whether `main` has required status checks configured** ```bash gh api repos/keboola/agnes-the-ai-analyst/branches/main/protection 2>/tmp/dryrun-protection-err.txt > /tmp/dryrun-protection.json RC=$? if [ $RC -ne 0 ]; then echo "No branch protection on main (or insufficient permissions to read it)" cat /tmp/dryrun-protection-err.txt PROTECTION_NOTE="NONE — branch is unprotected; broken PRs can be merged. Recommend adding 'test' as required status check." else REQUIRED=$(jq -r '.required_status_checks.contexts[]?' /tmp/dryrun-protection.json 2>/dev/null | tr '\n' ',') echo "Required checks: $REQUIRED" if echo "$REQUIRED" | grep -q "test"; then PROTECTION_NOTE="OK — 'test' is required." else PROTECTION_NOTE="PARTIAL — protection exists but 'test' is not required. Contexts: $REQUIRED" fi fi echo "$PROTECTION_NOTE" > /tmp/dryrun-baseline/protection-note.txt ``` Expected: note written. Does not abort — informational only. - [ ] **Step 5.7: Record Task 5 result** ```bash cat >> /tmp/dryrun-report.md < - Throwaway PR: $(cat /tmp/dryrun-baseline/pr-url.txt) (closed) - CI 'test' job result on broken code: - Branch protection on main: $(cat /tmp/dryrun-baseline/protection-note.txt) EOF ``` **Task 5 gate:** broken PR's CI status is FAILURE. Proceed. If `PROTECTION_NOTE` says NONE/PARTIAL, the final report must flag this as a **hackathon-blocking recommendation**. --- ## Task 6: Cleanup and Baseline Restoration **Purpose:** Leave the system in exactly the state recorded in Task 1. This is the most important task — a dirty dry-run poisons the hackathon. - [ ] **Step 6.1: Restore `agnes-dev` AGNES_TAG** ```bash ORIG_LINE=$(cat /tmp/dryrun-baseline/dev-env.txt) # ORIG_LINE looks like: AGNES_TAG=dev ORIG_VALUE=$(echo "$ORIG_LINE" | cut -d= -f2-) gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\ sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$ORIG_VALUE|' /data/.env && \ sudo rm -f /data/.env.dryrun-bak && \ sudo grep -E '^AGNES_TAG=' /data/.env && \ sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -20" ``` Expected: AGNES_TAG line matches original, auto-upgrade pulls back to the original tag. - [ ] **Step 6.2: Wait for dev VM to return to healthy state on original tag** ```bash for i in $(seq 1 30); do STATUS=$(curl -s --max-time 5 http://34.77.94.14:8000/api/health | jq -r '.status' 2>/dev/null || echo down) echo "[$i/30] status=$STATUS" [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] && break sleep 3 done ``` Expected: reaches healthy/degraded within 90s. - [ ] **Step 6.3: Verify running image matches baseline** ```bash RESTORED=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'") ORIG=$(cat /tmp/dryrun-baseline/dev-image.txt) echo "Restored: $RESTORED" echo "Original: $ORIG" [ "$RESTORED" = "$ORIG" ] && echo MATCH || echo "MISMATCH — investigate" ``` Expected: MATCH. If MISMATCH, the baseline-tag digest may have advanced (auto-upgrade pulled newer `:stable`/`:dev` floating image during the run) — that is acceptable as long as the `.Config.Image` *tag* matches. Record exact difference in report. - [ ] **Step 6.4: Delete throwaway branches in public repo** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" STARTING=$(cat /tmp/dryrun-baseline/starting-branch.txt) git checkout "$STARTING" FEAT_BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt) SMOKE_BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt 2>/dev/null || echo "") # Local delete git branch -D "$FEAT_BRANCH" 2>/dev/null || true [ -n "$SMOKE_BRANCH" ] && git branch -D "$SMOKE_BRANCH" 2>/dev/null || true # Remote delete (smoke branch was already deleted via `gh pr close --delete-branch` in Step 5.5) git push origin --delete "$FEAT_BRANCH" 2>/dev/null || echo "(feature branch already gone)" ``` Expected: local branches gone, remote feature branch deleted. QUICKSTART.md commit on throwaway branch vanishes from origin. - [ ] **Step 6.5: Final health check on prod (must match baseline)** ```bash curl -sf --max-time 10 http://34.77.102.61:8000/api/health > /tmp/dryrun-baseline/prod-health-after.json BEFORE=$(jq -r '.status' /tmp/dryrun-baseline/prod-health.json) AFTER=$(jq -r '.status' /tmp/dryrun-baseline/prod-health-after.json) echo "Prod status before: $BEFORE / after: $AFTER" [ "$BEFORE" = "$AFTER" ] && echo UNCHANGED || echo DRIFT ``` Expected: UNCHANGED. (Note: prod was never touched, so this is sanity only.) - [ ] **Step 6.6: Record Task 6 result** ```bash cat >> /tmp/dryrun-report.md < - agnes-dev AGNES_TAG restored to: $(cat /tmp/dryrun-baseline/dev-env.txt) - agnes-dev health after restore: $(curl -s --max-time 5 http://34.77.94.14:8000/api/health | jq -r '.status') - agnes-dev image: matches baseline? - Throwaway branches deleted: feature, smoke - Prod status unchanged: EOF ``` **Task 6 gate:** dev VM back on its baseline tag, branches gone, prod untouched. --- ## Task 7: Generate Deliverables **Purpose:** Produce the artefacts the user needs tomorrow: a helper script for the hackathon team and a consolidated report. **Files:** - Create: `scripts/switch-dev-vm.sh` (new) - Create (already being built): `/tmp/dryrun-report.md` - [ ] **Step 7.1: Write `scripts/switch-dev-vm.sh`** Create file at `scripts/switch-dev-vm.sh`: ```bash #!/usr/bin/env bash # switch-dev-vm.sh — point the shared hackathon dev VM at the caller's branch image. # # Usage: # scripts/switch-dev-vm.sh # scripts/switch-dev-vm.sh hack-zs-metrics # # Prerequisite: your branch has been pushed and the release.yml workflow has completed, # producing ghcr.io/keboola/agnes-the-ai-analyst:dev-. # # The slug is derived from your branch name by stripping the leading "feature/" and # replacing non-alphanumeric chars with "-". For branch "feature/hack-zs-metrics" the slug # is "hack-zs-metrics". set -euo pipefail if [ $# -ne 1 ]; then echo "Usage: $0 " >&2 echo "Example: $0 hack-zs-metrics" >&2 exit 2 fi SLUG="$1" VM="agnes-dev" ZONE="europe-west1-b" TAG="dev-$SLUG" IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$TAG" echo "[1/4] Verifying $IMAGE exists on GHCR..." docker manifest inspect "$IMAGE" > /dev/null || { echo "ERROR: $IMAGE not found on GHCR. Did your release.yml run finish?" >&2 echo "Check: gh run list --branch feature/$SLUG --workflow release.yml" >&2 exit 1 } echo "[2/4] Updating AGNES_TAG on $VM to $TAG..." gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command "\ sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$TAG|' /data/.env && \ sudo grep -E '^AGNES_TAG=' /data/.env" echo "[3/4] Triggering auto-upgrade..." gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command \ "sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -10" echo "[4/4] Waiting for app to become healthy..." for i in $(seq 1 30); do STATUS=$(curl -s --max-time 5 http://34.77.94.14:8000/api/health | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","down"))' 2>/dev/null || echo down) echo " [$i/30] status=$STATUS" if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then echo "OK — agnes-dev now running $TAG. Open http://34.77.94.14:8000" exit 0 fi sleep 3 done echo "ERROR: agnes-dev did not become healthy in 90s. SSH in and check: docker compose logs" >&2 exit 1 ``` ```bash chmod +x scripts/switch-dev-vm.sh bash -n scripts/switch-dev-vm.sh # syntax check ``` Expected: syntax-check passes, file executable. - [ ] **Step 7.2: Commit the script on a fresh branch and open PR** ```bash cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" git checkout -b feature/hackathon-dryrun-deliverables git add scripts/switch-dev-vm.sh git commit -m "chore: add switch-dev-vm.sh helper for hackathon" git push -u origin HEAD gh pr create --title "chore: add switch-dev-vm.sh helper for hackathon" \ --body "Adds scripts/switch-dev-vm.sh. Produced by the 2026-04-21 hackathon dry-run. Reviewed by user before merge." \ --base main > /tmp/dryrun-baseline/deliverable-pr.txt cat /tmp/dryrun-baseline/deliverable-pr.txt ``` Expected: PR URL. **Do not merge** — leave for user review. - [ ] **Step 7.3: Finalise report with overall verdict** Determine overall verdict by inspecting each Task's PASS/FAIL line in `/tmp/dryrun-report.md`. Overall is PASS only if all tasks PASS (SKIPPED Task 4 is acceptable — note it). Append to report: ```bash cat >> /tmp/dryrun-report.md < ## Recommendations for the User Before Hackathon Starts 1. Configure required status check 'test' on main branch of keboola/agnes-the-ai-analyst. 2. Pin prod image_tag in agnes-infra-keboola/terraform/terraform.tfvars from "stable" to "stable-2026.04.XX" (current running version). Revert after hackathon. 3. Rotate admin password '1234' on prod (34.77.102.61:8000/login) and dev (34.77.94.14:8000/login). 4. Wire notification_channel_ids in tfvars so uptime alerts actually notify someone. 5. Share the hackathon 1-pager + switch-dev-vm.sh via the team Slack channel. 6. Review PR $(cat /tmp/dryrun-baseline/deliverable-pr.txt) and merge if switch-dev-vm.sh looks good. ## Artefacts - Full report: /tmp/dryrun-report.md (this file) - Baseline snapshots: /tmp/dryrun-baseline/*.{json,txt} - TF plan output: /tmp/dryrun-tfplan.txt (if Task 4 ran) - Deliverable PR: $(cat /tmp/dryrun-baseline/deliverable-pr.txt) EOF cat /tmp/dryrun-report.md ``` Expected: full report printed. - [ ] **Step 7.4: Print final summary to chat** Agent should output, in its final message to the user: - Overall verdict (one line) - Each task's result (one line each) - Any unresolved anomalies - Link to deliverable PR - Path to full report **Task 7 gate:** report complete, PR open, all artefacts listed. --- ## Abort / Rollback Procedures If any task fails mid-execution, the agent must still perform Task 6 cleanup before reporting failure. Specifically: - If Task 2 push succeeded but Task 3 failed → still run Task 6 Steps 6.1-6.4 to restore dev VM and delete the branch. - If Task 5 PR was opened but workflow didn't finish → close the PR with `gh pr close --delete-branch` and log it. - If Task 4 TF plan showed destroys → abort immediately, do NOT attempt apply, record in report, continue to Task 6. If Task 6 itself fails (dev VM won't come back healthy on original tag), the agent must: 1. Print the baseline values (from `/tmp/dryrun-baseline/dev-env.txt`, `/tmp/dryrun-baseline/dev-image.txt`) so the user can manually SSH and fix. 2. Attempt `gcloud compute ssh agnes-dev --zone=europe-west1-b --command "docker compose -f /opt/agnes/docker-compose.yml logs --tail 100"` and include output in the report. 3. Mark overall verdict as FAIL and stop. ## What a Successful Run Looks Like - Task 1 baseline: captured with prod+dev healthy/degraded - Task 2: GHCR manifest exists for `:dev-hack-dryrun-` - Task 3: agnes-dev briefly running the per-branch image, healthy within 90s - Task 4: `terraform plan` showed `1+ to add, 0 to destroy` (or SKIPPED) - Task 5: CI `test` job reported FAILURE on the broken PR, PR closed - Task 6: agnes-dev back on its baseline AGNES_TAG, healthy, branches gone - Task 7: `scripts/switch-dev-vm.sh` committed on PR for user review, full report in `/tmp/dryrun-report.md` - Final agent message: verdict + 6 bullet results + deliverable PR link Duration: ~45-75 minutes, bounded primarily by CI workflow runs (~3-5 min each, two runs) and TF init (~30s-2min cold).