CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release sections collapsed to one, stale v1->v35 schema history dropped (it lives in CHANGELOG), marketplace endpoint internals and verbose process sections moved out or tightened. New focused docs: - docs/RELEASING.md - release process, deploy workflows, CI quirks (RELEASE_TEMPLATE.md folded in as an appendix) - docs/marketplace.md - marketplace ingestion + re-serving internals - docs/README.md - documentation index by audience, linked from README.md and CLAUDE.md Archived under docs/archive/: docs/superpowers/ (52 historical planning artifacts), HACKATHON.md, pd-ps-comments.md, security-audit-2026-04.md, future/NOTIFICATIONS.md. Removed the docs/auto-install.md stub. Fixed dangling links in connectors/jira/README.md and dev_docs/README.md, repointed code/doc references to archived paths.
848 lines
34 KiB
Markdown
848 lines
34 KiB
Markdown
# Hackathon E2E Dry-Run Plan
|
||
|
||
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
|
||
|
||
**Goal:** Validate the full developer→dev-VM→merge→prod flow end-to-end the day before a multi-developer hackathon, so any broken link is found and fixed before participants arrive.
|
||
|
||
**Architecture:** This is an operational dry-run, not a code feature. The executing agent pushes a throwaway feature branch to the public repo, verifies that CI produces a per-branch Docker image tag on GHCR, switches the shared `agnes-dev` VM onto that tag via the existing auto-upgrade cron, verifies that the CI test gate blocks a deliberately-broken PR from reaching `:stable`, and produces a helper script + report. The plan is **strictly non-destructive for prod** — prod-pinning (point 6 of the original outline) is explicitly out of scope and left to the user.
|
||
|
||
**Tech Stack:** Bash / `gcloud` / `gh` / `git` / `docker` / `curl` / Python (`pytest`) / Terraform (plan only, no apply). No app code changes.
|
||
|
||
---
|
||
|
||
## Out of Scope (do NOT do)
|
||
|
||
- Any `terraform apply` against real infrastructure. TF `plan` is allowed; TF `apply` is forbidden.
|
||
- Pinning `prod_instance.image_tag` in `agnes-infra-keboola`. User will do this themselves after the dry-run succeeds.
|
||
- Rotating admin passwords, Keboola tokens, or JWT secrets.
|
||
- Modifying `main` branch of any repo. All changes happen on throwaway branches, which are deleted at the end.
|
||
- Creating new GCP resources (VMs, disks, IPs, secrets, SAs).
|
||
|
||
If any step would require doing one of the above, **STOP and ask the user**.
|
||
|
||
---
|
||
|
||
## Prerequisites
|
||
|
||
Before starting, the executing agent MUST verify all of the following. If any fails, abort and report which prerequisite is missing — do NOT try to fix it.
|
||
|
||
- [ ] **Working directory** is the `tmp_oss` checkout at `/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss`. Current branch can be anything; the plan will create a new branch.
|
||
|
||
- [ ] **`gh auth status`** shows authenticated, with `workflow` scope. Run:
|
||
|
||
```bash
|
||
gh auth status 2>&1 | grep -E "(Logged in|Token scopes)"
|
||
```
|
||
|
||
Expected: line containing `Logged in to github.com` and a line listing scopes that include `workflow`. If `workflow` scope is missing, abort with message: `Run: gh auth refresh -h github.com -s workflow`.
|
||
|
||
- [ ] **`gcloud` authenticated** to project `internal-prod`. Run:
|
||
|
||
```bash
|
||
gcloud config get-value project
|
||
gcloud auth list --filter=status:ACTIVE --format="value(account)"
|
||
```
|
||
|
||
Expected: project is `internal-prod`, at least one active account. If not, abort with message: `Run: gcloud config set project internal-prod && gcloud auth login`.
|
||
|
||
- [ ] **SSH to `agnes-dev` works** (OS Login). Run:
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --command="echo ok" --quiet
|
||
```
|
||
|
||
Expected: output contains `ok`. First connection may take ~20s while OS Login provisions. If fails with permission error, abort with message: `User needs compute.osLogin role on agnes-dev VM`.
|
||
|
||
- [ ] **`docker` CLI available** locally (for `docker manifest inspect`). Run: `docker --version`. Expected: version output. If missing, abort.
|
||
|
||
- [ ] **Public GHCR pull works**. Run:
|
||
|
||
```bash
|
||
docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:stable > /dev/null && echo ok
|
||
```
|
||
|
||
Expected: `ok`. If fails, abort — something is wrong with public image visibility.
|
||
|
||
- [ ] **Clone of `agnes-infra-keboola` exists or can be cloned** at `/tmp/agnes-infra-keboola`. Run:
|
||
|
||
```bash
|
||
if [ ! -d /tmp/agnes-infra-keboola ]; then
|
||
gh repo clone keboola/agnes-infra-keboola /tmp/agnes-infra-keboola
|
||
fi
|
||
cd /tmp/agnes-infra-keboola && git status --short
|
||
```
|
||
|
||
Expected: clone succeeds, `git status` is clean. If clone fails, skip Task 4 (TF plan verification) and note it in the final report.
|
||
|
||
**Gate:** All 7 prerequisite checks pass, OR the agent has clearly reported which ones failed and reduced scope accordingly. Only then proceed to Task 1.
|
||
|
||
---
|
||
|
||
## Task 1: Baseline Snapshot
|
||
|
||
**Purpose:** Record the current state of both VMs and the TF outputs so the agent can detect drift at the end and prove it left everything as it found it.
|
||
|
||
**Files:**
|
||
- Create: `/tmp/dryrun-baseline/prod-health.json`
|
||
- Create: `/tmp/dryrun-baseline/dev-health.json`
|
||
- Create: `/tmp/dryrun-baseline/prod-image.txt`
|
||
- Create: `/tmp/dryrun-baseline/dev-image.txt`
|
||
- Create: `/tmp/dryrun-baseline/dev-env.txt`
|
||
|
||
- [ ] **Step 1.1: Create baseline directory**
|
||
|
||
```bash
|
||
mkdir -p /tmp/dryrun-baseline
|
||
```
|
||
|
||
- [ ] **Step 1.2: Capture prod health**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health.json
|
||
cat /tmp/dryrun-baseline/prod-health.json | python3 -m json.tool
|
||
```
|
||
|
||
Expected: JSON with `"status"` field equal to `"healthy"` or `"degraded"`. If `"unhealthy"` or curl times out, abort with message: `Prod is not in acceptable baseline state — investigate before dry-run`.
|
||
|
||
- [ ] **Step 1.3: Capture dev health**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<dev-vm-ip>:8000/api/health > /tmp/dryrun-baseline/dev-health.json
|
||
cat /tmp/dryrun-baseline/dev-health.json | python3 -m json.tool
|
||
```
|
||
|
||
Expected: JSON with `"status"` in `{healthy, degraded}`. Same abort condition as 1.2.
|
||
|
||
- [ ] **Step 1.4: Capture current image tags on both VMs**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-prod --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
|
||
> /tmp/dryrun-baseline/prod-image.txt
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
|
||
> /tmp/dryrun-baseline/dev-image.txt
|
||
cat /tmp/dryrun-baseline/prod-image.txt /tmp/dryrun-baseline/dev-image.txt
|
||
```
|
||
|
||
Expected: each file contains exactly one line like `ghcr.io/keboola/agnes-the-ai-analyst:stable` or `:stable-2026.04.XX`. Non-empty.
|
||
|
||
- [ ] **Step 1.5: Capture `agnes-dev` `.env` AGNES_TAG line**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"sudo grep -E '^AGNES_TAG=' /data/.env || echo 'AGNES_TAG_NOT_SET'" \
|
||
> /tmp/dryrun-baseline/dev-env.txt
|
||
cat /tmp/dryrun-baseline/dev-env.txt
|
||
```
|
||
|
||
Expected: output is `AGNES_TAG=dev` or similar. Record exact value for restoration in Task 6. If `AGNES_TAG_NOT_SET`, abort — the VM is in an unknown config state.
|
||
|
||
- [ ] **Step 1.6: Record baseline to report buffer**
|
||
|
||
Append to a running report at `/tmp/dryrun-report.md` (create if not exists):
|
||
|
||
```bash
|
||
cat > /tmp/dryrun-report.md <<EOF
|
||
# Hackathon Dry-Run Report
|
||
|
||
**Run at:** $(date -u +"%Y-%m-%dT%H:%M:%SZ")
|
||
|
||
## Baseline (Task 1)
|
||
|
||
- Prod health status: $(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
|
||
- Dev health status: $(jq -r '.status' /tmp/dryrun-baseline/dev-health.json)
|
||
- Prod image: $(cat /tmp/dryrun-baseline/prod-image.txt)
|
||
- Dev image: $(cat /tmp/dryrun-baseline/dev-image.txt)
|
||
- Dev AGNES_TAG: $(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
|
||
EOF
|
||
cat /tmp/dryrun-report.md
|
||
```
|
||
|
||
Expected: report file exists, all fields populated (no empty values).
|
||
|
||
**Task 1 gate:** baseline directory has 5 non-empty files, report has 5 non-empty bullet lines. Proceed.
|
||
|
||
---
|
||
|
||
## Task 2: Verify Per-Branch GHCR Build
|
||
|
||
**Purpose:** Push a throwaway feature branch to the public repo, wait for the release workflow, and confirm that the per-branch `:dev-<slug>` tag appears on GHCR.
|
||
|
||
**Files:**
|
||
- Create (throwaway): branch `feature/hack-dryrun-<timestamp>` in `tmp_oss` + one trivial commit touching `docs/QUICKSTART.md`
|
||
|
||
**Branch naming:** the agent MUST use `feature/hack-dryrun-<epoch>` (e.g. `feature/hack-dryrun-1745254321`) so the slug is unique per run and cleanup is deterministic.
|
||
|
||
- [ ] **Step 2.1: Compute branch name and expected slug**
|
||
|
||
Per `.github/workflows/release.yml:92-98` logic: strip `feature/` prefix, sanitise `[^a-zA-Z0-9-]` to `-`, lowercase, cut 50 chars.
|
||
|
||
```bash
|
||
EPOCH=$(date +%s)
|
||
BRANCH="feature/hack-dryrun-${EPOCH}"
|
||
SLUG=$(echo "$BRANCH" | sed 's|^feature/||' | sed 's|[^a-zA-Z0-9-]|-|g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)
|
||
echo "BRANCH=$BRANCH"
|
||
echo "SLUG=$SLUG"
|
||
echo "EXPECTED_TAG=ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
|
||
# Persist for later steps
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/branch-name.txt
|
||
echo "$SLUG" > /tmp/dryrun-baseline/slug.txt
|
||
```
|
||
|
||
Expected: BRANCH like `feature/hack-dryrun-1745254321`, SLUG like `hack-dryrun-1745254321`. Persisted.
|
||
|
||
- [ ] **Step 2.2: Create branch with trivial commit**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
# Save current branch so we can return
|
||
git rev-parse --abbrev-ref HEAD > /tmp/dryrun-baseline/starting-branch.txt
|
||
BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
git checkout -b "$BRANCH"
|
||
echo "<!-- dryrun $(date -u +%FT%TZ) -->" >> docs/QUICKSTART.md
|
||
git add docs/QUICKSTART.md
|
||
git commit -m "dryrun: verify per-branch GHCR tag"
|
||
git push -u origin "$BRANCH"
|
||
```
|
||
|
||
Expected: branch created, one commit, push succeeds with upstream tracking. If push is rejected (e.g. protection), abort.
|
||
|
||
- [ ] **Step 2.3: Wait for release workflow to complete**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
# Get the most recent run id for this branch + workflow
|
||
sleep 10 # give GH a moment to register the run
|
||
RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
|
||
echo "Watching run $RUN_ID"
|
||
gh run watch "$RUN_ID" --exit-status --interval 15
|
||
echo "Workflow exit: $?"
|
||
```
|
||
|
||
Expected: exit status 0 after ~3-5 min. If exit != 0, print the logs:
|
||
|
||
```bash
|
||
gh run view "$RUN_ID" --log-failed | tail -100
|
||
```
|
||
|
||
and abort with message: `Release workflow failed for throwaway branch — investigate before hackathon`.
|
||
|
||
- [ ] **Step 2.4: Verify per-branch tag exists on GHCR**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
EXPECTED="ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
|
||
docker manifest inspect "$EXPECTED" > /tmp/dryrun-baseline/ghcr-manifest.json
|
||
DIGEST=$(jq -r '.config.digest // .manifests[0].digest' /tmp/dryrun-baseline/ghcr-manifest.json)
|
||
echo "Tag exists: $EXPECTED"
|
||
echo "Digest: $DIGEST"
|
||
echo "$DIGEST" > /tmp/dryrun-baseline/expected-digest.txt
|
||
```
|
||
|
||
Expected: `docker manifest inspect` returns JSON (exit 0), a non-empty digest is extracted. If the tag is missing, abort with message: `release.yml did not produce :dev-<slug> tag — check build-and-push step logs`.
|
||
|
||
- [ ] **Step 2.5: Record Task 2 result**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 2: Per-Branch GHCR Build — PASS
|
||
|
||
- Branch: $(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
- Slug: $SLUG
|
||
- Tag: ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG
|
||
- Digest: $(cat /tmp/dryrun-baseline/expected-digest.txt)
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 2 gate:** `:dev-<slug>` manifest exists. Proceed.
|
||
|
||
---
|
||
|
||
## Task 3: Dev VM Switch Flow
|
||
|
||
**Purpose:** Simulate the hackathon developer path — have the shared `agnes-dev` VM pick up the per-branch image via the existing auto-upgrade cron, verify the new image is running, then (in Task 6) roll back.
|
||
|
||
**Files touched (reversibly):**
|
||
- `/data/.env` on `agnes-dev` VM — one-line `AGNES_TAG=` change (rollback is captured in baseline from Step 1.5)
|
||
|
||
- [ ] **Step 3.1: Switch `agnes-dev` `.env` AGNES_TAG to the per-branch tag**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
NEW_TAG="dev-$SLUG"
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\
|
||
sudo cp /data/.env /data/.env.dryrun-bak && \
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$NEW_TAG|' /data/.env && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env"
|
||
```
|
||
|
||
Expected: final line is `AGNES_TAG=dev-<slug>`. If sed didn't match (no `AGNES_TAG=` line existed), abort and manually investigate.
|
||
|
||
- [ ] **Step 3.2: Trigger auto-upgrade cron script immediately**
|
||
|
||
```bash
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -30"
|
||
```
|
||
|
||
Expected: output shows `docker compose pull` + `docker compose up -d` activity. If the script doesn't exist or errors, abort with message: `auto-upgrade script missing or broken on agnes-dev`.
|
||
|
||
- [ ] **Step 3.3: Wait for app container to become healthy**
|
||
|
||
```bash
|
||
# Poll /api/health for up to 90s
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo "down")
|
||
echo "[$i/30] status=$STATUS"
|
||
if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
|
||
break
|
||
fi
|
||
sleep 3
|
||
done
|
||
[ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] || { echo "FAIL: dev never healthy"; exit 1; }
|
||
```
|
||
|
||
Expected: reaches `healthy`/`degraded` within 90s.
|
||
|
||
- [ ] **Step 3.4: Verify the running image is the per-branch one**
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
EXPECTED_DIGEST=$(cat /tmp/dryrun-baseline/expected-digest.txt)
|
||
RUNNING_IMAGE=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Image}}'")
|
||
echo "Running image digest: $RUNNING_IMAGE"
|
||
# The running image line will be sha256:xxxxx. Compare to the manifest digest we recorded.
|
||
# They should match (or differ only by multi-arch manifest indirection — compare via docker inspect on remote)
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}' && \
|
||
docker image inspect \$(docker ps -qf name=app --format '{{.Image}}' | head -1) --format '{{.RepoTags}}{{.RepoDigests}}'"
|
||
```
|
||
|
||
Expected: `RepoTags` or `RepoDigests` output includes either `:dev-$SLUG` or the digest from Step 2.4. If neither matches, the cron didn't pull the new tag — record as FAIL and continue (cleanup is still required).
|
||
|
||
- [ ] **Step 3.5: Record Task 3 result**
|
||
|
||
The agent must judge PASS/FAIL based on Step 3.4 output: PASS iff `RepoTags` or `RepoDigests` contained `:dev-$SLUG` or the digest captured in Step 2.4.
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
# Replace <RESULT> with PASS or FAIL based on the Step 3.4 output the agent observed.
|
||
# Replace <IMAGE_OUTPUT> with the RepoTags/RepoDigests line from Step 3.4.
|
||
# Replace <SECONDS> with the loop iteration count from Step 3.3 × 3.
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 3: Dev VM Switch — <RESULT>
|
||
|
||
- Switched agnes-dev to AGNES_TAG=dev-$SLUG
|
||
- Health after switch: reached healthy/degraded within 90s
|
||
- Running image: <IMAGE_OUTPUT>
|
||
- Time from cron trigger to healthy: <SECONDS>s
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 3 gate:** health reached OK state; running image verified. Proceed even if image verification was inconclusive — rollback still required.
|
||
|
||
---
|
||
|
||
## Task 4: Terraform Plan Verification (Private Repo)
|
||
|
||
**Purpose:** Validate that adding a new entry to `dev_instances` produces a clean `terraform plan` (not apply) in `agnes-infra-keboola`. This proves the TF module accepts the variable shape the hackathon docs will recommend.
|
||
|
||
**Skip condition:** If prerequisites check found that `/tmp/agnes-infra-keboola` clone failed, skip this entire task and record `SKIPPED — repo unavailable` in the report.
|
||
|
||
**Files touched (throwaway branch only):**
|
||
- `/tmp/agnes-infra-keboola/terraform/terraform.tfvars` (throwaway edit)
|
||
|
||
- [ ] **Step 4.1: Create throwaway branch in private repo**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git checkout main
|
||
git pull
|
||
EPOCH=$(date +%s)
|
||
BRANCH="dryrun-tfplan-${EPOCH}"
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/tf-branch.txt
|
||
git checkout -b "$BRANCH"
|
||
```
|
||
|
||
Expected: clean checkout of main, new branch created.
|
||
|
||
- [ ] **Step 4.2: Add throwaway dev_instance entry**
|
||
|
||
Read `terraform/terraform.tfvars` first to understand the current `dev_instances` shape. Then append a new entry.
|
||
|
||
The `dev_instances` variable schema (from `infra/modules/customer-instance/variables.tf:41-49`) is:
|
||
```hcl
|
||
list(object({
|
||
name = string
|
||
machine_type = optional(string, "e2-small")
|
||
image_tag = optional(string, "dev")
|
||
}))
|
||
```
|
||
|
||
Modify the `dev_instances` list to append:
|
||
```hcl
|
||
{ name = "agnes-hack-dryrun", image_tag = "dev-<slug-from-task-2>" }
|
||
```
|
||
|
||
The agent should detect the current tfvars format and insert accordingly. If the file does not already contain `dev_instances`, abort and report format-mismatch.
|
||
|
||
```bash
|
||
SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
|
||
# Show current tfvars for context
|
||
cat /tmp/agnes-infra-keboola/terraform/terraform.tfvars | grep -A 20 "dev_instances"
|
||
# Agent must edit the file to add the new entry — use the Edit tool rather than sed to be safe.
|
||
```
|
||
|
||
After editing, show the diff:
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git diff terraform/terraform.tfvars
|
||
```
|
||
|
||
Expected: diff adds exactly one new entry to `dev_instances` list with `name = "agnes-hack-dryrun"` and `image_tag = "dev-<slug>"`.
|
||
|
||
- [ ] **Step 4.3: Run `terraform plan` locally (no apply)**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola/terraform
|
||
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.agnes-keys/agnes-deploy-internal-prod-key.json"
|
||
[ -f "$GOOGLE_APPLICATION_CREDENTIALS" ] || { echo "SA key not found — skipping plan"; exit 2; }
|
||
terraform init -input=false -upgrade=false
|
||
terraform plan -input=false -no-color -out=/tmp/dryrun-tfplan.bin > /tmp/dryrun-tfplan.txt 2>&1
|
||
RC=$?
|
||
echo "terraform plan exit: $RC"
|
||
tail -40 /tmp/dryrun-tfplan.txt
|
||
```
|
||
|
||
Expected:
|
||
- exit 0 or 2 (2 = changes detected, which is what we want)
|
||
- output ends with `Plan: N to add, M to change, K to destroy.` where `N >= 1` (at least the new VM + disk + IP) and `K == 0` (we must NOT be destroying anything)
|
||
|
||
If `K > 0` or `terraform plan` errors, abort and DO NOT proceed to Step 4.4. Report the plan output verbatim in the final report.
|
||
|
||
- [ ] **Step 4.4: Discard throwaway branch (no push, no apply)**
|
||
|
||
```bash
|
||
cd /tmp/agnes-infra-keboola
|
||
git checkout main
|
||
BRANCH=$(cat /tmp/dryrun-baseline/tf-branch.txt)
|
||
git branch -D "$BRANCH"
|
||
# Branch was never pushed, so nothing to clean up remotely.
|
||
```
|
||
|
||
Expected: branch deleted locally, main is current, working tree clean.
|
||
|
||
- [ ] **Step 4.5: Record Task 4 result**
|
||
|
||
```bash
|
||
ADDS=$(grep -E "Plan:" /tmp/dryrun-tfplan.txt | head -1)
|
||
DESTROYS_OK=$(grep -E "Plan:.*0 to destroy" /tmp/dryrun-tfplan.txt && echo yes || echo no)
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 4: TF Plan for New Dev VM — <PASS|SKIPPED|FAIL>
|
||
|
||
- Plan summary: $ADDS
|
||
- Zero destroys: $DESTROYS_OK
|
||
- Full plan output: see /tmp/dryrun-tfplan.txt
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 4 gate:** plan produced with 0 destroys and ≥1 add. Proceed.
|
||
|
||
---
|
||
|
||
## Task 5: Verify Smoke-Test Gate Blocks Broken PR
|
||
|
||
**Purpose:** Confirm that a pull request with a deliberately-failing test does NOT produce a passing CI — which is the safety net that keeps `:stable` from auto-promoting broken images to prod.
|
||
|
||
**Files touched (throwaway branch only):**
|
||
- `tests/test_dryrun_should_fail.py` (new file on throwaway branch)
|
||
|
||
**Important:** This task creates a PR (not a merge). The PR is closed without merging in Step 5.5.
|
||
|
||
- [ ] **Step 5.1: Create throwaway branch with failing test**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
git checkout main
|
||
git pull
|
||
EPOCH=$(date +%s)
|
||
BRANCH="dryrun-break-smoke-${EPOCH}"
|
||
echo "$BRANCH" > /tmp/dryrun-baseline/smoke-branch.txt
|
||
git checkout -b "$BRANCH"
|
||
cat > tests/test_dryrun_should_fail.py <<'PYEOF'
|
||
def test_intentional_fail_for_dryrun():
|
||
"""Intentional failure to verify CI gate blocks broken PRs. Remove after dryrun."""
|
||
assert False, "dryrun: this test is supposed to fail"
|
||
PYEOF
|
||
git add tests/test_dryrun_should_fail.py
|
||
git commit -m "dryrun: intentional failing test (will be reverted)"
|
||
git push -u origin "$BRANCH"
|
||
```
|
||
|
||
Expected: push succeeds.
|
||
|
||
- [ ] **Step 5.2: Open PR**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
PR_URL=$(gh pr create --title "dryrun: verify CI gate (DO NOT MERGE)" \
|
||
--body "Intentionally failing test to verify CI blocks bad merges. Will be closed immediately after CI result." \
|
||
--base main)
|
||
echo "$PR_URL" > /tmp/dryrun-baseline/pr-url.txt
|
||
echo "Opened: $PR_URL"
|
||
```
|
||
|
||
Expected: PR URL returned.
|
||
|
||
- [ ] **Step 5.3: Wait for CI `test` job to complete (expected: FAIL)**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
|
||
sleep 15
|
||
RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
|
||
echo "Watching run $RUN_ID (expected to FAIL)"
|
||
# Use --exit-status WITHOUT `set -e`; we expect non-zero
|
||
set +e
|
||
gh run watch "$RUN_ID" --exit-status --interval 15
|
||
EXIT=$?
|
||
set -e
|
||
echo "Exit code: $EXIT (non-zero is EXPECTED here)"
|
||
```
|
||
|
||
Expected: exit code != 0. If exit code IS 0, that means CI passed despite `assert False` → the test suite is not being run, or the file was excluded → record as **FAIL — CI gate broken**.
|
||
|
||
- [ ] **Step 5.4: Verify PR mergeability check shows failure**
|
||
|
||
```bash
|
||
PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
|
||
PR_NUM=$(basename "$PR_URL")
|
||
STATE=$(gh pr view "$PR_NUM" --json statusCheckRollup --jq '.statusCheckRollup[] | select(.name=="test") | .conclusion')
|
||
echo "test job conclusion: $STATE"
|
||
```
|
||
|
||
Expected: `FAILURE`. If `SUCCESS`, the gate is broken.
|
||
|
||
- [ ] **Step 5.5: Close PR and delete branch**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
|
||
PR_NUM=$(basename "$PR_URL")
|
||
gh pr close "$PR_NUM" --delete-branch --comment "dryrun complete — CI gate verified, closing without merge"
|
||
# Also delete locally
|
||
git checkout main
|
||
BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
|
||
git branch -D "$BRANCH" 2>/dev/null || true
|
||
```
|
||
|
||
Expected: PR closed, local branch gone.
|
||
|
||
- [ ] **Step 5.6: Check whether `main` has required status checks configured**
|
||
|
||
```bash
|
||
gh api repos/keboola/agnes-the-ai-analyst/branches/main/protection 2>/tmp/dryrun-protection-err.txt > /tmp/dryrun-protection.json
|
||
RC=$?
|
||
if [ $RC -ne 0 ]; then
|
||
echo "No branch protection on main (or insufficient permissions to read it)"
|
||
cat /tmp/dryrun-protection-err.txt
|
||
PROTECTION_NOTE="NONE — branch is unprotected; broken PRs can be merged. Recommend adding 'test' as required status check."
|
||
else
|
||
REQUIRED=$(jq -r '.required_status_checks.contexts[]?' /tmp/dryrun-protection.json 2>/dev/null | tr '\n' ',')
|
||
echo "Required checks: $REQUIRED"
|
||
if echo "$REQUIRED" | grep -q "test"; then
|
||
PROTECTION_NOTE="OK — 'test' is required."
|
||
else
|
||
PROTECTION_NOTE="PARTIAL — protection exists but 'test' is not required. Contexts: $REQUIRED"
|
||
fi
|
||
fi
|
||
echo "$PROTECTION_NOTE" > /tmp/dryrun-baseline/protection-note.txt
|
||
```
|
||
|
||
Expected: note written. Does not abort — informational only.
|
||
|
||
- [ ] **Step 5.7: Record Task 5 result**
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 5: CI Gate — <PASS|FAIL>
|
||
|
||
- Throwaway PR: $(cat /tmp/dryrun-baseline/pr-url.txt) (closed)
|
||
- CI 'test' job result on broken code: <FAILURE expected>
|
||
- Branch protection on main: $(cat /tmp/dryrun-baseline/protection-note.txt)
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 5 gate:** broken PR's CI status is FAILURE. Proceed. If `PROTECTION_NOTE` says NONE/PARTIAL, the final report must flag this as a **hackathon-blocking recommendation**.
|
||
|
||
---
|
||
|
||
## Task 6: Cleanup and Baseline Restoration
|
||
|
||
**Purpose:** Leave the system in exactly the state recorded in Task 1. This is the most important task — a dirty dry-run poisons the hackathon.
|
||
|
||
- [ ] **Step 6.1: Restore `agnes-dev` AGNES_TAG**
|
||
|
||
```bash
|
||
ORIG_LINE=$(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
# ORIG_LINE looks like: AGNES_TAG=dev
|
||
ORIG_VALUE=$(echo "$ORIG_LINE" | cut -d= -f2-)
|
||
gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$ORIG_VALUE|' /data/.env && \
|
||
sudo rm -f /data/.env.dryrun-bak && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env && \
|
||
sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -20"
|
||
```
|
||
|
||
Expected: AGNES_TAG line matches original, auto-upgrade pulls back to the original tag.
|
||
|
||
- [ ] **Step 6.2: Wait for dev VM to return to healthy state on original tag**
|
||
|
||
```bash
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo down)
|
||
echo "[$i/30] status=$STATUS"
|
||
[ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] && break
|
||
sleep 3
|
||
done
|
||
```
|
||
|
||
Expected: reaches healthy/degraded within 90s.
|
||
|
||
- [ ] **Step 6.3: Verify running image matches baseline**
|
||
|
||
```bash
|
||
RESTORED=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
|
||
"docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'")
|
||
ORIG=$(cat /tmp/dryrun-baseline/dev-image.txt)
|
||
echo "Restored: $RESTORED"
|
||
echo "Original: $ORIG"
|
||
[ "$RESTORED" = "$ORIG" ] && echo MATCH || echo "MISMATCH — investigate"
|
||
```
|
||
|
||
Expected: MATCH. If MISMATCH, the baseline-tag digest may have advanced (auto-upgrade pulled newer `:stable`/`:dev` floating image during the run) — that is acceptable as long as the `.Config.Image` *tag* matches. Record exact difference in report.
|
||
|
||
- [ ] **Step 6.4: Delete throwaway branches in public repo**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
STARTING=$(cat /tmp/dryrun-baseline/starting-branch.txt)
|
||
git checkout "$STARTING"
|
||
FEAT_BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
|
||
SMOKE_BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt 2>/dev/null || echo "")
|
||
# Local delete
|
||
git branch -D "$FEAT_BRANCH" 2>/dev/null || true
|
||
[ -n "$SMOKE_BRANCH" ] && git branch -D "$SMOKE_BRANCH" 2>/dev/null || true
|
||
# Remote delete (smoke branch was already deleted via `gh pr close --delete-branch` in Step 5.5)
|
||
git push origin --delete "$FEAT_BRANCH" 2>/dev/null || echo "(feature branch already gone)"
|
||
```
|
||
|
||
Expected: local branches gone, remote feature branch deleted. QUICKSTART.md commit on throwaway branch vanishes from origin.
|
||
|
||
- [ ] **Step 6.5: Final health check on prod (must match baseline)**
|
||
|
||
```bash
|
||
curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health-after.json
|
||
BEFORE=$(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
|
||
AFTER=$(jq -r '.status' /tmp/dryrun-baseline/prod-health-after.json)
|
||
echo "Prod status before: $BEFORE / after: $AFTER"
|
||
[ "$BEFORE" = "$AFTER" ] && echo UNCHANGED || echo DRIFT
|
||
```
|
||
|
||
Expected: UNCHANGED. (Note: prod was never touched, so this is sanity only.)
|
||
|
||
- [ ] **Step 6.6: Record Task 6 result**
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
## Task 6: Cleanup — <PASS|FAIL>
|
||
|
||
- agnes-dev AGNES_TAG restored to: $(cat /tmp/dryrun-baseline/dev-env.txt)
|
||
- agnes-dev health after restore: $(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status')
|
||
- agnes-dev image: matches baseline? <MATCH|MISMATCH — paste both>
|
||
- Throwaway branches deleted: feature, smoke
|
||
- Prod status unchanged: <UNCHANGED|DRIFT>
|
||
|
||
EOF
|
||
```
|
||
|
||
**Task 6 gate:** dev VM back on its baseline tag, branches gone, prod untouched.
|
||
|
||
---
|
||
|
||
## Task 7: Generate Deliverables
|
||
|
||
**Purpose:** Produce the artefacts the user needs tomorrow: a helper script for the hackathon team and a consolidated report.
|
||
|
||
**Files:**
|
||
- Create: `scripts/switch-dev-vm.sh` (new)
|
||
- Create (already being built): `/tmp/dryrun-report.md`
|
||
|
||
- [ ] **Step 7.1: Write `scripts/switch-dev-vm.sh`**
|
||
|
||
Create file at `scripts/switch-dev-vm.sh`:
|
||
|
||
```bash
|
||
#!/usr/bin/env bash
|
||
# switch-dev-vm.sh — point the shared hackathon dev VM at the caller's branch image.
|
||
#
|
||
# Usage:
|
||
# scripts/switch-dev-vm.sh <branch-slug>
|
||
# scripts/switch-dev-vm.sh hack-zs-metrics
|
||
#
|
||
# Prerequisite: your branch has been pushed and the release.yml workflow has completed,
|
||
# producing ghcr.io/keboola/agnes-the-ai-analyst:dev-<slug>.
|
||
#
|
||
# The slug is derived from your branch name by stripping the leading "feature/" and
|
||
# replacing non-alphanumeric chars with "-". For branch "feature/hack-zs-metrics" the slug
|
||
# is "hack-zs-metrics".
|
||
set -euo pipefail
|
||
|
||
if [ $# -ne 1 ]; then
|
||
echo "Usage: $0 <branch-slug>" >&2
|
||
echo "Example: $0 hack-zs-metrics" >&2
|
||
exit 2
|
||
fi
|
||
|
||
SLUG="$1"
|
||
VM="agnes-dev"
|
||
ZONE="europe-west1-b"
|
||
TAG="dev-$SLUG"
|
||
IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$TAG"
|
||
|
||
echo "[1/4] Verifying $IMAGE exists on GHCR..."
|
||
docker manifest inspect "$IMAGE" > /dev/null || {
|
||
echo "ERROR: $IMAGE not found on GHCR. Did your release.yml run finish?" >&2
|
||
echo "Check: gh run list --branch feature/$SLUG --workflow release.yml" >&2
|
||
exit 1
|
||
}
|
||
|
||
echo "[2/4] Updating AGNES_TAG on $VM to $TAG..."
|
||
gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command "\
|
||
sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$TAG|' /data/.env && \
|
||
sudo grep -E '^AGNES_TAG=' /data/.env"
|
||
|
||
echo "[3/4] Triggering auto-upgrade..."
|
||
gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command \
|
||
"sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -10"
|
||
|
||
echo "[4/4] Waiting for app to become healthy..."
|
||
for i in $(seq 1 30); do
|
||
STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","down"))' 2>/dev/null || echo down)
|
||
echo " [$i/30] status=$STATUS"
|
||
if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
|
||
echo "OK — agnes-dev now running $TAG. Open http://<dev-vm-ip>:8000"
|
||
exit 0
|
||
fi
|
||
sleep 3
|
||
done
|
||
echo "ERROR: agnes-dev did not become healthy in 90s. SSH in and check: docker compose logs" >&2
|
||
exit 1
|
||
```
|
||
|
||
```bash
|
||
chmod +x scripts/switch-dev-vm.sh
|
||
bash -n scripts/switch-dev-vm.sh # syntax check
|
||
```
|
||
|
||
Expected: syntax-check passes, file executable.
|
||
|
||
- [ ] **Step 7.2: Commit the script on a fresh branch and open PR**
|
||
|
||
```bash
|
||
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
|
||
git checkout -b feature/hackathon-dryrun-deliverables
|
||
git add scripts/switch-dev-vm.sh
|
||
git commit -m "chore: add switch-dev-vm.sh helper for hackathon"
|
||
git push -u origin HEAD
|
||
gh pr create --title "chore: add switch-dev-vm.sh helper for hackathon" \
|
||
--body "Adds scripts/switch-dev-vm.sh. Produced by the 2026-04-21 hackathon dry-run. Reviewed by user before merge." \
|
||
--base main > /tmp/dryrun-baseline/deliverable-pr.txt
|
||
cat /tmp/dryrun-baseline/deliverable-pr.txt
|
||
```
|
||
|
||
Expected: PR URL. **Do not merge** — leave for user review.
|
||
|
||
- [ ] **Step 7.3: Finalise report with overall verdict**
|
||
|
||
Determine overall verdict by inspecting each Task's PASS/FAIL line in `/tmp/dryrun-report.md`. Overall is PASS only if all tasks PASS (SKIPPED Task 4 is acceptable — note it).
|
||
|
||
Append to report:
|
||
|
||
```bash
|
||
cat >> /tmp/dryrun-report.md <<EOF
|
||
---
|
||
|
||
## Overall Verdict
|
||
|
||
<PASS | PASS WITH GAPS | FAIL>
|
||
|
||
## Recommendations for the User Before Hackathon Starts
|
||
|
||
1. <If protection-note said NONE/PARTIAL:> Configure required status check 'test' on main branch of keboola/agnes-the-ai-analyst.
|
||
2. Pin prod image_tag in agnes-infra-keboola/terraform/terraform.tfvars from "stable" to "stable-2026.04.XX" (current running version). Revert after hackathon.
|
||
3. Rotate admin password '1234' on prod (<prod-vm-ip>:8000/login) and dev (<dev-vm-ip>:8000/login).
|
||
4. Wire notification_channel_ids in tfvars so uptime alerts actually notify someone.
|
||
5. Share the hackathon 1-pager + switch-dev-vm.sh via the team Slack channel.
|
||
6. Review PR $(cat /tmp/dryrun-baseline/deliverable-pr.txt) and merge if switch-dev-vm.sh looks good.
|
||
|
||
## Artefacts
|
||
|
||
- Full report: /tmp/dryrun-report.md (this file)
|
||
- Baseline snapshots: /tmp/dryrun-baseline/*.{json,txt}
|
||
- TF plan output: /tmp/dryrun-tfplan.txt (if Task 4 ran)
|
||
- Deliverable PR: $(cat /tmp/dryrun-baseline/deliverable-pr.txt)
|
||
|
||
EOF
|
||
cat /tmp/dryrun-report.md
|
||
```
|
||
|
||
Expected: full report printed.
|
||
|
||
- [ ] **Step 7.4: Print final summary to chat**
|
||
|
||
Agent should output, in its final message to the user:
|
||
- Overall verdict (one line)
|
||
- Each task's result (one line each)
|
||
- Any unresolved anomalies
|
||
- Link to deliverable PR
|
||
- Path to full report
|
||
|
||
**Task 7 gate:** report complete, PR open, all artefacts listed.
|
||
|
||
---
|
||
|
||
## Abort / Rollback Procedures
|
||
|
||
If any task fails mid-execution, the agent must still perform Task 6 cleanup before reporting failure. Specifically:
|
||
|
||
- If Task 2 push succeeded but Task 3 failed → still run Task 6 Steps 6.1-6.4 to restore dev VM and delete the branch.
|
||
- If Task 5 PR was opened but workflow didn't finish → close the PR with `gh pr close --delete-branch` and log it.
|
||
- If Task 4 TF plan showed destroys → abort immediately, do NOT attempt apply, record in report, continue to Task 6.
|
||
|
||
If Task 6 itself fails (dev VM won't come back healthy on original tag), the agent must:
|
||
1. Print the baseline values (from `/tmp/dryrun-baseline/dev-env.txt`, `/tmp/dryrun-baseline/dev-image.txt`) so the user can manually SSH and fix.
|
||
2. Attempt `gcloud compute ssh agnes-dev --zone=europe-west1-b --command "docker compose -f /opt/agnes/docker-compose.yml logs --tail 100"` and include output in the report.
|
||
3. Mark overall verdict as FAIL and stop.
|
||
|
||
## What a Successful Run Looks Like
|
||
|
||
- Task 1 baseline: captured with prod+dev healthy/degraded
|
||
- Task 2: GHCR manifest exists for `:dev-hack-dryrun-<epoch>`
|
||
- Task 3: agnes-dev briefly running the per-branch image, healthy within 90s
|
||
- Task 4: `terraform plan` showed `1+ to add, 0 to destroy` (or SKIPPED)
|
||
- Task 5: CI `test` job reported FAILURE on the broken PR, PR closed
|
||
- Task 6: agnes-dev back on its baseline AGNES_TAG, healthy, branches gone
|
||
- Task 7: `scripts/switch-dev-vm.sh` committed on PR for user review, full report in `/tmp/dryrun-report.md`
|
||
- Final agent message: verdict + 6 bullet results + deliverable PR link
|
||
|
||
Duration: ~45-75 minutes, bounded primarily by CI workflow runs (~3-5 min each, two runs) and TF init (~30s-2min cold).
|