* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88) Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. Splitting them per concern. Moved (still in OSS, just under a vendor-neutral name): - scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh - scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh Removed (belongs in private consumer infra repos, not upstream OSS): - scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email) - scripts/grpn/README.md (GRPN hackathon deploy walkthrough) - docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log) Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers. This is the first wave of #88 — the remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR. Refs #88. * chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes) PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass — easier to review than two PRs. Wave-1 review-fix corrections: - Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile). 
- CHANGELOG bullet rewritten — original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only. - infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders. Wave-2 anonymization: - Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel. - Test fixtures (4 files): same set of replacements — 157 tests still pass. - .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs". - docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder. - 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*: hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-…. - scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern. Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries). CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected. Refs review of #94, closes #88. * fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG) Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed: 1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest. 
Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution. 2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix. 3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos). 4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated. 5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets. Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai| prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass. * fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo) Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes — the audit was right, the previous grep was not paranoid enough): 1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'. 2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh' in operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path. 3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. Trivial wording fix. 
Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/ Caddyfile/.toml/.template/.example/.env* with the full token set (groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61| kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding CHANGELOG.md historical entries. * fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md Devin Review caught two findings on the latest round-3 commit: 1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents. 2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename. * fix(oss): redact remaining hardcoded IPs from planning docs + remove default email Devin Review caught two more leaks: 1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set — safer than carrying any specific identity. 2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with <redacted-ip>, preserving private/loopback/IAP ranges.
Hackathon E2E Dry-Run Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
Goal: Validate the full developer→dev-VM→merge→prod flow end-to-end the day before a multi-developer hackathon, so any broken link is found and fixed before participants arrive.
Architecture: This is an operational dry-run, not a code feature. The executing agent pushes a throwaway feature branch to the public repo, verifies that CI produces a per-branch Docker image tag on GHCR, switches the shared agnes-dev VM onto that tag via the existing auto-upgrade cron, verifies that the CI test gate blocks a deliberately-broken PR from reaching :stable, and produces a helper script + report. The plan is strictly non-destructive for prod — prod-pinning (point 6 of the original outline) is explicitly out of scope and left to the user.
Tech Stack: Bash / gcloud / gh / git / docker / curl / Python (pytest) / Terraform (plan only, no apply). No app code changes.
Out of Scope (do NOT do)
- Any terraform apply against real infrastructure. TF plan is allowed; TF apply is forbidden.
- Pinning prod_instance.image_tag in agnes-infra-keboola. User will do this themselves after the dry-run succeeds.
- Rotating admin passwords, Keboola tokens, or JWT secrets.
- Modifying the main branch of any repo. All changes happen on throwaway branches, which are deleted at the end.
- Creating new GCP resources (VMs, disks, IPs, secrets, SAs).
If any step would require doing one of the above, STOP and ask the user.
Prerequisites
Before starting, the executing agent MUST verify all of the following. If any fails, abort and report which prerequisite is missing — do NOT try to fix it.
- Working directory is the tmp_oss checkout at /Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss. Current branch can be anything; the plan will create a new branch.

- gh auth status shows authenticated, with workflow scope. Run:

      gh auth status 2>&1 | grep -E "(Logged in|Token scopes)"

  Expected: a line containing Logged in to github.com and a line listing scopes that include workflow. If the workflow scope is missing, abort with message: Run: gh auth refresh -h github.com -s workflow.

- gcloud authenticated to project internal-prod. Run:

      gcloud config get-value project
      gcloud auth list --filter=status:ACTIVE --format="value(account)"

  Expected: project is internal-prod, at least one active account. If not, abort with message: Run: gcloud config set project internal-prod && gcloud auth login.

- SSH to agnes-dev works (OS Login). Run:

      gcloud compute ssh agnes-dev --zone=europe-west1-b --command="echo ok" --quiet

  Expected: output contains ok. First connection may take ~20s while OS Login provisions. If it fails with a permission error, abort with message: User needs compute.osLogin role on agnes-dev VM.

- docker CLI available locally (for docker manifest inspect). Run:

      docker --version

  Expected: version output. If missing, abort.

- Public GHCR pull works. Run:

      docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:stable > /dev/null && echo ok

  Expected: ok. If it fails, abort: something is wrong with public image visibility.

- Clone of agnes-infra-keboola exists or can be cloned at /tmp/agnes-infra-keboola. Run:

      if [ ! -d /tmp/agnes-infra-keboola ]; then
        gh repo clone keboola/agnes-infra-keboola /tmp/agnes-infra-keboola
      fi
      cd /tmp/agnes-infra-keboola && git status --short

  Expected: clone succeeds, git status is clean. If the clone fails, skip Task 4 (TF plan verification) and note it in the final report.
Gate: All 7 prerequisite checks pass, OR the agent has clearly reported which ones failed and reduced scope accordingly. Only then proceed to Task 1.
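The "report which prerequisite is missing, do not fix it" rule can be sketched as a small collector. This is only an illustration: the helper names check and report_prereqs are hypothetical and not part of the plan, and the commands in the trailing comments are the ones listed above.

```shell
# Hypothetical helpers (not defined anywhere in the plan).
FAILED=""

# check NAME CMD...: run CMD quietly and remember NAME if it fails.
check() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
    FAILED="$FAILED [$name]"
  fi
}

# report_prereqs: succeed only when no check failed; otherwise name the failures.
report_prereqs() {
  if [ -z "$FAILED" ]; then
    echo "All prerequisite checks passed"
    return 0
  fi
  echo "Missing prerequisites:$FAILED"
  return 1
}

# Example wiring, using commands from the prerequisite list:
#   check "docker CLI" docker --version
#   check "GHCR pull"  docker manifest inspect ghcr.io/keboola/agnes-the-ai-analyst:stable
#   report_prereqs || exit 1
```

Collecting every failure before reporting, rather than stopping at the first one, lets the user fix all missing prerequisites in one pass.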
Task 1: Baseline Snapshot
Purpose: Record the current state of both VMs and the TF outputs so the agent can detect drift at the end and prove it left everything as it found it.
Files:
- Create: /tmp/dryrun-baseline/prod-health.json
- Create: /tmp/dryrun-baseline/dev-health.json
- Create: /tmp/dryrun-baseline/prod-image.txt
- Create: /tmp/dryrun-baseline/dev-image.txt
- Create: /tmp/dryrun-baseline/dev-env.txt

- [ ] Step 1.1: Create baseline directory

      mkdir -p /tmp/dryrun-baseline

- [ ] Step 1.2: Capture prod health

      curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health.json
      cat /tmp/dryrun-baseline/prod-health.json | python3 -m json.tool

  Expected: JSON with "status" field equal to "healthy" or "degraded". If "unhealthy" or curl times out, abort with message: Prod is not in acceptable baseline state; investigate before dry-run.

- [ ] Step 1.3: Capture dev health

      curl -sf --max-time 10 http://<dev-vm-ip>:8000/api/health > /tmp/dryrun-baseline/dev-health.json
      cat /tmp/dryrun-baseline/dev-health.json | python3 -m json.tool

  Expected: JSON with "status" in {healthy, degraded}. Same abort condition as 1.2.

- [ ] Step 1.4: Capture current image tags on both VMs

      gcloud compute ssh agnes-prod --zone=europe-west1-b --quiet --command \
        "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
        > /tmp/dryrun-baseline/prod-image.txt
      gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
        "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'" \
        > /tmp/dryrun-baseline/dev-image.txt
      cat /tmp/dryrun-baseline/prod-image.txt /tmp/dryrun-baseline/dev-image.txt

  Expected: each file contains exactly one non-empty line like ghcr.io/keboola/agnes-the-ai-analyst:stable or :stable-2026.04.XX.

- [ ] Step 1.5: Capture agnes-dev .env AGNES_TAG line

      gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
        "sudo grep -E '^AGNES_TAG=' /data/.env || echo 'AGNES_TAG_NOT_SET'" \
        > /tmp/dryrun-baseline/dev-env.txt
      cat /tmp/dryrun-baseline/dev-env.txt

  Expected: output is AGNES_TAG=dev or similar. Record the exact value for restoration in Task 6. If AGNES_TAG_NOT_SET, abort: the VM is in an unknown config state.

- [ ] Step 1.6: Record baseline to report buffer

  Start the running report at /tmp/dryrun-report.md. This step creates the file (overwriting any report left over from an earlier run); later tasks append to it:

      cat > /tmp/dryrun-report.md <<EOF
      # Hackathon Dry-Run Report

      **Run at:** $(date -u +"%Y-%m-%dT%H:%M:%SZ")

      ## Baseline (Task 1)

      - Prod health status: $(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
      - Dev health status: $(jq -r '.status' /tmp/dryrun-baseline/dev-health.json)
      - Prod image: $(cat /tmp/dryrun-baseline/prod-image.txt)
      - Dev image: $(cat /tmp/dryrun-baseline/dev-image.txt)
      - Dev AGNES_TAG: $(cat /tmp/dryrun-baseline/dev-env.txt)
      EOF
      cat /tmp/dryrun-report.md

  Expected: report file exists, all fields populated (no empty values).
Task 1 gate: baseline directory has 5 non-empty files, report has 5 non-empty bullet lines. Proceed.
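The "5 non-empty files" half of this gate can be checked mechanically. A minimal sketch, assuming the five file names from the Files list above; the function name baseline_missing is hypothetical.

```shell
# baseline_missing DIR: print the baseline files that are absent or empty.
# Prints nothing when the gate's five-file condition holds.
baseline_missing() {
  dir="$1"
  missing=""
  for f in prod-health.json dev-health.json prod-image.txt dev-image.txt dev-env.txt; do
    # -s is true only for files that exist and have a size greater than zero.
    [ -s "$dir/$f" ] || missing="$missing $f"
  done
  echo "${missing# }"
}

# Usage (hypothetical wiring):
#   MISSING=$(baseline_missing /tmp/dryrun-baseline)
#   [ -z "$MISSING" ] || { echo "Baseline incomplete: $MISSING"; exit 1; }
```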
Task 2: Verify Per-Branch GHCR Build
Purpose: Push a throwaway feature branch to the public repo, wait for the release workflow, and confirm that the per-branch :dev-<slug> tag appears on GHCR.
Files:
- Create (throwaway): branch feature/hack-dryrun-<timestamp> in tmp_oss + one trivial commit touching docs/QUICKSTART.md

Branch naming: the agent MUST use feature/hack-dryrun-<epoch> (e.g. feature/hack-dryrun-1745254321) so the slug is unique per run and cleanup is deterministic.

- [ ] Step 2.1: Compute branch name and expected slug

  Per .github/workflows/release.yml:92-98 logic: strip feature/ prefix, sanitise [^a-zA-Z0-9-] to -, lowercase, cut 50 chars.

      EPOCH=$(date +%s)
      BRANCH="feature/hack-dryrun-${EPOCH}"
      SLUG=$(echo "$BRANCH" | sed 's|^feature/||' | sed 's|[^a-zA-Z0-9-]|-|g' | tr '[:upper:]' '[:lower:]' | cut -c1-50)
      echo "BRANCH=$BRANCH"
      echo "SLUG=$SLUG"
      echo "EXPECTED_TAG=ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
      # Persist for later steps
      echo "$BRANCH" > /tmp/dryrun-baseline/branch-name.txt
      echo "$SLUG" > /tmp/dryrun-baseline/slug.txt

  Expected: BRANCH like feature/hack-dryrun-1745254321, SLUG like hack-dryrun-1745254321. Persisted.

- [ ] Step 2.2: Create branch with trivial commit

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      # Save current branch so we can return
      git rev-parse --abbrev-ref HEAD > /tmp/dryrun-baseline/starting-branch.txt
      BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
      git checkout -b "$BRANCH"
      echo "<!-- dryrun $(date -u +%FT%TZ) -->" >> docs/QUICKSTART.md
      git add docs/QUICKSTART.md
      git commit -m "dryrun: verify per-branch GHCR tag"
      git push -u origin "$BRANCH"

  Expected: branch created, one commit, push succeeds with upstream tracking. If the push is rejected (e.g. protection), abort.

- [ ] Step 2.3: Wait for release workflow to complete

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt)
      # Get the most recent run id for this branch + workflow
      sleep 10  # give GH a moment to register the run
      RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
      echo "Watching run $RUN_ID"
      gh run watch "$RUN_ID" --exit-status --interval 15
      echo "Workflow exit: $?"

  Expected: exit status 0 after ~3-5 min. If exit != 0, print the logs:

      gh run view "$RUN_ID" --log-failed | tail -100

  and abort with message: Release workflow failed for throwaway branch; investigate before hackathon.

- [ ] Step 2.4: Verify per-branch tag exists on GHCR

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      EXPECTED="ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG"
      docker manifest inspect "$EXPECTED" > /tmp/dryrun-baseline/ghcr-manifest.json
      DIGEST=$(jq -r '.config.digest // .manifests[0].digest' /tmp/dryrun-baseline/ghcr-manifest.json)
      echo "Tag exists: $EXPECTED"
      echo "Digest: $DIGEST"
      echo "$DIGEST" > /tmp/dryrun-baseline/expected-digest.txt

  Expected: docker manifest inspect returns JSON (exit 0), a non-empty digest is extracted. If the tag is missing, abort with message: release.yml did not produce :dev-<slug> tag; check build-and-push step logs.

- [ ] Step 2.5: Record Task 2 result

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      cat >> /tmp/dryrun-report.md <<EOF

      ## Task 2: Per-Branch GHCR Build — PASS

      - Branch: $(cat /tmp/dryrun-baseline/branch-name.txt)
      - Slug: $SLUG
      - Tag: ghcr.io/keboola/agnes-the-ai-analyst:dev-$SLUG
      - Digest: $(cat /tmp/dryrun-baseline/expected-digest.txt)
      EOF
Task 2 gate: :dev-<slug> manifest exists. Proceed.
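The slug rule from Step 2.1 can be wrapped in a function so its behaviour on other branch names is easy to try out. A minimal sketch: the function name branch_slug and the second example branch are hypothetical, and the pipeline simply mirrors the rule as quoted in Step 2.1 rather than the workflow file itself.

```shell
# branch_slug BRANCH: strip a leading "feature/", replace every character
# outside [a-zA-Z0-9-] with "-", lowercase, and keep the first 50 characters.
branch_slug() {
  printf '%s\n' "$1" \
    | sed 's|^feature/||' \
    | sed 's|[^a-zA-Z0-9-]|-|g' \
    | tr '[:upper:]' '[:lower:]' \
    | cut -c1-50
}

branch_slug "feature/hack-dryrun-1745254321"   # prints hack-dryrun-1745254321
branch_slug "feature/Hack_ZS.metrics"          # prints hack-zs-metrics
```

Only the feature/ prefix is stripped, so any other prefix keeps its slash as a dash: under this rule a branch named bugfix/foo would yield bugfix-foo.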
Task 3: Dev VM Switch Flow
Purpose: Simulate the hackathon developer path — have the shared agnes-dev VM pick up the per-branch image via the existing auto-upgrade cron, verify the new image is running, then (in Task 6) roll back.
Files touched (reversibly):
- /data/.env on agnes-dev VM: one-line AGNES_TAG= change (rollback is captured in baseline from Step 1.5)

- [ ] Step 3.1: Switch agnes-dev .env AGNES_TAG to the per-branch tag

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      NEW_TAG="dev-$SLUG"
      gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\
        sudo cp /data/.env /data/.env.dryrun-bak && \
        sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$NEW_TAG|' /data/.env && \
        sudo grep -E '^AGNES_TAG=' /data/.env"

  Expected: final line is AGNES_TAG=dev-<slug>. If sed didn't match (no AGNES_TAG= line existed), abort and manually investigate.

- [ ] Step 3.2: Trigger auto-upgrade cron script immediately

      gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
        "sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -30"

  Expected: output shows docker compose pull + docker compose up -d activity. If the script doesn't exist or errors, abort with message: auto-upgrade script missing or broken on agnes-dev.

- [ ] Step 3.3: Wait for app container to become healthy

      # Poll /api/health for up to 90s
      for i in $(seq 1 30); do
        STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo "down")
        echo "[$i/30] status=$STATUS"
        if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
          break
        fi
        sleep 3
      done
      [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] || { echo "FAIL: dev never healthy"; exit 1; }

  Expected: reaches healthy/degraded within 90s.

- [ ] Step 3.4: Verify the running image is the per-branch one

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      EXPECTED_DIGEST=$(cat /tmp/dryrun-baseline/expected-digest.txt)
      RUNNING_IMAGE=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
        "docker inspect \$(docker ps -qf name=app) --format '{{.Image}}'")
      echo "Running image digest: $RUNNING_IMAGE"
      # The running image line will be sha256:xxxxx. Compare to the manifest digest we recorded.
      # They should match (or differ only by multi-arch manifest indirection — compare via docker inspect on remote)
      gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \
        "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}' && \
         docker image inspect \$(docker ps -qf name=app --format '{{.Image}}' | head -1) --format '{{.RepoTags}}{{.RepoDigests}}'"

  Expected: RepoTags or RepoDigests output includes either :dev-$SLUG or the digest from Step 2.4. If neither matches, the cron didn't pull the new tag; record as FAIL and continue (cleanup is still required).

- [ ] Step 3.5: Record Task 3 result

  The agent must judge PASS/FAIL based on Step 3.4 output: PASS iff RepoTags or RepoDigests contained :dev-$SLUG or the digest captured in Step 2.4.

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      # Replace <RESULT> with PASS or FAIL based on the Step 3.4 output the agent observed.
      # Replace <IMAGE_OUTPUT> with the RepoTags/RepoDigests line from Step 3.4.
      # Replace <SECONDS> with the loop iteration count from Step 3.3 × 3.
      cat >> /tmp/dryrun-report.md <<EOF

      ## Task 3: Dev VM Switch — <RESULT>

      - Switched agnes-dev to AGNES_TAG=dev-$SLUG
      - Health after switch: reached healthy/degraded within 90s
      - Running image: <IMAGE_OUTPUT>
      - Time from cron trigger to healthy: <SECONDS>s
      EOF
Task 3 gate: health reached OK state; running image verified. Proceed even if image verification was inconclusive — rollback still required.
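The polling pattern from Step 3.3 can be separated from the HTTP call, which makes it reusable for Task 6 and testable without a VM. A minimal sketch under stated assumptions: the names is_ok_status, wait_for_ok and probe_dev, and the DEV_HOST variable, are hypothetical. Unlike the inline loop, this version also maps an empty probe result (for example when curl returns nothing) to "down".

```shell
# is_ok_status STATUS: the two states the plan accepts as OK.
is_ok_status() {
  [ "$1" = "healthy" ] || [ "$1" = "degraded" ]
}

# wait_for_ok ATTEMPTS DELAY CMD...: run CMD (which must print a status word)
# up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as the status is healthy or degraded, 1 otherwise.
wait_for_ok() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$("$@" 2>/dev/null) || status="down"
    [ -n "$status" ] || status="down"
    echo "[$i/$attempts] status=$status"
    if is_ok_status "$status"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Hypothetical probe for the dev VM (DEV_HOST is an assumption, set it yourself):
#   probe_dev() { curl -s --max-time 5 "http://$DEV_HOST:8000/api/health" | jq -r '.status'; }
#   wait_for_ok 30 3 probe_dev || { echo "FAIL: dev never healthy"; exit 1; }
```

With 30 attempts and a 3-second delay this reproduces the 90-second budget, and the attempt number printed on the last line times 3 gives the <SECONDS> value for the Task 3 report.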
Task 4: Terraform Plan Verification (Private Repo)
Purpose: Validate that adding a new entry to dev_instances produces a clean terraform plan (not apply) in agnes-infra-keboola. This proves the TF module accepts the variable shape the hackathon docs will recommend.
Skip condition: If prerequisites check found that /tmp/agnes-infra-keboola clone failed, skip this entire task and record SKIPPED — repo unavailable in the report.
Files touched (throwaway branch only):
- /tmp/agnes-infra-keboola/terraform/terraform.tfvars (throwaway edit)

- [ ] Step 4.1: Create throwaway branch in private repo

      cd /tmp/agnes-infra-keboola
      git checkout main
      git pull
      EPOCH=$(date +%s)
      BRANCH="dryrun-tfplan-${EPOCH}"
      echo "$BRANCH" > /tmp/dryrun-baseline/tf-branch.txt
      git checkout -b "$BRANCH"

  Expected: clean checkout of main, new branch created.

- [ ] Step 4.2: Add throwaway dev_instance entry

  Read terraform/terraform.tfvars first to understand the current dev_instances shape. Then append a new entry.

  The dev_instances variable schema (from infra/modules/customer-instance/variables.tf:41-49) is:

      list(object({
        name         = string
        machine_type = optional(string, "e2-small")
        image_tag    = optional(string, "dev")
      }))

  Modify the dev_instances list to append:

      { name = "agnes-hack-dryrun", image_tag = "dev-<slug-from-task-2>" }

  The agent should detect the current tfvars format and insert accordingly. If the file does not already contain dev_instances, abort and report format-mismatch.

      SLUG=$(cat /tmp/dryrun-baseline/slug.txt)
      # Show current tfvars for context
      cat /tmp/agnes-infra-keboola/terraform/terraform.tfvars | grep -A 20 "dev_instances"
      # Agent must edit the file to add the new entry — use the Edit tool rather than sed to be safe.

  After editing, show the diff:

      cd /tmp/agnes-infra-keboola
      git diff terraform/terraform.tfvars

  Expected: diff adds exactly one new entry to the dev_instances list with name = "agnes-hack-dryrun" and image_tag = "dev-<slug>".

- [ ] Step 4.3: Run terraform plan locally (no apply)

      cd /tmp/agnes-infra-keboola/terraform
      export GOOGLE_APPLICATION_CREDENTIALS="$HOME/.agnes-keys/agnes-deploy-internal-prod-key.json"
      [ -f "$GOOGLE_APPLICATION_CREDENTIALS" ] || { echo "SA key not found — skipping plan"; exit 2; }
      terraform init -input=false -upgrade=false
      terraform plan -input=false -no-color -out=/tmp/dryrun-tfplan.bin > /tmp/dryrun-tfplan.txt 2>&1
      RC=$?
      echo "terraform plan exit: $RC"
      tail -40 /tmp/dryrun-tfplan.txt

  Expected:
  - exit 0. Without -detailed-exitcode, terraform plan exits 0 on success whether or not it found changes, and 1 on error, so the exit code alone does not prove that changes were detected; rely on the summary line below.
  - output ends with Plan: N to add, M to change, K to destroy. where N >= 1 (at least the new VM + disk + IP) and K == 0 (we must NOT be destroying anything)

  If K > 0 or terraform plan errors, abort and DO NOT proceed to Step 4.4. Report the plan output verbatim in the final report.

- [ ] Step 4.4: Discard throwaway branch (no push, no apply)

      cd /tmp/agnes-infra-keboola
      git checkout main
      BRANCH=$(cat /tmp/dryrun-baseline/tf-branch.txt)
      git branch -D "$BRANCH"
      # Branch was never pushed, so nothing to clean up remotely.

  Expected: branch deleted locally, main is current, working tree clean.

- [ ] Step 4.5: Record Task 4 result

      ADDS=$(grep -E "Plan:" /tmp/dryrun-tfplan.txt | head -1)
      # -q keeps the matched line out of the variable; the leading space stops "10 to destroy" from matching.
      DESTROYS_OK=$(grep -qE "Plan:.* 0 to destroy" /tmp/dryrun-tfplan.txt && echo yes || echo no)
      cat >> /tmp/dryrun-report.md <<EOF

      ## Task 4: TF Plan for New Dev VM — <PASS|SKIPPED|FAIL>

      - Plan summary: $ADDS
      - Zero destroys: $DESTROYS_OK
      - Full plan output: see /tmp/dryrun-tfplan.txt
      EOF
Task 4 gate: plan produced with 0 destroys and ≥1 add. Proceed.
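The gate condition (at least one add, zero destroys) can be evaluated numerically from the Plan summary line, which avoids the pitfall of a loose substring match on "0 to destroy" also accepting "10 to destroy". A minimal sketch, assuming the summary format quoted in Step 4.3 and plan output captured with -no-color; the function name plan_gate is hypothetical.

```shell
# plan_gate FILE: succeed only if FILE contains a Plan summary with
# at least one resource to add and exactly zero to destroy.
plan_gate() {
  plan_file="$1"
  line=$(grep -E 'Plan: .*[0-9]+ to add, [0-9]+ to change, [0-9]+ to destroy' "$plan_file" | head -1) || true
  if [ -z "$line" ]; then
    echo "FAIL: no Plan summary line found"
    return 1
  fi
  # Extract the whole number in front of each keyword.
  adds=$(printf '%s\n' "$line" | sed -E 's/.*[^0-9]([0-9]+) to add.*/\1/')
  destroys=$(printf '%s\n' "$line" | sed -E 's/.*[^0-9]([0-9]+) to destroy.*/\1/')
  if [ "$adds" -ge 1 ] && [ "$destroys" -eq 0 ]; then
    echo "PASS: $adds to add, $destroys to destroy"
    return 0
  fi
  echo "FAIL: $adds to add, $destroys to destroy"
  return 1
}

# Usage (hypothetical wiring): plan_gate /tmp/dryrun-tfplan.txt
```

A plan with no changes prints no Plan summary line at all, so the function reports that case as a failure too, which matches the gate's requirement of at least one add.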
Task 5: Verify Smoke-Test Gate Blocks Broken PR
Purpose: Confirm that a pull request with a deliberately-failing test does NOT produce a passing CI — which is the safety net that keeps :stable from auto-promoting broken images to prod.
Files touched (throwaway branch only):
- tests/test_dryrun_should_fail.py (new file on throwaway branch)

Important: This task creates a PR (not a merge). The PR is closed without merging in Step 5.5.

- [ ] Step 5.1: Create throwaway branch with failing test

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      git checkout main
      git pull
      EPOCH=$(date +%s)
      BRANCH="dryrun-break-smoke-${EPOCH}"
      echo "$BRANCH" > /tmp/dryrun-baseline/smoke-branch.txt
      git checkout -b "$BRANCH"
      cat > tests/test_dryrun_should_fail.py <<'PYEOF'
      def test_intentional_fail_for_dryrun():
          """Intentional failure to verify CI gate blocks broken PRs. Remove after dryrun."""
          assert False, "dryrun: this test is supposed to fail"
      PYEOF
      git add tests/test_dryrun_should_fail.py
      git commit -m "dryrun: intentional failing test (will be reverted)"
      git push -u origin "$BRANCH"

  Expected: push succeeds.

- [ ] Step 5.2: Open PR

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      PR_URL=$(gh pr create --title "dryrun: verify CI gate (DO NOT MERGE)" \
        --body "Intentionally failing test to verify CI blocks bad merges. Will be closed immediately after CI result." \
        --base main)
      echo "$PR_URL" > /tmp/dryrun-baseline/pr-url.txt
      echo "Opened: $PR_URL"

  Expected: PR URL returned.

- [ ] Step 5.3: Wait for CI test job to complete (expected: FAIL)

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
      sleep 15
      RUN_ID=$(gh run list --branch "$BRANCH" --workflow release.yml --limit 1 --json databaseId --jq '.[0].databaseId')
      echo "Watching run $RUN_ID (expected to FAIL)"
      # Use --exit-status WITHOUT `set -e`; we expect non-zero
      set +e
      gh run watch "$RUN_ID" --exit-status --interval 15
      EXIT=$?
      set -e
      echo "Exit code: $EXIT (non-zero is EXPECTED here)"

  Expected: exit code != 0. If the exit code IS 0, CI passed despite assert False, which means the test suite is not being run or the file was excluded; record as FAIL (CI gate broken).

- [ ] Step 5.4: Verify PR mergeability check shows failure

      PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
      PR_NUM=$(basename "$PR_URL")
      STATE=$(gh pr view "$PR_NUM" --json statusCheckRollup --jq '.statusCheckRollup[] | select(.name=="test") | .conclusion')
      echo "test job conclusion: $STATE"

  Expected: FAILURE. If SUCCESS, the gate is broken.

- [ ] Step 5.5: Close PR and delete branch

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      PR_URL=$(cat /tmp/dryrun-baseline/pr-url.txt)
      PR_NUM=$(basename "$PR_URL")
      gh pr close "$PR_NUM" --delete-branch --comment "dryrun complete — CI gate verified, closing without merge"
      # Also delete locally
      git checkout main
      BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt)
      git branch -D "$BRANCH" 2>/dev/null || true

  Expected: PR closed, local branch gone.

- [ ] Step 5.6: Check whether main has required status checks configured

      gh api repos/keboola/agnes-the-ai-analyst/branches/main/protection 2>/tmp/dryrun-protection-err.txt > /tmp/dryrun-protection.json
      RC=$?
      if [ $RC -ne 0 ]; then
        echo "No branch protection on main (or insufficient permissions to read it)"
        cat /tmp/dryrun-protection-err.txt
        PROTECTION_NOTE="NONE — branch is unprotected; broken PRs can be merged. Recommend adding 'test' as required status check."
      else
        REQUIRED=$(jq -r '.required_status_checks.contexts[]?' /tmp/dryrun-protection.json 2>/dev/null | tr '\n' ',')
        echo "Required checks: $REQUIRED"
        if echo "$REQUIRED" | grep -q "test"; then
          PROTECTION_NOTE="OK — 'test' is required."
        else
          PROTECTION_NOTE="PARTIAL — protection exists but 'test' is not required. Contexts: $REQUIRED"
        fi
      fi
      echo "$PROTECTION_NOTE" > /tmp/dryrun-baseline/protection-note.txt

  Expected: note written. Does not abort; informational only.

- [ ] Step 5.7: Record Task 5 result

      cat >> /tmp/dryrun-report.md <<EOF

      ## Task 5: CI Gate — <PASS|FAIL>

      - Throwaway PR: $(cat /tmp/dryrun-baseline/pr-url.txt) (closed)
      - CI 'test' job result on broken code: <FAILURE expected>
      - Branch protection on main: $(cat /tmp/dryrun-baseline/protection-note.txt)
      EOF
Task 5 gate: broken PR's CI status is FAILURE. Proceed. If PROTECTION_NOTE says NONE/PARTIAL, the final report must flag this as a hackathon-blocking recommendation.
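The <PASS|FAIL> placeholder in the Task 5 report follows from two observations: the watch exit code from Step 5.3 and the test job conclusion from Step 5.4. A minimal sketch of that decision, plus the rule for flagging the protection note; the function names ci_gate_verdict and needs_blocker_flag are hypothetical.

```shell
# ci_gate_verdict WATCH_EXIT CONCLUSION
# PASS means the gate worked: the run exited non-zero AND the test job
# concluded FAILURE on the deliberately broken branch.
ci_gate_verdict() {
  watch_exit="$1"
  conclusion="$2"
  if [ "$watch_exit" -ne 0 ] && [ "$conclusion" = "FAILURE" ]; then
    echo "PASS"
  else
    echo "FAIL"
  fi
}

# needs_blocker_flag NOTE: true when the protection note from Step 5.6
# starts with NONE or PARTIAL, i.e. the final report must flag it.
needs_blocker_flag() {
  case "$1" in
    NONE*|PARTIAL*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (hypothetical wiring):
#   ci_gate_verdict "$EXIT" "$STATE"
#   needs_blocker_flag "$(cat /tmp/dryrun-baseline/protection-note.txt)" && echo "Flag in final report"
```

Requiring both signals guards against a run that exits non-zero for an unrelated reason (for example a failed build step) while the test job itself never ran.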
Task 6: Cleanup and Baseline Restoration
Purpose: Leave the system in exactly the state recorded in Task 1. This is the most important task — a dirty dry-run poisons the hackathon.
-
Step 6.1: Restore
agnes-devAGNES_TAGORIG_LINE=$(cat /tmp/dryrun-baseline/dev-env.txt) # ORIG_LINE looks like: AGNES_TAG=dev ORIG_VALUE=$(echo "$ORIG_LINE" | cut -d= -f2-) gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command "\ sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$ORIG_VALUE|' /data/.env && \ sudo rm -f /data/.env.dryrun-bak && \ sudo grep -E '^AGNES_TAG=' /data/.env && \ sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -20"Expected: AGNES_TAG line matches original, auto-upgrade pulls back to the original tag.
-
Step 6.2: Wait for dev VM to return to healthy state on original tag
for i in $(seq 1 30); do STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status' 2>/dev/null || echo down) echo "[$i/30] status=$STATUS" [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ] && break sleep 3 doneExpected: reaches healthy/degraded within 90s.
-
Step 6.3: Verify running image matches baseline
RESTORED=$(gcloud compute ssh agnes-dev --zone=europe-west1-b --quiet --command \ "docker inspect \$(docker ps -qf name=app) --format '{{.Config.Image}}'") ORIG=$(cat /tmp/dryrun-baseline/dev-image.txt) echo "Restored: $RESTORED" echo "Original: $ORIG" [ "$RESTORED" = "$ORIG" ] && echo MATCH || echo "MISMATCH — investigate"Expected: MATCH. If MISMATCH, the baseline-tag digest may have advanced (auto-upgrade pulled newer
:stable/:devfloating image during the run) — that is acceptable as long as the.Config.Imagetag matches. Record exact difference in report. -
Step 6.4: Delete throwaway branches in public repo
cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss" STARTING=$(cat /tmp/dryrun-baseline/starting-branch.txt) git checkout "$STARTING" FEAT_BRANCH=$(cat /tmp/dryrun-baseline/branch-name.txt) SMOKE_BRANCH=$(cat /tmp/dryrun-baseline/smoke-branch.txt 2>/dev/null || echo "") # Local delete git branch -D "$FEAT_BRANCH" 2>/dev/null || true [ -n "$SMOKE_BRANCH" ] && git branch -D "$SMOKE_BRANCH" 2>/dev/null || true # Remote delete (smoke branch was already deleted via `gh pr close --delete-branch` in Step 5.5) git push origin --delete "$FEAT_BRANCH" 2>/dev/null || echo "(feature branch already gone)"Expected: local branches gone, remote feature branch deleted. QUICKSTART.md commit on throwaway branch vanishes from origin.
- Step 6.5: Final health check on prod (must match baseline)

      curl -sf --max-time 10 http://<prod-vm-ip>:8000/api/health > /tmp/dryrun-baseline/prod-health-after.json
      BEFORE=$(jq -r '.status' /tmp/dryrun-baseline/prod-health.json)
      AFTER=$(jq -r '.status' /tmp/dryrun-baseline/prod-health-after.json)
      echo "Prod status before: $BEFORE / after: $AFTER"
      [ "$BEFORE" = "$AFTER" ] && echo UNCHANGED || echo DRIFT

  Expected: UNCHANGED. (Note: prod was never touched, so this is a sanity check only.)
- Step 6.6: Record Task 6 result

      cat >> /tmp/dryrun-report.md <<EOF

      ## Task 6: Cleanup — <PASS|FAIL>
      - agnes-dev AGNES_TAG restored to: $(cat /tmp/dryrun-baseline/dev-env.txt)
      - agnes-dev health after restore: $(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | jq -r '.status')
      - agnes-dev image: matches baseline? <MATCH|MISMATCH — paste both>
      - Throwaway branches deleted: feature, smoke
      - Prod status unchanged: <UNCHANGED|DRIFT>
      EOF

Task 6 gate: dev VM back on its baseline tag, branches gone, prod untouched.
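Step 6.3 accepts a mismatch when only the digest moved and the tag is unchanged. As a rough sketch of that comparison, assuming image references of the usual registry/name:tag form with an optional @sha256 suffix, the helpers below (both hypothetical names, not part of the plan) strip the digest before comparing and extract the tag on its own.

```shell
# Hypothetical helpers for comparing image references by tag rather than digest.

# Print the tag of an image reference; "latest" is Docker's default when none is given.
image_tag() {
  local ref last
  ref="${1%%@*}"        # drop any @sha256:... digest suffix
  last="${ref##*/}"     # last path component, so a registry port is not mistaken for a tag
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf 'latest\n' ;;
  esac
}

# Succeed when two references name the same repository and tag, ignoring digests.
same_image_ignoring_digest() {
  [ "${1%%@*}" = "${2%%@*}" ]
}
```

For example, a baseline of ghcr.io/keboola/agnes-the-ai-analyst:dev and a restored reference that carries the same name and tag plus a digest would compare as equal under `same_image_ignoring_digest`, which is the case Step 6.3 describes as acceptable.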
Task 7: Generate Deliverables
Purpose: Produce the artefacts the user needs tomorrow: a helper script for the hackathon team and a consolidated report.
Files:
- Create: scripts/switch-dev-vm.sh (new)
- Create (already being built): /tmp/dryrun-report.md

- Step 7.1: Write scripts/switch-dev-vm.sh

  Create file at scripts/switch-dev-vm.sh:

      #!/usr/bin/env bash
      # switch-dev-vm.sh — point the shared hackathon dev VM at the caller's branch image.
      #
      # Usage:
      #   scripts/switch-dev-vm.sh <branch-slug>
      #   scripts/switch-dev-vm.sh hack-zs-metrics
      #
      # Prerequisite: your branch has been pushed and the release.yml workflow has completed,
      # producing ghcr.io/keboola/agnes-the-ai-analyst:dev-<slug>.
      #
      # The slug is derived from your branch name by stripping the leading "feature/" and
      # replacing non-alphanumeric chars with "-". For branch "feature/hack-zs-metrics" the slug
      # is "hack-zs-metrics".

      set -euo pipefail

      if [ $# -ne 1 ]; then
        echo "Usage: $0 <branch-slug>" >&2
        echo "Example: $0 hack-zs-metrics" >&2
        exit 2
      fi

      SLUG="$1"
      VM="agnes-dev"
      ZONE="europe-west1-b"
      TAG="dev-$SLUG"
      IMAGE="ghcr.io/keboola/agnes-the-ai-analyst:$TAG"

      echo "[1/4] Verifying $IMAGE exists on GHCR..."
      docker manifest inspect "$IMAGE" > /dev/null || {
        echo "ERROR: $IMAGE not found on GHCR. Did your release.yml run finish?" >&2
        echo "Check: gh run list --branch feature/$SLUG --workflow release.yml" >&2
        exit 1
      }

      echo "[2/4] Updating AGNES_TAG on $VM to $TAG..."
      gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command "\
        sudo sed -i 's|^AGNES_TAG=.*|AGNES_TAG=$TAG|' /data/.env && \
        sudo grep -E '^AGNES_TAG=' /data/.env"

      echo "[3/4] Triggering auto-upgrade..."
      gcloud compute ssh "$VM" --zone="$ZONE" --quiet --command \
        "sudo /usr/local/bin/agnes-auto-upgrade.sh 2>&1 | tail -10"

      echo "[4/4] Waiting for app to become healthy..."
      for i in $(seq 1 30); do
        STATUS=$(curl -s --max-time 5 http://<dev-vm-ip>:8000/api/health | python3 -c 'import sys,json; print(json.load(sys.stdin).get("status","down"))' 2>/dev/null || echo down)
        echo "  [$i/30] status=$STATUS"
        if [ "$STATUS" = "healthy" ] || [ "$STATUS" = "degraded" ]; then
          echo "OK — agnes-dev now running $TAG. Open http://<dev-vm-ip>:8000"
          exit 0
        fi
        sleep 3
      done

      echo "ERROR: agnes-dev did not become healthy in 90s. SSH in and check: docker compose logs" >&2
      exit 1

  Then make it executable and syntax-check it:

      chmod +x scripts/switch-dev-vm.sh
      bash -n scripts/switch-dev-vm.sh  # syntax check

  Expected: syntax-check passes, file executable.
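The script's header comment describes how a branch name maps to a slug, but the script itself expects the caller to supply the slug. A minimal sketch of that rule follows, assuming it is exactly "strip a leading feature/, then replace every non-alphanumeric character with a hyphen". `branch_to_slug` is a hypothetical name, and the release.yml workflow remains the authority on how the tag is actually derived.

```shell
# Hypothetical helper mirroring the slug rule described in the script header.
# Prints the slug for a branch name, e.g. feature/hack-zs-metrics -> hack-zs-metrics.
branch_to_slug() {
  # Strip a leading "feature/", then map every byte outside A-Z, a-z, 0-9 to "-".
  printf '%s' "${1#feature/}" | LC_ALL=C tr -c 'A-Za-z0-9' '-'
  echo
}
```

Under that assumption, a caller could pass the output of `branch_to_slug` for the current branch straight to scripts/switch-dev-vm.sh instead of typing the slug by hand.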
- Step 7.2: Commit the script on a fresh branch and open PR

      cd "/Users/zdeneksrotyr/Library/Mobile Documents/com~apple~CloudDocs/Sources/VsCode/component_factory/tmp_oss"
      git checkout -b feature/hackathon-dryrun-deliverables
      git add scripts/switch-dev-vm.sh
      git commit -m "chore: add switch-dev-vm.sh helper for hackathon"
      git push -u origin HEAD
      gh pr create --title "chore: add switch-dev-vm.sh helper for hackathon" \
        --body "Adds scripts/switch-dev-vm.sh. Produced by the 2026-04-21 hackathon dry-run. Reviewed by user before merge." \
        --base main > /tmp/dryrun-baseline/deliverable-pr.txt
      cat /tmp/dryrun-baseline/deliverable-pr.txt

  Expected: PR URL. Do not merge; leave for user review.
- Step 7.3: Finalise report with overall verdict

  Determine the overall verdict by inspecting each Task's PASS/FAIL line in /tmp/dryrun-report.md. Overall is PASS only if all tasks PASS (a SKIPPED Task 4 is acceptable; note it).

  Append to report:

      cat >> /tmp/dryrun-report.md <<EOF

      ---

      ## Overall Verdict

      <PASS | PASS WITH GAPS | FAIL>

      ## Recommendations for the User Before Hackathon Starts

      1. <If protection-note said NONE/PARTIAL:> Configure required status check 'test' on main branch of keboola/agnes-the-ai-analyst.
      2. Pin prod image_tag in agnes-infra-keboola/terraform/terraform.tfvars from "stable" to "stable-2026.04.XX" (current running version). Revert after hackathon.
      3. Rotate admin password '1234' on prod (<prod-vm-ip>:8000/login) and dev (<dev-vm-ip>:8000/login).
      4. Wire notification_channel_ids in tfvars so uptime alerts actually notify someone.
      5. Share the hackathon 1-pager + switch-dev-vm.sh via the team Slack channel.
      6. Review PR $(cat /tmp/dryrun-baseline/deliverable-pr.txt) and merge if switch-dev-vm.sh looks good.

      ## Artefacts

      - Full report: /tmp/dryrun-report.md (this file)
      - Baseline snapshots: /tmp/dryrun-baseline/*.{json,txt}
      - TF plan output: /tmp/dryrun-tfplan.txt (if Task 4 ran)
      - Deliverable PR: $(cat /tmp/dryrun-baseline/deliverable-pr.txt)
      EOF
      cat /tmp/dryrun-report.md

  Expected: full report printed.
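The verdict rule in Step 7.3 can be checked mechanically. The sketch below assumes each task heading in the report starts with "## Task " and ends with its result word once the placeholder has been filled in. That layout is an assumption based on the Task 6 heading, and `overall_verdict` is a hypothetical name. The plan does not define when PASS WITH GAPS applies, so the sketch only distinguishes PASS from FAIL and leaves that middle verdict to human judgement.

```shell
# Hypothetical helper: derive PASS or FAIL from the task headings of a report file.
# Assumes headings of the form "## Task N: <name> <separator> <RESULT>".
overall_verdict() {
  local report line result verdict seen
  report="$1"
  verdict=PASS
  seen=0
  while IFS= read -r line; do
    case "$line" in
      "## Task "*) ;;
      *) continue ;;
    esac
    seen=1
    result="${line##* }"   # last whitespace-separated word of the heading
    case "$result" in
      PASS|SKIPPED) ;;      # SKIPPED is tolerated, as the plan allows for Task 4
      *) verdict=FAIL ;;    # FAIL, or a placeholder that was never filled in
    esac
  done < "$report"
  # A report with no task headings at all is treated as a failure.
  [ "$seen" -eq 1 ] || verdict=FAIL
  echo "$verdict"
}
```

Treating an unfilled placeholder as FAIL is a deliberately conservative choice: a report that was never completed should not read as a pass.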
- Step 7.4: Print final summary to chat
Agent should output, in its final message to the user:
- Overall verdict (one line)
- Each task's result (one line each)
- Any unresolved anomalies
- Link to deliverable PR
- Path to full report
Task 7 gate: report complete, PR open, all artefacts listed.
Abort / Rollback Procedures
If any task fails mid-execution, the agent must still perform Task 6 cleanup before reporting failure. Specifically:
- If Task 2 push succeeded but Task 3 failed → still run Task 6 Steps 6.1-6.4 to restore dev VM and delete the branch.
- If Task 5 PR was opened but workflow didn't finish → close the PR with gh pr close --delete-branch and log it.
- If Task 4 TF plan showed destroys → abort immediately, do NOT attempt apply, record in report, continue to Task 6.
If Task 6 itself fails (dev VM won't come back healthy on original tag), the agent must:
- Print the baseline values (from /tmp/dryrun-baseline/dev-env.txt, /tmp/dryrun-baseline/dev-image.txt) so the user can manually SSH and fix.
- Attempt gcloud compute ssh agnes-dev --zone=europe-west1-b --command "docker compose -f /opt/agnes/docker-compose.yml logs --tail 100" and include the output in the report.
- Mark overall verdict as FAIL and stop.
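The rule that cleanup must run even when an earlier task fails can be expressed as a small wrapper. This is a sketch under stated assumptions rather than the plan's own mechanism: `run_with_cleanup` is a hypothetical name, each task is assumed to be a shell function that returns non-zero on failure, and the wrapper is a plain run-then-always-clean-up loop, not a signal trap.

```shell
# Hypothetical wrapper: run tasks in order, stop at the first failure,
# and always run the cleanup function before returning.
# Usage: run_with_cleanup <cleanup-fn> <task-fn>...
run_with_cleanup() {
  local cleanup task rc
  cleanup="$1"
  shift
  rc=0
  for task in "$@"; do
    if ! "$task"; then
      echo "task '$task' failed; skipping remaining tasks" >&2
      rc=1
      break
    fi
  done
  # Cleanup runs whether or not a task failed; a failed cleanup also fails the run.
  "$cleanup" || rc=1
  return "$rc"
}
```

A non-zero return here corresponds to an overall FAIL. If the cleanup function itself fails, the manual-recovery bullets above still apply, since the wrapper only reports the failure and does not attempt a repair.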
What a Successful Run Looks Like
- Task 1 baseline: captured with prod+dev healthy/degraded
- Task 2: GHCR manifest exists for :dev-hack-dryrun-<epoch>
- Task 3: agnes-dev briefly running the per-branch image, healthy within 90s
- Task 4: terraform plan showed 1+ to add, 0 to destroy (or SKIPPED)
- Task 5: CI test job reported FAILURE on the broken PR, PR closed
- Task 6: agnes-dev back on its baseline AGNES_TAG, healthy, branches gone
- Task 7: scripts/switch-dev-vm.sh committed on PR for user review, full report in /tmp/dryrun-report.md
- Final agent message: verdict + 6 bullet results + deliverable PR link
Duration: ~45-75 minutes, bounded primarily by CI workflow runs (~3-5 min each, two runs) and TF init (~30s-2min cold).