ci: consolidate release pipeline (salvageable subset of #139) (#314)

* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).

2026-05-15 14:06:59 +02:00

Releasing & deploying

The full release process for Agnes. CLAUDE.md carries the short version; this doc is the operational reference. Read it linearly the first few times — once internalized, the order matters less, but the non-obvious gotchas never go away.

Changelog discipline — non-negotiable

Every PR that adds, removes, or changes user-visible behavior MUST update CHANGELOG.md in the same PR. No exceptions, no follow-ups, no "I'll do it after merge". User-visible = anything an operator, end-user, or downstream integrator can observe: CLI flags / output / exit codes, REST endpoints / payloads / status codes, web UI, instance.yaml schema, env vars, extract.duckdb contract, Docker / compose / Caddyfile knobs, default behaviors, breaking changes, security fixes.

How:

Add a bullet under the topmost ## [Unreleased] heading (create one if missing — it sits above the latest released version).
Group by ### Added / ### Changed / ### Fixed / ### Removed / ### Internal (Keep-a-Changelog sections).
Mark breaking changes with **BREAKING** at the start of the bullet — operators grep for that string before bumping the pin.
Reference the relevant doc/runbook if one exists (e.g. see docs/auth-groups.md), don't restate it.
Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under ### Internal — still log them, just keep them terse.
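
Because operators grep for the **BREAKING** marker before bumping a pin, a small helper can scope that grep to one CHANGELOG section. The sketch below is illustrative only: the function name `breaking_bullets` is hypothetical, and it assumes bullets start with `- ` and version headings start with `## [`.

```shell
# breaking_bullets <changelog-file> <section>
# Prints the bullets marked **BREAKING** under one "## [<section>]" heading.
# <section> is e.g. "Unreleased" or a released version number.
breaking_bullets() {
  local file section
  file="$1"
  section="$2"
  awk -v want="## [$section]" '
    # A new version heading: remember whether it is the one we want.
    /^## \[/ { in_section = (index($0, want) == 1); next }
    # Inside the wanted section, keep only bullets that lead with the marker.
    in_section && /^- \*\*BREAKING\*\*/ { print }
  ' "$file"
}
```

For example, `breaking_bullets CHANGELOG.md Unreleased` would list only the breaking bullets that have not shipped yet.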

Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.

Release-cut belongs to the PR — non-negotiable

The version bump + CHANGELOG rename + new empty [Unreleased] are the LAST commit on the PR that earned the version. Never a standalone follow-up PR.

When a PR lands the only [Unreleased] content (or is the last in a queue of in-flight feature PRs), the release-cut MUST ship as part of the same merge. A standalone release-cut PR adds review overhead with no behavior change of its own and pollutes git log with bookkeeping commits separated from the work that earned them.

Mandatory checklist before approving / enabling auto-merge on ANY PR:

1. Stop. Will this PR land alone in [Unreleased] (no other in-flight PRs queued behind it)?
2. If yes, the release-cut is REQUIRED in the same PR before merge. BEFORE pushing the final commit:
   - Bump pyproject.toml to X.Y.Z
   - Rename ## [Unreleased] → ## [X.Y.Z] — YYYY-MM-DD, add a new empty ## [Unreleased] on top
   - Either squash these into the consolidation commit OR add as a separate release: X.Y.Z commit on the same branch

   THEN push, approve, enable auto-merge.
3. After auto-merge fires: tag vX.Y.Z against the merge commit + create a GitHub Release. Done: one PR, one merge, one release.

Failure mode to avoid: enabling auto-merge on the feature PR thinking "I'll add the release-cut after." Auto-merge fires faster than the second commit lands. The window closes; the only fix is a standalone release-cut PR — exactly what this rule prohibits.

Acceptable standalone release-cut (rare): only when [Unreleased] accumulated bullets from MULTIPLE already-merged PRs AND no further behavior-change PR is queued — i.e. the cut is the only outstanding work and there's no PR to attach it to.
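
The checklist above can be turned into a pre-merge sanity check. The following is a minimal sketch, not part of the repository: the function name `check_release_cut` is hypothetical, and it assumes pyproject.toml carries a line of the form `version = "X.Y.Z"` and that CHANGELOG bullets start with `- `.

```shell
# check_release_cut <pyproject.toml> <CHANGELOG.md>
# Succeeds only when the release-cut looks complete:
#   1. the topmost "## [" heading is "## [Unreleased]"
#   2. no bullets sit under it (the fresh [Unreleased] is empty)
#   3. the next "## [" heading carries the version from pyproject.toml
check_release_cut() {
  local pyproject changelog version
  pyproject="$1"
  changelog="$2"
  # Take the first version = "..." line and stop reading.
  version=$(sed -n '/^version *= *"/{s/^version *= *"\([^"]*\)".*/\1/p;q;}' "$pyproject")
  [ -n "$version" ] || { echo "no version found in $pyproject" >&2; return 1; }
  awk -v want="## [$version]" '
    /^## \[/ {
      n++
      if (n == 1 && index($0, "## [Unreleased]") != 1) bad = 1
      if (n == 2) { if (index($0, want) == 1) ok = 1; exit }
      next
    }
    n == 1 && /^- / { bad = 1 }   # leftover bullets: the cut has not happened
    END { if (bad || !ok) exit 1 }
  ' "$changelog"
}
```

Run as `check_release_cut pyproject.toml CHANGELOG.md` before enabling auto-merge; a non-zero exit means the release-cut commit is missing or incomplete.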

Release workflow — concrete recipe

Happy path (8 steps)

# 1. Branch from a fresh checkout. iCloud Drive worktrees randomly hang
#    on git operations — use a fresh shallow clone in /tmp instead.
cd /tmp && git clone --depth 50 --branch main \
  https://github.com/keboola/agnes-the-ai-analyst.git agnes-<topic>
cd agnes-<topic> && git checkout -b zs/<branch-name>

# 2. Make the change + tests. Run the AREA pytest while iterating
#    (e.g. `pytest tests/test_X.py -p no:xdist -q`).

# 3. Add a CHANGELOG bullet under [Unreleased].
#    Group: Added | Changed | Fixed | Removed | Internal
#    Mark BREAKING with **BREAKING** prefix.

# 4. Commit the change(s). Multiple logical commits OK; release-cut
#    will be a SEPARATE last commit (next step). DO NOT bundle the
#    release-cut into the same commit as the change: that pollutes
#    the SHA that auto-close keywords reference and makes it hard
#    to revert the change on its own.

# 5. Run the full pytest suite locally:
#    `pytest tests/ -p no:xdist -q` (or `-n auto` if xdist works).
#    Pre-existing fails (e.g. test_readers_in_pre_init_dir under
#    subprocess timeout) are OK to ignore; verify by reverting your
#    diff and reproducing on bare main.

# 6. Release-cut commit (LAST commit on the PR per the rule above):
#    - Bump pyproject.toml: version = "X.Y.Z"
#    - Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`
#    - Add a fresh empty `## [Unreleased]` line above
#    Commit message: `release: X.Y.Z — <one-line summary>`

# 7. Push branch + open PR + enable auto-merge SQUASH:
#    git push -u origin HEAD
#    gh pr create --repo keboola/agnes-the-ai-analyst \
#      --head <branch> --title "<...>" --body "<...>"
#    gh pr merge <N> --repo keboola/agnes-the-ai-analyst \
#      --squash --auto --delete-branch

# 8. After auto-merge fires (poll or `Monitor`):
#    git fetch origin --tags
#    git tag vX.Y.Z <merge-sha>
#    git push origin vX.Y.Z
#    gh release create vX.Y.Z --repo keboola/agnes-the-ai-analyst \
#      --title "vX.Y.Z — <...>" --notes "<copy-paste from CHANGELOG>"

Picking the next version

pyproject.toml's current version is the next-release target (post-cut from the previous release). Pre-1.0 we patch-bump for everything that doesn't break operator-facing APIs:

instance.yaml schema additions, new env vars, new endpoints → patch (e.g. 0.54.3 → 0.54.4)
New CLI subcommands, BREAKING removals, schema migrations → still patch within the current 0.5x cycle (no minor bumps cut today)
The CHANGELOG **BREAKING** marker is what operators grep for; the version number is secondary

Always check git tag -l "v0.X*" before naming — if v0.54.0 is already tagged, the next one is v0.54.1, even if pyproject.toml still says 0.54.0 from a stale post-cut commit (we've shipped that race before).
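
The tag check described above can be scripted. This is a sketch under stated assumptions: the function name `next_free_version` is hypothetical, versions are plain MAJOR.MINOR.PATCH, and tags are the version prefixed with `v`.

```shell
# next_free_version <target-version>
# Reads existing tags on stdin (one per line) and prints the first
# patch version at or above the target that is not tagged yet.
next_free_version() {
  local target tags prefix patch
  target="$1"
  tags=$(cat)
  prefix="${target%.*}"    # "0.54.0" -> "0.54"
  patch="${target##*.}"    # "0.54.0" -> "0"
  # Bump the patch number while an exact "vX.Y.Z" tag already exists.
  while printf '%s\n' "$tags" | grep -qxF -- "v$prefix.$patch"; do
    patch=$((patch + 1))
  done
  printf '%s.%s\n' "$prefix" "$patch"
}
```

A typical call would be `git tag -l "v0.54*" | next_free_version 0.54.0`, which prints 0.54.1 when v0.54.0 is already tagged even though pyproject.toml still says 0.54.0.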

Authoring expectations on the PR

Self-PRs (you're both author and reviewer): GitHub forbids self-approve. If branch protection requires N approving reviews (we don't today — required_approving_review_count = 0), you need someone else to approve. With our current 0-review setup, self-PRs can still merge automatically once required CI passes.
Other people's PRs you're taking over: dismiss any prior CHANGES_REQUESTED reviews (yours or someone else's) before auto-merge can fire. gh pr review <N> --approve --body "..." after pushing your fixes.
Devin Review: not a required check today; runs in parallel and posts a comment. Don't wait on it for merge unless the human reviewer explicitly asks.

CI quirks you WILL hit

gh pr checks glosses CANCELLED as fail. When you force-push (rebase, amend), GitHub auto-cancels the in-flight Release workflow run on the older SHA. Those cancelled jobs show up as "fail" in the PR's check summary and tab forever, even after newer runs succeed. Look at the conclusion column, not just the count. Rule of thumb: if the same check name appears with both pass and fail rows, the fail row is from an older auto-cancelled SHA. Verify with gh api repos/keboola/agnes-the-ai-analyst/commits/<sha>/check-runs — the raw API distinguishes cancelled from failure truthfully.
Branch protection's "strict" mode caches cancelled test as blocking even after newer test runs succeed. Symptom: mergeable_state: blocked despite all required checks green on the latest SHA. Fix: re-run the cancelled Release workflow run (gh run rerun <run-id>); once its test job lands as success, the block clears. We've hit this on PRs #273, #281, #285, #286.
Required checks (per branch protection): test + docker-build only. Other workflows (cli-wheel-clean-install, build-and-push, Release-pipeline, Devin Review) are advisory — green/red doesn't gate merge.
enforce_admins: true in branch protection means --admin flag on gh pr merge does NOT bypass. Don't try; just fix the underlying block.
lint-workflows.yml is advisory. Triggered on changes to .github/workflows/** or scripts/ops/**.sh. Runs actionlint on workflow YAMLs + shellcheck --severity=warning on freestanding ops scripts. The actionlint step has continue-on-error: true initially (pre-existing inventory has info-level findings); flip to fail-fast once the repo is actionlint-clean. The shellcheck step IS blocking at warning+ severity — info/style findings ride through, real bugs break CI.
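
To sidestep the cancelled-shown-as-fail confusion, the raw check-runs output for one SHA can be reduced to the two required checks. The sketch below is illustrative: the function name `required_checks_green` and the tab-separated input format are assumptions, and the `gh api` call in the comment is one possible way to produce that input.

```shell
# required_checks_green
# Reads "name<TAB>conclusion" rows for ONE commit on stdin and succeeds
# only when both required checks (test, docker-build) concluded "success".
# Advisory checks are ignored, whatever their conclusion.
#
# Possible producer (needs network access and gh authentication):
#   gh api "repos/keboola/agnes-the-ai-analyst/commits/<sha>/check-runs" \
#     --jq '.check_runs[] | [.name, .conclusion] | @tsv' | required_checks_green
required_checks_green() {
  awk -F '\t' '
    $1 == "test" || $1 == "docker-build" { seen[$1] = $2 }
    END {
      if (seen["test"] != "success" || seen["docker-build"] != "success") exit 1
    }
  '
}
```

Querying a single SHA keeps rows from older, auto-cancelled runs out of the picture, which is the distinction the PR check summary blurs.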

Recovery when something derails

Force-pushed and lost auto-merge? GitHub usually preserves auto-merge across force-pushes for the same PR; if it cleared, just re-run gh pr merge <N> --squash --auto --delete-branch.
Release-cut commit forgot to land? That's the failure mode the "Release-cut belongs to the PR" rule prevents. If it happens anyway: open a follow-on PR with ONLY the release-cut commit, ship it, and write up why in your post-mortem comment.
Wrong version number tagged? git tag -d vX.Y.Z && git push --delete origin vX.Y.Z then re-tag against the right SHA. Update the GitHub Release if you already created it.

Deploy workflows

Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.

`release.yml` — auto-build on every push

Runs on every push to every branch.

Push to main → :stable, :stable-YYYY.MM.N (CalVer).
Push to non-main <prefix>/<branch> → :dev, :dev-YYYY.MM.N, :dev-<branch-slug>, and (when prefix isn't a Git Flow convention) :dev-<prefix>-latest alias.
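
The tag scheme above can be sketched as a pure function. Several details here are assumptions and not confirmed by the workflow: the function name `image_tags`, the slug rule (lowercase, every non-alphanumeric character becomes `-`), and the list of Git Flow prefixes (feature, bugfix, hotfix, release, support). The CalVer value is passed in as given.

```shell
# image_tags <branch> <calver>
# Prints one image tag per line for a push to <branch>.
image_tags() {
  local branch calver prefix slug
  branch="$1"
  calver="$2"
  if [ "$branch" = main ]; then
    printf '%s\n' stable "stable-$calver"
    return 0
  fi
  # ASSUMED slug rule: lowercase, non-alphanumerics replaced by "-".
  slug=$(printf '%s' "$branch" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/-/g')
  printf '%s\n' dev "dev-$calver" "dev-$slug"
  case "$branch" in
    */*) prefix="${branch%%/*}" ;;
    *)   return 0 ;;                 # no prefix, so no per-prefix alias
  esac
  case "$prefix" in
    feature|bugfix|hotfix|release|support) ;;   # ASSUMED Git Flow prefixes: no alias
    *) printf '%s\n' "dev-$prefix-latest" ;;
  esac
}
```

With these assumptions, `image_tags zs/my-fix 2026.05.7` yields dev, dev-2026.05.7, dev-zs-my-fix, and dev-zs-latest.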

VMs that pin to a floating tag (:dev, :dev-<prefix>-latest) auto-upgrade within ~5 min via the cron in agnes-auto-upgrade.sh. Convenient for per-developer dev VMs; footgun for shared dev VMs (last pusher wins, regardless of who).

Auto-rollback on smoke failure. On main pushes, after :stable is published, the smoke-test job pulls the just-built image and runs scripts/ops/post-deploy-smoke-test.sh inside a docker-compose stack. If that job fails, the rollback-on-smoke-fail job calls the reusable rollback.yml workflow (see below) which re-points :stable to the previous known-good build, marks the failed image as :deprecated-*, and opens a tracking issue labeled bug.

`rollback.yml` — reusable + manual rollback

Two entry points:

workflow_call from release.yml's rollback-on-smoke-fail job (auto-rollback path above).
workflow_dispatch for manual operator rollback when something breaks post-deploy that the auto smoke-test missed.

Manual rollback — flip :stable back to a previous good build:

gh workflow run rollback.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f failed_image_tag=stable-YYYY.MM.N

By default target_image_tag resolves by walking back through stable-* git tags newest-first and picking the first that does NOT already carry a :deprecated-<stripped> GHCR alias (i.e. wasn't previously auto-rolled-back). That prevents cascading failures from re-pointing :stable at a known-broken image. To force a specific target:

gh workflow run rollback.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f failed_image_tag=stable-2026.05.531 \
  -f target_image_tag=stable-2026.04.474

Notes:

The workflow does NOT delete the failed git tag (CalVer immutability is preserved) — only the GHCR :stable alias is re-pointed and the failed image gains a :deprecated-* audit alias.
Re-tag order is :stable recovery first, then :deprecated-* audit, so a mid-step interruption leaves production healthy with at-worst missing audit metadata.
Concurrency: cancel-in-progress: false (overrides the caller workflow's cancellation policy) so a subsequent push to main won't kill a rollback mid-flight.
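
The walk-back resolution described above can be sketched as follows. This is not the workflow's actual code: both function names are hypothetical, `<stripped>` is assumed to mean the tag without its `stable-` prefix, and the alias probe is stubbed with a `DEPRECATED` variable so the example runs offline.

```shell
# Hypothetical probe: succeeds when :deprecated-<stripped> exists.
# A real probe might run something like:
#   docker manifest inspect "ghcr.io/<owner>/<image>:deprecated-$1" >/dev/null 2>&1
# Here it consults a space-separated DEPRECATED variable instead.
has_deprecated_alias() {
  case " ${DEPRECATED:-} " in
    *" $1 "*) return 0 ;;
    *)        return 1 ;;
  esac
}

# resolve_rollback_target <failed-tag>
# Reads stable-* tags on stdin, newest first, and prints the first one
# that is neither the failed tag nor already marked deprecated.
resolve_rollback_target() {
  local failed tag
  failed="$1"
  while IFS= read -r tag; do
    [ -n "$tag" ] || continue
    [ "$tag" = "$failed" ] && continue          # never roll back onto the failed image
    if has_deprecated_alias "${tag#stable-}"; then
      echo "skip $tag: already deprecated" >&2
      continue
    fi
    printf '%s\n' "$tag"
    return 0
  done
  echo "no rollback target found" >&2           # fail loudly instead of re-pushing
  return 1
}
```

One way to feed it is `git tag -l 'stable-*' --sort=-v:refname | resolve_rollback_target stable-2026.05.531`.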

`keboola-deploy.yml` — tag-triggered, explicit deploy only

Runs only on git tags matching keboola-deploy-*. Publishes:

:keboola-deploy-<git-tag-suffix> — immutable, tied to the exact commit
:keboola-deploy-latest — floating alias the consumer pins to

Operator workflow:

git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM

Use this when the consumer (e.g. a customer dev VM) needs deploy-when-I-decide semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins image_tag = "keboola-deploy-latest" on the relevant VM.

`prune-dev-tags.yml` — weekly CalVer + GHCR housekeeping

Cron 0 4 * * 0 (Sundays 04:00 UTC) + workflow_dispatch. Prunes legacy CalVer git tags (dev-YYYY.MM.N, stable-YYYY.MM.N) and the matching GHCR image versions older than KEEP_MONTHS (default 1 → keep current + previous month). Floating aliases (:stable, :dev, *-latest) are never matched: they are git-tagless, and the GHCR pass explicitly skips any version that shares a manifest with a floating alias to avoid collateral deletion of :stable after a rollback re-tag.
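
The retention rule can be made concrete with a little month arithmetic. The sketch below is an illustration, not the script's real logic: the function name `is_prunable` is hypothetical, and it assumes a tag is prunable when its YYYY.MM is more than KEEP_MONTHS whole months behind the current month.

```shell
# is_prunable <tag> <keep-months> <now as YYYY.MM>
# Succeeds when <tag> is a CalVer tag older than the retention window.
is_prunable() {
  local tag keep now ym rest ty tm ny nm age
  tag="$1"
  keep="$2"
  now="$3"
  case "$tag" in
    dev-[0-9][0-9][0-9][0-9].[0-9][0-9].*|stable-[0-9][0-9][0-9][0-9].[0-9][0-9].*) ;;
    *) return 1 ;;          # floating aliases and anything else are never candidates
  esac
  ym="${tag#*-}"            # "dev-2026.03.12" -> "2026.03.12"
  ty="${ym%%.*}"            # "2026"
  rest="${ym#*.}"           # "03.12"
  tm="${rest%%.*}"          # "03"
  tm="${tm#0}"              # drop a leading zero so "08" is not read as octal
  ny="${now%%.*}"
  nm="${now#*.}"
  nm="${nm#0}"
  age=$(( (ny * 12 + nm) - (ty * 12 + tm) ))
  [ "$age" -gt "$keep" ]
}
```

With keep-months set to 1 and the current month 2026.05, tags from 2026.05 and 2026.04 are kept and anything from 2026.03 or earlier is prunable; `date -u +%Y.%m` is one way to supply the third argument.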

Manual preview (no deletions, lists what would be pruned):

gh workflow run prune-dev-tags.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f dry_run=true

Override the retention window for a one-off run (a larger keep_months keeps more history; a smaller one prunes more aggressively):

gh workflow run prune-dev-tags.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f keep_months=3

Scheduled (cron) runs always prune for real; dry_run is honored only on manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE failures and exits non-zero on any failure, so a refused remote push (tag-protection rule, missing scope) or a GHCR API error turns the cron run red instead of silently swallowing it. Local git tag -d is gated on successful remote push, so a refused delete leaves the local tag in place for retry on the next run.
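
The failure-propagation pattern described above looks roughly like this. It is a sketch of the pattern only: all three function names are hypothetical, and the two delete helpers are offline stand-ins for what a real script might do with `git push origin --delete "$tag"` and `git tag -d "$tag"`.

```shell
# Hypothetical stand-ins so the sketch runs without a remote.
# delete_remote_tag refuses exactly the tag named in REFUSED_TAG.
delete_remote_tag() { [ "$1" != "${REFUSED_TAG:-}" ]; }
# delete_local_tag records what it would have deleted locally.
delete_local_tag()  { DELETED_LOCAL="${DELETED_LOCAL:-} $1"; }

# prune_tags <tag>...
# Deletes each tag remotely first; the local delete is gated on that
# succeeding. Any refusal sets a flag that becomes the exit status.
prune_tags() {
  local tag failed
  failed=0
  for tag in "$@"; do
    if delete_remote_tag "$tag"; then
      delete_local_tag "$tag"
    else
      echo "remote delete refused for $tag; keeping local tag for retry" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

The loop keeps going after a refusal so the remaining tags are still processed, and the non-zero return at the end is what turns the run red.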

Module versioning

The customer-instance Terraform module under infra/modules/customer-instance/ is published as infra-vMAJOR.MINOR.PATCH git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y".

After merging a module change to main:

git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z

Replacing a VM after a startup-script change

The module sets lifecycle { ignore_changes = [metadata_startup_script] } on google_compute_instance.vm so a normal terraform apply doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address; the typical workflow_dispatch input is recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'.

Appendix: CHANGELOG entry skeleton

Copy this when adding to ## [Unreleased] in CHANGELOG.md. Drop the sections you don't need; keep the Keep-a-Changelog order.

### Added
- New feature description.

### Changed
- Change description. **BREAKING** prefix + migration steps if operator-facing.

### Fixed
- Bug fix description.

### Removed
- **BREAKING** removed feature — what replaces it.

### Internal
- Refactors, test additions, dependency bumps with no behavior change.

At release-cut time ## [Unreleased] is renamed to ## [X.Y.Z] — YYYY-MM-DD and a fresh empty ## [Unreleased] is added on top. CI publishes the matching stable-YYYY.MM.N image tag for the merge commit (see Deploy workflows above).
