* ci: add actionlint workflow lint, drop superseded deploy.yml stub
* ci: extract rollback into reusable rollback.yml, wire into release smoke-test
* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup
* release: 0.54.17 — CI/release workflow consolidation
* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag
* fix(ci): rollback.yml + prune-dev-tags.sh review findings
rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
through env: instead of textual ${{ }} splicing into bash run blocks
— prevents an actor with workflow_dispatch privilege from injecting
shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
(fresh repo, or aggressive pruning at month boundary). Fail loudly
rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
inherited across workflow_call, so on-call no longer has to navigate
rollback run → caller-workflow breadcrumb → failing commit.
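The env: indirection from the first bullet can be tried outside of Actions. This is a minimal sketch, not the workflow's own code: `TARGET_INPUT` mirrors the variable name used in rollback.yml, while the payload and the `resolved` variable are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical hostile input: tries to close a quote and run commands.
payload='x"; echo INJECTED; `echo BACKTICK`; $(echo SUBST)'

# What `env:` achieves in the workflow: the runner sets the variable, and
# the script source only ever contains the literal text "$TARGET_INPUT".
export TARGET_INPUT="$payload"

# Expanding the variable yields the payload as data. The shell does not
# re-parse the expanded text for quotes, backticks, or $(...).
resolved="$(printf '%s' "$TARGET_INPUT")"

if [ "$resolved" = "$payload" ]; then
  echo "payload passed through as literal data"
fi
```

The variable still has to be quoted at every use; passing it to `eval` or expanding it unquoted would reintroduce parsing of its contents.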
prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
the pipe early SIGPIPEs printf (exit 141) and aborts the script
before any deletion runs — exactly the multi-month-backlog scenario
the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
build a tag→version-id map up-front. Closes two problems:
1. O(N × pages) GHCR API calls collapse to one paginated listing
— months of accumulated CalVer tags no longer risk tripping
abuse detection.
2. The new jq filter excludes any version that ALSO carries a
floating alias (:stable, :dev, *-latest). GHCR DELETE-version
drops the entire manifest, so pruning a CalVer tag that shares
a manifest with :stable (e.g. after a rollback re-tag) would
have vaporized :stable. Now it's skipped with a log line.
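The array-slice preview from the first prune-dev-tags.sh bullet can be sketched as follows. `TO_PRUNE` is the name used above; the tag names and the `preview` variable are hypothetical, and the backlog size is arbitrary.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical backlog: 500 made-up CalVer-style tag names.
TO_PRUNE=()
for i in $(seq 1 500); do
  TO_PRUNE+=("v2025.01.$i")
done

# Preview the first 20 entries with an array slice. No pipe is involved,
# so there is no writer that can be killed by SIGPIPE when a reader
# exits early under `set -o pipefail`.
preview="$(printf '  %s\n' "${TO_PRUNE[@]:0:20}")"

echo "Would prune ${#TO_PRUNE[@]} tags; first 20:"
echo "$preview"
```

If the array holds fewer than 20 entries, the slice simply yields what is there, so no length check is needed before the preview.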
lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
.github/workflows/ and the shell embedded in their run: blocks, so
freestanding scripts/ops/*.sh (which are in the workflow's path
filter) were never actually validated despite triggering CI.
* fix(ci): shellcheck --severity=warning to skip pre-existing info findings
The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.
* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation
rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
cancel-in-progress: false). The caller release.yml uses
cancel-in-progress: true to avoid duplicate CalVer claims, but a
second push to main mid-rollback would otherwise kill the workflow
between the :stable recovery push and the :deprecated-* audit push,
leaving :stable stuck on the broken image. A reusable workflow's own
concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
:deprecated-<stripped> GHCR alias already exists (carries the mark of
a prior failed rollback). The previous 'second-most-recent' heuristic
could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
:deprecated-* audit tag. Defense in depth — even if the concurrency
block somehow misfires, the worst case is missing audit metadata
rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
(workflow_dispatch grants directly; workflow_call acts as a cap
intersected with the caller's job-level permissions).
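The walk-back resolution described above can be exercised without a registry. In this sketch `has_deprecated_alias` is a hypothetical stand-in for the `docker manifest inspect` probe, `DEPRECATED_ALIASES` is made-up test data, and `resolve_target` is an illustrative name rather than a function in rollback.yml.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for `docker manifest inspect "$REPO:deprecated-<stripped>"`.
# DEPRECATED_ALIASES is a hypothetical space-separated list for this sketch.
DEPRECATED_ALIASES="2026.05.530"
has_deprecated_alias() {
  case " $DEPRECATED_ALIASES " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads candidate tags newest-first on stdin and prints the first one that
# is neither the failed tag nor marked deprecated. Returns 1 if none fits.
resolve_target() {
  local failed="$1" candidate
  while IFS= read -r candidate; do
    [ -z "$candidate" ] && continue
    [ "$candidate" = "$failed" ] && continue
    if has_deprecated_alias "${candidate#stable-}"; then
      continue
    fi
    printf '%s\n' "$candidate"
    return 0
  done
  return 1
}

candidates=$'stable-2026.05.531\nstable-2026.05.530\nstable-2026.05.529'
target="$(resolve_target stable-2026.05.531 <<< "$candidates")"
echo "Rollback target: $target"   # prints: Rollback target: stable-2026.05.529
```

With 531 as the failed tag and 530 already carrying a deprecated alias, the second-most-recent heuristic would have picked 530; the walk-back continues to 529.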
release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
was factually wrong — GITHUB_TOKEN's default for issues is never write
— and read as 'this line just documents a default', so a future
cleanup PR could delete it. The line is load-bearing: workflow_call
permissions are bounded by the caller's GITHUB_TOKEN scope, and
removing it would silently 403 rollback.yml's gh issue create step.
prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
fallback turned every API failure (403 missing scope, 429 rate limit,
transient 5xx) into a silent no-op with exit 0 — operators saw a
green run while every TAG fell through to the same 'no eligible
version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
-l', so an orphan GHCR image is never enumerated again). Fetching
first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
unconditional — local 'git tag -d' is gated on successful remote
push, so a refused remote delete (tag-protection rule, missing
contents:write) leaves the local tag in place for retry. The flag
propagates to a final 'exit 1' so the cron run turns red on any
push or DELETE failure.
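The gating and failure-propagation pattern from the last bullet might look roughly like this. `PRUNE_FAILED` is the flag named above; `remote_delete`, `local_delete`, `prune_tags`, and the tag names are hypothetical stand-ins for the real `git push --delete` and `git tag -d` calls.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-ins for `git push --delete origin <tag>` and
# `git tag -d <tag>`. The remote refuses one tag, as a protection rule would.
remote_delete() { [ "$1" != "protected-tag" ]; }
local_delete()  { DELETED_LOCAL+=("$1"); }

DELETED_LOCAL=()
PRUNE_FAILED=0

prune_tags() {
  local tag
  for tag in "$@"; do
    if remote_delete "$tag"; then
      # Local delete is gated on the remote delete having succeeded.
      local_delete "$tag"
    else
      echo "remote delete refused for $tag; keeping local tag for retry" >&2
      PRUNE_FAILED=1
    fi
  done
}

prune_tags old-1 protected-tag old-2
echo "deleted locally: ${DELETED_LOCAL[*]}"
echo "PRUNE_FAILED=$PRUNE_FAILED"
# A real script would finish with: [ "$PRUNE_FAILED" -eq 0 ] || exit 1
```

The loop keeps going after a refusal, so one protected tag does not block the rest of the backlog, while the final exit status still turns the run red.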
lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to
match the workflow's recursive 'scripts/ops/**.sh' path filter. The
previous bare 'scripts/ops/*.sh' glob only matched top-level files;
a future script under a subdirectory would have triggered the
workflow but never been linted.
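The difference between the bare glob and `find` is easy to reproduce in a scratch directory. The file names below are hypothetical; only the `scripts/ops` layout comes from the text above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a throwaway tree: one top-level script and one nested script.
root="$(mktemp -d)"
trap 'rm -rf "$root"' EXIT
mkdir -p "$root/scripts/ops/tls"
touch "$root/scripts/ops/upgrade.sh" "$root/scripts/ops/tls/rotate.sh"
cd "$root"

# Bare glob: matches only direct children of scripts/ops.
glob_matches=(scripts/ops/*.sh)

# find: recurses into subdirectories. The pattern is quoted so the shell
# hands it to find unexpanded.
find_matches=()
while IFS= read -r f; do
  find_matches+=("$f")
done < <(find scripts/ops -type f -name '*.sh' | sort)

echo "glob: ${#glob_matches[@]} file(s)"   # prints: glob: 1 file(s)
echo "find: ${#find_matches[@]} file(s)"   # prints: find: 2 file(s)
```

One way to feed the result to the linter, assuming shellcheck is installed, would be `find scripts/ops -type f -name '*.sh' -print0 | xargs -0 shellcheck --severity=warning`.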
* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml
Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
points, walk-back target resolution, immutability + concurrency
guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
semantics, floating-alias safety, dry_run preview, failure-propagation
exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
shellcheck (--severity=warning blocking) advisory checks
CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
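The floating-alias safety documented for prune-dev-tags can be sketched in plain bash. The commit describes a jq filter; this is a swapped-in shell equivalent for illustration only, with a hypothetical `has_floating_alias` helper and made-up version ids and tag names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Returns 0 if any tag on a version is a floating alias
# (:stable, :dev, or anything ending in -latest).
has_floating_alias() {
  local tag
  for tag in "$@"; do
    case "$tag" in
      stable|dev|*-latest) return 0 ;;
    esac
  done
  return 1
}

# Hypothetical versions, one per line: "<version-id> <tag> [<tag>...]"
while read -r id tags; do
  # shellcheck disable=SC2086  # word splitting of the tag list is intended
  if has_floating_alias $tags; then
    echo "skip version $id (shares a manifest with a floating alias)"
  else
    echo "prune version $id"
  fi
done <<'EOF'
101 v2025.01.1
102 v2025.01.2 stable
103 v2025.01.3 dev-latest
EOF
```

Only version 101 is reported as prunable here; 102 and 103 are skipped because deleting the version would also remove the alias that points at the same manifest.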
179 lines
7.5 KiB
YAML
name: Rollback :stable

# Re-tag :stable to a previous known-good build, deprecate the failing
# image, and open a tracking issue. Callable from release.yml on
# smoke-test failure (workflow_call) or manually by an operator
# (workflow_dispatch) when something breaks post-deploy.

on:
  workflow_call:
    inputs:
      failed_image_tag:
        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
        type: string
        required: true
      target_image_tag:
        description: 'Override the rollback target. Defaults to the second-most-recent stable-* tag.'
        type: string
        required: false
  workflow_dispatch:
    inputs:
      failed_image_tag:
        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
        type: string
        required: true
      target_image_tag:
        description: 'Rollback target. Defaults to the second-most-recent stable-* tag.'
        type: string
        required: false

# NOTE: This top-level block has dual semantics:
#   - On `workflow_dispatch` (manual operator trigger): governs the
#     GITHUB_TOKEN scope directly.
#   - On `workflow_call` from release.yml: the caller's job-level
#     `permissions:` (rollback-on-smoke-fail) governs, intersected with
#     this block as a cap. Tightening this block lowers the cap on both
#     entry points; tightening the caller affects only the workflow_call
#     path. Keep both in sync if you adjust either side.
permissions:
  contents: read
  packages: write
  issues: write

# Override the caller's `cancel-in-progress: true` concurrency policy
# (release.yml groups by ref and cancels older runs to avoid duplicate
# CalVer claims). A rollback mid-flight must NOT be cancelled — the
# re-tag step has multiple `docker push`es; a cancellation between them
# would leave :stable on the broken image. A reusable workflow's own
# concurrency block overrides the inherited one.
concurrency:
  group: rollback-${{ github.repository }}-${{ inputs.failed_image_tag }}
  cancel-in-progress: false

jobs:
  rollback:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
        with:
          fetch-depth: 0
          fetch-tags: true

      # GHCR login moved BEFORE target resolution so the resolve step can
      # use `docker manifest inspect` to skip known-broken candidates
      # (versions that already carry a `:deprecated-*` alias from a prior
      # rollback).
      - name: Log in to GHCR
        uses: docker/login-action@v4
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Resolve target image
        id: target
        # Inputs are passed via env to keep them out of the shell-script
        # source — `${{ ... }}` is textual substitution, so an attacker with
        # workflow_dispatch privilege could otherwise close a quote and
        # inject commands. Env-var expansion does not re-parse for command
        # substitution, so it's safe.
        env:
          TARGET_INPUT: ${{ inputs.target_image_tag }}
          FAILED: ${{ inputs.failed_image_tag }}
          REPO_SLUG: ${{ github.repository }}
        run: |
          REPO="ghcr.io/${REPO_SLUG}"
          if [ -n "$TARGET_INPUT" ]; then
            TARGET="$TARGET_INPUT"
          else
            # Walk back through stable-* tags newest-first; skip any whose
            # `:deprecated-<stripped>` GHCR alias exists, because that
            # marks a previously-failed release. The naive "second-most-
            # recent" heuristic re-points :stable at known-broken images on
            # cascading failures (rollback only pushes a deprecated alias,
            # it does NOT delete the failed git tag — that would break
            # CalVer immutability — so the failed tag stays in sort order
            # on subsequent rollbacks).
            TARGET=""
            while IFS= read -r CANDIDATE; do
              [ -z "$CANDIDATE" ] && continue
              [ "$CANDIDATE" = "$FAILED" ] && continue
              STRIPPED="${CANDIDATE#stable-}"
              if docker manifest inspect "$REPO:deprecated-${STRIPPED}" > /dev/null 2>&1; then
                echo " skipping $CANDIDATE (carries :deprecated-${STRIPPED} from a prior rollback)"
                continue
              fi
              TARGET="$CANDIDATE"
              break
            done < <(git tag -l "stable-*" --sort=-version:refname)
            if [ -z "$TARGET" ]; then
              echo "::error::No known-good previous stable-* tag found — supply target_image_tag explicitly"
              exit 1
            fi
          fi
          # Defense in depth: even with the walk-back, refuse if the
          # resolved target somehow matches FAILED (e.g. operator override
          # via target_image_tag pointing at the failed build).
          if [ "$TARGET" = "$FAILED" ]; then
            echo "::error::Rollback target equals failed tag ($TARGET) — refusing to re-push broken image"
            exit 1
          fi
          echo "target=$TARGET" >> "$GITHUB_OUTPUT"
          echo "Rollback target: $TARGET"

      - name: Re-tag :stable to target + mark failed image deprecated
        env:
          FAILED: ${{ inputs.failed_image_tag }}
          TARGET: ${{ steps.target.outputs.target }}
        run: |
          REPO="ghcr.io/${{ github.repository }}"
          if [[ "$FAILED" != stable-* ]]; then
            echo "::warning::failed_image_tag '$FAILED' is not a stable-* tag — this workflow rolls back the :stable channel; the deprecated-* tag name may be non-standard."
          fi
          # Strip the channel prefix for a backward-compatible deprecated tag name
          DEPRECATED="deprecated-${FAILED#stable-}"

          # Order matters: push :stable recovery FIRST, then the
          # :deprecated-* audit tag. If something interrupts mid-step
          # (concurrency block above SHOULD prevent it, but defense in
          # depth), the worst case is missing audit metadata — production
          # is already healthy. The reverse order risked :stable stuck on
          # the broken image.
          docker pull "$REPO:$TARGET"
          docker tag "$REPO:$TARGET" "$REPO:stable"
          docker push "$REPO:stable"

          docker pull "$REPO:$FAILED"
          docker tag "$REPO:$FAILED" "$REPO:$DEPRECATED"
          docker push "$REPO:$DEPRECATED"

      - name: Open tracking issue
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          FAILED: ${{ inputs.failed_image_tag }}
          TARGET: ${{ steps.target.outputs.target }}
          REPO_SLUG: ${{ github.repository }}
          EVENT: ${{ github.event_name }}
          SERVER_URL: ${{ github.server_url }}
          RUN_ID: ${{ github.run_id }}
          SHA: ${{ github.sha }}
        run: |
          # Same channel-prefix strip as the re-tag step, so the issue body
          # shows the deprecated tag name that was actually pushed.
          DEPRECATED="deprecated-${FAILED#stable-}"
          gh issue create \
            --title "Rollback: :stable reverted from $FAILED to $TARGET" \
            --body "$(cat <<EOF
          ## Rollback report

          - Failed image: \`ghcr.io/${REPO_SLUG}:${FAILED}\`
          - Commit: \`${SHA}\`
          - Deprecated tag: \`ghcr.io/${REPO_SLUG}:${DEPRECATED}\`
          - Rolled back to: \`ghcr.io/${REPO_SLUG}:${TARGET}\`
          - Triggered by: ${EVENT}
          - Run: ${SERVER_URL}/${REPO_SLUG}/actions/runs/${RUN_ID}

          Investigate before re-deploying.
          EOF
          )" \
            --label "bug" || echo "::warning::Failed to open rollback tracking issue — check gh auth / labels"