* ci: add actionlint workflow lint, drop superseded deploy.yml stub
* ci: extract rollback into reusable rollback.yml, wire into release smoke-test
* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup
* release: 0.54.17 — CI/release workflow consolidation
* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag
* fix(ci): rollback.yml + prune-dev-tags.sh review findings
rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
through env: instead of textual ${{ }} splicing into bash run blocks
— prevents an actor with workflow_dispatch privilege from injecting
shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
(fresh repo, or aggressive pruning at month boundary). Fail loudly
rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
inherited across workflow_call, so on-call no longer has to navigate
rollback run → caller-workflow breadcrumb → failing commit.
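The difference between splicing an input into the script source and passing it through the environment can be reproduced locally. The sketch below is only an illustration under stated assumptions: `run_spliced` and `run_env` are hypothetical helper names, and `bash -c` stands in for how a `run:` block is executed after `${{ }}` substitution.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two ways a workflow input can reach a run block.

# Textual splicing: the input is pasted into the script source before bash
# parses it, which is what ${{ inputs.x }} does inside a run: block.
run_spliced() {
  local input="$1"
  bash -c "TAG=\"${input}\"; echo \"tag=\$TAG\""
}

# Env passing: the script source is fixed; the input only ever arrives as data.
run_env() {
  local input="$1"
  TAG_INPUT="$input" bash -c 'TAG="$TAG_INPUT"; echo "tag=$TAG"'
}

# A payload that closes the quote and smuggles in a command.
payload='x"; echo INJECTED; echo "'

run_spliced "$payload"   # the smuggled echo runs as a command
run_env "$payload"       # the payload is printed verbatim as the tag value
```

With the env form the quote characters never reach the parser, so there is nothing for the payload to close.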
prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
the pipe early SIGPIPEs printf (exit 141) and aborts the script
before any deletion runs — exactly the multi-month-backlog scenario
the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
build a tag→version-id map up-front. Closes two problems:
1. O(N × pages) GHCR API calls collapse to one paginated listing
— months of accumulated CalVer tags no longer risk tripping
abuse detection.
2. The new jq filter excludes any version that ALSO carries a
floating alias (:stable, :dev, *-latest). GHCR DELETE-version
drops the entire manifest, so pruning a CalVer tag that shares
a manifest with :stable (e.g. after a rollback re-tag) would
have vaporized :stable. Now it's skipped with a log line.
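The preview fix described in the first prune-dev-tags.sh item above can be reproduced in isolation. This is a minimal sketch, not the script itself: the tag names are made up, and the backlog is deliberately oversized (50,000 entries) so that the writer is reliably still writing when `head` exits. With a short list the failure depends on timing.

```shell
#!/usr/bin/env bash
set -o pipefail

# Hypothetical backlog: made-up tags, large enough to overflow the pipe buffer.
TO_PRUNE=()
for i in $(seq 1 50000); do
  TO_PRUNE+=("dev-2025.01.$i")
done

# Old shape: head exits after 20 lines, the writer's next write hits a closed
# pipe, and under pipefail the whole pipeline reports failure.
preview_with_head() {
  printf '%s\n' "${TO_PRUNE[@]}" | head -n 20
}

# New shape: slice the array first; there is no pipe to close early.
preview_with_slice() {
  printf '%s\n' "${TO_PRUNE[@]:0:20}"
}

head_status=0
preview_with_head > /dev/null || head_status=$?
slice_status=0
preview_with_slice > /dev/null || slice_status=$?
echo "head preview status:  $head_status"
echo "slice preview status: $slice_status"
```

The statuses are captured with `|| status=$?` so the demonstration itself keeps running even if `set -e` is in effect.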
lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
.github/workflows/ and the shell embedded in their run: blocks, so
freestanding scripts/ops/*.sh (which are in the workflow's path
filter) were never actually validated despite triggering CI.
* fix(ci): shellcheck --severity=warning to skip pre-existing info findings
The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.
* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation
rollback.yml:
- Add its own concurrency block (group: rollback-<repo>-<failed_tag>,
cancel-in-progress: false). The caller release.yml uses
cancel-in-progress: true to avoid duplicate CalVer claims, but a
second push to main mid-rollback would otherwise kill the workflow
between the :stable recovery push and the :deprecated-* audit push,
leaving :stable stuck on the broken image. A reusable workflow's own
concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
:deprecated-<stripped> GHCR alias already exists (carries the mark of
a prior failed rollback). The previous 'second-most-recent' heuristic
could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
:deprecated-* audit tag. Defense in depth — even if the concurrency
block somehow misfires, the worst case is missing audit metadata
rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
(workflow_dispatch grants directly; workflow_call acts as a cap
intersected with the caller's job-level permissions).
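The walk-back selection described above can be sketched without a registry. Assumptions are labeled in the code: `has_deprecated_alias` is a hypothetical stand-in for the `docker manifest inspect` probe, `DEPRECATED_VERSIONS` is made-up data, and the candidates are passed in already sorted newest-first, as `git tag -l "stable-*" --sort=-version:refname` would list them.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for `docker manifest inspect "$REPO:deprecated-<stripped>"`:
# succeeds when the stripped version appears in DEPRECATED_VERSIONS (made-up data).
DEPRECATED_VERSIONS="2026.05.3 2026.05.2"
has_deprecated_alias() {
  case " $DEPRECATED_VERSIONS " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# resolve_target FAILED CANDIDATE...   (candidates newest-first)
# Prints the first candidate that is neither the failed tag nor marked
# deprecated; returns 1 when no candidate qualifies.
resolve_target() {
  local failed="$1"; shift
  local candidate
  for candidate in "$@"; do
    [ "$candidate" = "$failed" ] && continue
    if has_deprecated_alias "${candidate#stable-}"; then
      continue
    fi
    printf '%s\n' "$candidate"
    return 0
  done
  return 1
}

# Two earlier releases already failed, so the walk-back lands on .1.
resolve_target stable-2026.05.4 \
  stable-2026.05.4 stable-2026.05.3 stable-2026.05.2 stable-2026.05.1
```

A plain "second-most-recent" pick would have chosen stable-2026.05.3 here, which is one of the tags already marked as failed.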
release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
was factually wrong (once a workflow declares a permissions block, any
scope it omits, issues included, is set to none) and read as 'this line
just documents a default', so a future cleanup PR could delete it. The
line is load-bearing: workflow_call permissions are bounded by the
caller's GITHUB_TOKEN scope, and removing it would make rollback.yml's
gh issue create step fail with a 403 that the '|| echo' fallback hides.
prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
fallback turned every API failure (403 missing scope, 429 rate limit,
transient 5xx) into a silent no-op with exit 0 — operators saw a
green run while every TAG fell through to the same 'no eligible
version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
-l', so an orphan GHCR image is never enumerated again). Fetching
first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
unconditional — local 'git tag -d' is gated on successful remote
push, so a refused remote delete (tag-protection rule, missing
contents:write) leaves the local tag in place for retry. The flag
propagates to a final 'exit 1' so the cron run turns red on any
push or DELETE failure.
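The failure-propagation pattern in the last item can be sketched as follows. The stubs are hypothetical: `remote_delete` and `local_delete` stand in for `git push --delete` and `git tag -d`, and `PROTECTED_TAGS` simulates a remote that refuses one delete. The real script tracks a `PRUNE_FAILED` flag; a local variable plays that role here.

```shell
#!/usr/bin/env bash

# Hypothetical stubs: the remote refuses deletes for tags in PROTECTED_TAGS,
# and local deletes are only recorded, not executed.
PROTECTED_TAGS="dev-2025.01.2"
DELETED_LOCALLY=""
remote_delete() {
  case " $PROTECTED_TAGS " in *" $1 "*) return 1 ;; esac
  return 0
}
local_delete() { DELETED_LOCALLY="$DELETED_LOCALLY $1"; }

# Delete each tag remotely first; touch the local tag only on success.
# Any refusal is remembered and surfaces as a non-zero return at the end,
# after the remaining tags have still been processed.
prune_tags() {
  local tag failed=0
  for tag in "$@"; do
    if remote_delete "$tag"; then
      local_delete "$tag"
    else
      echo "remote delete refused for $tag; keeping local tag for retry" >&2
      failed=1
    fi
  done
  return "$failed"
}

prune_tags dev-2025.01.1 dev-2025.01.2 dev-2025.01.3 \
  || echo "at least one delete failed; a cron run would exit 1 here"
echo "deleted locally:$DELETED_LOCALLY"
```

Deferring the failure to the end keeps one protected tag from blocking the rest of the backlog while still turning the run red.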
lint-workflows.yml:
- shellcheck step now uses "find scripts/ops -type f -name '*.sh'" to
match the workflow's recursive 'scripts/ops/**.sh' path filter. The
previous bare 'scripts/ops/*.sh' glob only matched top-level files;
a future script under a subdirectory would have triggered the
workflow but never been linted.
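The gap between the bare glob and the recursive `find` is easy to see in a throwaway directory. This is a sketch with hypothetical file names (`upgrade.sh`, `tls/rotate.sh`); only the `find` invocation mirrors the one in the workflow.

```shell
#!/usr/bin/env bash

# Build a temporary tree with one top-level script and one nested script.
demo_root="$(mktemp -d)"
trap 'rm -rf "$demo_root"' EXIT
mkdir -p "$demo_root/scripts/ops/tls"
touch "$demo_root/scripts/ops/upgrade.sh" "$demo_root/scripts/ops/tls/rotate.sh"

# Non-recursive glob: matches files directly under scripts/ops only.
list_with_glob() {
  (cd "$1" && printf '%s\n' scripts/ops/*.sh)
}

# Recursive find; the pattern is quoted so the shell does not expand it
# before find sees it.
list_with_find() {
  (cd "$1" && find scripts/ops -type f -name '*.sh' | sort)
}

echo "glob:"; list_with_glob "$demo_root"
echo "find:"; list_with_find "$demo_root"
```

Only the `find` listing includes the nested script, which is the file the path filter would trigger on but the old glob would never lint.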
* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml
Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
points, walk-back target resolution, immutability + concurrency
guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
semantics, floating-alias safety, dry_run preview, failure-propagation
exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
shellcheck (--severity=warning blocking) advisory checks
CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
parent 7907b8082e
commit 9f5adbce37
10 changed files with 593 additions and 79 deletions
27 .github/workflows/deploy.yml vendored
@@ -1,27 +0,0 @@
-# SUPERSEDED by release.yml — CalVer tagging with stable/dev channels.
-# Kept for manual trigger only. Automated builds use release.yml.
-name: Build & Push (legacy)
-
-on:
-  workflow_dispatch: {}
-
-jobs:
-  test:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v6
-
-      - uses: actions/setup-python@v6
-        with:
-          python-version: "3.13"
-
-      - name: Install uv
-        uses: astral-sh/setup-uv@v7
-
-      - name: Install dependencies
-        run: uv pip install --system ".[dev,server]"
-
-      - name: Run tests
-        run: pytest tests/ -v --tb=short
-        env:
-          TESTING: "1"
67 .github/workflows/lint-workflows.yml vendored Normal file
@@ -0,0 +1,67 @@
+name: Lint workflows
+
+# Catches GitHub Actions / shellcheck issues in workflow YAMLs before they
+# break a real release. Runs on push/PR that touches anything under
+# .github/workflows/ and on manual workflow_dispatch. Keeps non-blocking
+# (warnings only) initially — flip to fail-fast when the existing inventory
+# is clean.
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - ".github/workflows/**"
+      - "scripts/ops/**.sh"
+  pull_request:
+    paths:
+      - ".github/workflows/**"
+      - "scripts/ops/**.sh"
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  actionlint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Run actionlint
+        run: |
+          # Pin to a specific actionlint version for reproducibility.
+          # Updates: bump the version string + verify rules in CHANGELOG.
+          ACTIONLINT_VERSION="1.7.7"
+          curl -sSL \
+            "https://github.com/rhysd/actionlint/releases/download/v${ACTIONLINT_VERSION}/actionlint_${ACTIONLINT_VERSION}_linux_amd64.tar.gz" \
+            | tar xz actionlint
+          ./actionlint -color
+        # Continue-on-error initially: surface findings without blocking
+        # while the existing workflow inventory is being cleaned up. Flip
+        # to false (default) once the repo is actionlint-clean.
+        continue-on-error: true
+
+      - name: Run shellcheck on ops scripts
+        # actionlint above only walks `.github/workflows/**` + the shell
+        # snippets embedded inside their `run:` blocks; freestanding
+        # `scripts/ops/**/*.sh` files (which are also in this workflow's
+        # path filter via the `**.sh` glob) need their own pass.
+        # shellcheck is pre-installed on ubuntu-latest runners.
+        #
+        # `find` matches the recursive `**.sh` path filter above. A bare
+        # `scripts/ops/*.sh` glob would silently skip future scripts under
+        # subdirectories — the workflow would trigger on them (filter
+        # matches) but never lint them.
+        #
+        # `--severity=warning` blocks only on warning+ findings (actual
+        # bugs); info/style level passes silently. This lets the existing
+        # inventory's info-level findings (e.g. SC1091, SC2015 in
+        # agnes-auto-upgrade.sh / agnes-tls-rotate.sh) ride through while
+        # still catching real regressions in new scripts.
+        run: |
+          mapfile -t SCRIPTS < <(find scripts/ops -type f -name '*.sh' 2>/dev/null)
+          if [ "${#SCRIPTS[@]}" -gt 0 ]; then
+            shellcheck --severity=warning "${SCRIPTS[@]}"
+          else
+            echo "No scripts/ops/**/*.sh found — nothing to check."
+          fi
44 .github/workflows/prune-dev-tags.yml vendored Normal file
@@ -0,0 +1,44 @@
+name: Prune dev tags
+
+# Weekly housekeeping: prune legacy CalVer git tags + GHCR images
+# (dev-YYYY.MM.N / stable-YYYY.MM.N) on a KEEP_MONTHS retention window
+# (current + previous month by default). Manual trigger supports a
+# dry-run and a KEEP_MONTHS override. Floating aliases (:stable, :dev,
+# *-latest) are git-tagless and never matched, so they are never pruned.
+# Scheduled runs always prune for real; use workflow_dispatch with
+# dry_run=true to preview.
+
+on:
+  schedule:
+    - cron: '0 4 * * 0' # Sundays 04:00 UTC
+  workflow_dispatch:
+    inputs:
+      dry_run:
+        description: 'Dry-run only — list tags that would be pruned, do not delete'
+        type: boolean
+        default: true
+      keep_months:
+        description: 'Keep current month + this many previous months (e.g. 1 = 2 months total)'
+        type: string
+        default: '1'
+
+permissions:
+  contents: write # delete git tags
+  packages: write # delete GHCR image versions
+
+jobs:
+  prune:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          fetch-tags: true
+
+      - name: Run prune
+        env:
+          GITHUB_REPOSITORY: ${{ github.repository }}
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          KEEP_MONTHS: ${{ inputs.keep_months || '1' }}
+          PRUNE_DRY_RUN: ${{ inputs.dry_run && '1' || '0' }}
+        run: bash scripts/ops/prune-dev-tags.sh
71 .github/workflows/release.yml vendored
@@ -24,10 +24,14 @@ on:
 permissions:
   contents: write
   packages: write
-  # `issues: write` lets the smoke-test job's rollback step open a
-  # GitHub issue alerting operators when an auto-rollback fires. Without
-  # this, the `gh issue create` call hits 403 and the `|| echo` fallback
-  # silently swallows it — operators see :stable revert with no alert.
+  # issues: write — explicitly granted at workflow scope so the
+  # rollback-on-smoke-fail job (which calls rollback.yml via workflow_call)
+  # can open a tracking issue when an auto-rollback fires. Reusable-
+  # workflow permissions are bounded by the caller's GITHUB_TOKEN scope,
+  # so removing this line would silently 403 rollback.yml's gh issue
+  # create step (the || echo fallback would swallow the failure, leaving
+  # :stable reverted with no operator alert). Keep in sync with the
+  # rollback-on-smoke-fail job-level permissions below.
   issues: write
 
 # When a developer pushes a brand-new branch with code changes, GitHub fires
@@ -208,12 +212,10 @@ jobs:
           fetch-depth: 0
           fetch-tags: true
 
-      # Required for the rollback step's `docker push` to GHCR. The
-      # `build-and-push` job logs in for itself; this job needs its own
-      # login since GitHub Actions tokens are scoped per-job. Without it,
-      # the rollback hits "unauthenticated: User cannot be authenticated
-      # with the token provided" and silently leaves :stable pointing at
-      # the broken image (real incident: PR #137 / 4ec5ff44).
+      # Required so `Start Agnes from built image` can pull the just-built
+      # private GHCR image. The `build-and-push` job logs in for itself;
+      # this job needs its own login since GitHub Actions tokens are scoped
+      # per-job.
       - name: Log in to GHCR
         uses: docker/login-action@v4
         with:
@@ -234,44 +236,6 @@ jobs:
       - name: Run smoke tests
         run: bash scripts/smoke-test.sh http://localhost:8000
 
-      - name: Automatic rollback on failure
-        if: failure()
-        env:
-          # Required for the `gh issue create` call below — without GH_TOKEN
-          # the gh CLI fails the auth check and the issue creation falls
-          # through the `|| echo` fallback, so an operator never sees the
-          # rollback alert.
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-        run: |
-          IMAGE_TAG="${{ needs.build-and-push.outputs.image_tag }}"
-          VERSION="${{ needs.build-and-push.outputs.version }}"
-          DEPRECATED_TAG="deprecated-${VERSION}"
-          REPO="ghcr.io/${{ github.repository }}"
-
-          echo "Smoke test failed — initiating rollback"
-
-          # Tag the current (failed) image as :deprecated-YYYY.MM.N
-          docker pull "${REPO}:${IMAGE_TAG}"
-          docker tag "${REPO}:${IMAGE_TAG}" "${REPO}:${DEPRECATED_TAG}"
-          docker push "${REPO}:${DEPRECATED_TAG}"
-          echo "Tagged failed image as ${REPO}:${DEPRECATED_TAG}"
-
-          # Revert :stable to the previous known-good image
-          PREV_TAG=$(git tag -l "stable-*" --sort=-version:refname | head -2 | tail -1)
-          if [ -n "$PREV_TAG" ]; then
-            docker pull "${REPO}:${PREV_TAG}"
-            docker tag "${REPO}:${PREV_TAG}" "${REPO}:stable"
-            docker push "${REPO}:stable"
-            echo "Reverted :stable to ${PREV_TAG}"
-          else
-            echo "WARNING: No previous stable tag found — cannot revert :stable automatically"
-          fi
-
-          # Create a GitHub issue alerting about the failure
-          ISSUE_TITLE="Smoke test failure — rollback to ${PREV_TAG:-unknown}"
-          ISSUE_BODY="## Automatic Rollback Report\n\nThe smoke test for image \`${IMAGE_TAG}\` failed.\n\n- **Failed image**: \`${REPO}:${IMAGE_TAG}\`\n- **Deprecated tag**: \`${REPO}:${DEPRECATED_TAG}\`\n- **Rolled back to**: \`${PREV_TAG:-N/A}\`\n- **Commit**: \`${{ github.sha }}\`\n- **Run**: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\n\nPlease investigate and fix before re-deploying."
-          gh issue create --title "$ISSUE_TITLE" --body "$(echo -e "$ISSUE_BODY")" --label "bug" || echo "Failed to create GitHub issue (gh CLI may not be available)"
-
       - name: Collect logs on failure
         if: failure()
         run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml logs > smoke-test-logs.txt
@@ -287,6 +251,17 @@ jobs:
         if: always()
         run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml down -v
 
+  rollback-on-smoke-fail:
+    needs: [build-and-push, smoke-test]
+    if: failure() && needs.smoke-test.result == 'failure'
+    uses: ./.github/workflows/rollback.yml
+    with:
+      failed_image_tag: ${{ needs.build-and-push.outputs.image_tag }}
+    permissions:
+      contents: read
+      packages: write
+      issues: write
+
   # Reproduces the deploy shape that broke agnes-development on 2026-04-29:
   # the production stack uses docker-compose.host-mount.yml to bind-mount /data
   # from the host PD instead of using a Docker named volume. Docker initializes
179 .github/workflows/rollback.yml vendored Normal file
@@ -0,0 +1,179 @@
+name: Rollback :stable
+
+# Re-tag :stable to a previous known-good build, deprecate the failing
+# image, and open a tracking issue. Callable from release.yml on
+# smoke-test failure (workflow_call) or manually by an operator
+# (workflow_dispatch) when something breaks post-deploy.
+
+on:
+  workflow_call:
+    inputs:
+      failed_image_tag:
+        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
+        type: string
+        required: true
+      target_image_tag:
+        description: 'Override the rollback target. Defaults to the second-most-recent stable-* tag.'
+        type: string
+        required: false
+  workflow_dispatch:
+    inputs:
+      failed_image_tag:
+        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
+        type: string
+        required: true
+      target_image_tag:
+        description: 'Rollback target. Defaults to the second-most-recent stable-* tag.'
+        type: string
+        required: false
+
+# NOTE: This top-level block has dual semantics:
+# - On `workflow_dispatch` (manual operator trigger): governs the
+# GITHUB_TOKEN scope directly.
+# - On `workflow_call` from release.yml: the caller's job-level
+# `permissions:` (rollback-on-smoke-fail) governs, intersected with
+# this block as a cap. Tightening this block lowers the cap on both
+# entry points; tightening the caller affects only the workflow_call
+# path. Keep both in sync if you adjust either side.
+permissions:
+  contents: read
+  packages: write
+  issues: write
+
+# Override the caller's `cancel-in-progress: true` concurrency policy
+# (release.yml groups by ref and cancels older runs to avoid duplicate
+# CalVer claims). A rollback mid-flight must NOT be cancelled — the
+# re-tag step has multiple `docker push`es; a cancellation between them
+# would leave :stable on the broken image. A reusable workflow's own
+# concurrency block overrides the inherited one.
+concurrency:
+  group: rollback-${{ github.repository }}-${{ inputs.failed_image_tag }}
+  cancel-in-progress: false
+
+jobs:
+  rollback:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          fetch-tags: true
+
+      # GHCR login moved BEFORE target resolution so the resolve step can
+      # use `docker manifest inspect` to skip known-broken candidates
+      # (versions that already carry a `:deprecated-*` alias from a prior
+      # rollback).
+      - name: Log in to GHCR
+        uses: docker/login-action@v4
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Resolve target image
+        id: target
+        # Inputs are passed via env to keep them out of the shell-script
+        # source — `${{ ... }}` is textual substitution, so an attacker with
+        # workflow_dispatch privilege could otherwise close a quote and
+        # inject commands. Env-var expansion does not re-parse for command
+        # substitution, so it's safe.
+        env:
+          TARGET_INPUT: ${{ inputs.target_image_tag }}
+          FAILED: ${{ inputs.failed_image_tag }}
+          REPO_SLUG: ${{ github.repository }}
+        run: |
+          REPO="ghcr.io/${REPO_SLUG}"
+          if [ -n "$TARGET_INPUT" ]; then
+            TARGET="$TARGET_INPUT"
+          else
+            # Walk back through stable-* tags newest-first; skip any whose
+            # `:deprecated-<stripped>` GHCR alias exists, because that
+            # marks a previously-failed release. The naive "second-most-
+            # recent" heuristic re-points :stable at known-broken images on
+            # cascading failures (rollback only pushes a deprecated alias,
+            # it does NOT delete the failed git tag — that would break
+            # CalVer immutability — so the failed tag stays in sort order
+            # on subsequent rollbacks).
+            TARGET=""
+            while IFS= read -r CANDIDATE; do
+              [ -z "$CANDIDATE" ] && continue
+              [ "$CANDIDATE" = "$FAILED" ] && continue
+              STRIPPED="${CANDIDATE#stable-}"
+              if docker manifest inspect "$REPO:deprecated-${STRIPPED}" > /dev/null 2>&1; then
+                echo " skipping $CANDIDATE (carries :deprecated-${STRIPPED} from a prior rollback)"
+                continue
+              fi
+              TARGET="$CANDIDATE"
+              break
+            done < <(git tag -l "stable-*" --sort=-version:refname)
+            if [ -z "$TARGET" ]; then
+              echo "::error::No known-good previous stable-* tag found — supply target_image_tag explicitly"
+              exit 1
+            fi
+          fi
+          # Defense in depth: even with the walk-back, refuse if the
+          # resolved target somehow matches FAILED (e.g. operator override
+          # via target_image_tag pointing at the failed build).
+          if [ "$TARGET" = "$FAILED" ]; then
+            echo "::error::Rollback target equals failed tag ($TARGET) — refusing to re-push broken image"
+            exit 1
+          fi
+          echo "target=$TARGET" >> "$GITHUB_OUTPUT"
+          echo "Rollback target: $TARGET"
+
+      - name: Re-tag :stable to target + mark failed image deprecated
+        env:
+          FAILED: ${{ inputs.failed_image_tag }}
+          TARGET: ${{ steps.target.outputs.target }}
+        run: |
+          REPO="ghcr.io/${{ github.repository }}"
+          if [[ "$FAILED" != stable-* ]]; then
+            echo "::warning::failed_image_tag '$FAILED' is not a stable-* tag — this workflow rolls back the :stable channel; the deprecated-* tag name may be non-standard."
+          fi
+          # Strip the channel prefix for a backward-compatible deprecated tag name
+          DEPRECATED="deprecated-${FAILED#stable-}"
+
+          # Order matters: push :stable recovery FIRST, then the
+          # :deprecated-* audit tag. If something interrupts mid-step
+          # (concurrency block above SHOULD prevent it, but defense in
+          # depth), the worst case is missing audit metadata — production
+          # is already healthy. The reverse order risked :stable stuck on
+          # the broken image.
+          docker pull "$REPO:$TARGET"
+          docker tag "$REPO:$TARGET" "$REPO:stable"
+          docker push "$REPO:stable"
+
+          docker pull "$REPO:$FAILED"
+          docker tag "$REPO:$FAILED" "$REPO:$DEPRECATED"
+          docker push "$REPO:$DEPRECATED"
+
+      - name: Open tracking issue
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          FAILED: ${{ inputs.failed_image_tag }}
+          TARGET: ${{ steps.target.outputs.target }}
+          REPO_SLUG: ${{ github.repository }}
+          EVENT: ${{ github.event_name }}
+          SERVER_URL: ${{ github.server_url }}
+          RUN_ID: ${{ github.run_id }}
+          SHA: ${{ github.sha }}
+        run: |
+          # Same channel-prefix strip as the re-tag step, so the issue body
+          # shows the deprecated tag name that was actually pushed.
+          DEPRECATED="deprecated-${FAILED#stable-}"
+          gh issue create \
+            --title "Rollback: :stable reverted from $FAILED to $TARGET" \
+            --body "$(cat <<EOF
+          ## Rollback report
+
+          - Failed image: \`ghcr.io/${REPO_SLUG}:${FAILED}\`
+          - Commit: \`${SHA}\`
+          - Deprecated tag: \`ghcr.io/${REPO_SLUG}:${DEPRECATED}\`
+          - Rolled back to: \`ghcr.io/${REPO_SLUG}:${TARGET}\`
+          - Triggered by: ${EVENT}
+          - Run: ${SERVER_URL}/${REPO_SLUG}/actions/runs/${RUN_ID}
+
+          Investigate before re-deploying.
+          EOF
+          )" \
+            --label "bug" || echo "::warning::Failed to open rollback tracking issue — check gh auth / labels"
3 .gitignore vendored
@@ -119,9 +119,6 @@ config/data_description.md
 # Instance-specific data description (generated per-instance)
 docs/data_description.md
 
-# Actual deploy workflow (created from .example, may contain secrets in comments)
-.github/workflows/deploy.yml
-
 # Project-specific: Data directory
 # Downloaded source data - never commit
 data/
15 CHANGELOG.md
@@ -10,6 +10,8 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 ## [Unreleased]
 
+## [0.54.17] — 2026-05-15
+
 ### Changed
 - `agnes refresh-marketplace --check` (the SessionStart-hook detector
   that fires on every Claude Code session start in every workspace)
@@ -34,6 +36,19 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 
 ### Internal
 - CI test suite sharded for speed. The `test` job in `.github/workflows/ci.yml` is now a `test-shard` matrix — 4 parallel jobs via `pytest-split`, balanced by a committed `.test_durations` file — aggregated into a single `test` status check so branch protection needs no change. The duplicate full-suite `test` job in `release.yml` is removed (it re-ran the same ~10 min suite a second time on every push to main/feature branches); `release.yml` is now image-build only, with the advisory ruff/mypy steps moved to a lean `lint` job in `ci.yml`. Net: ~10 min → ~3 min wall-clock per push, and the suite runs once instead of twice. Adds `pytest-split` to the `dev` extra.
+- CI/release workflow polish (the still-salvageable subset of the
+  abandoned PR #139, after #311 obsoleted the test-job refactor):
+  `rollback.yml` extracts the `release.yml` smoke-test rollback into a
+  reusable + manually dispatchable workflow, with a warning guard on
+  non-`stable-*` `workflow_dispatch` inputs. `prune-dev-tags.yml` adds
+  weekly housekeeping (Sundays 04:00 UTC) of legacy CalVer git tags +
+  GHCR images outside a `KEEP_MONTHS` retention window; floating
+  aliases are git-tagless and never matched. `lint-workflows.yml` runs
+  `actionlint` on `.github/workflows/**` + `scripts/ops/**.sh` changes
+  (non-blocking initially). The superseded `deploy.yml` stub is removed.
+  Excludes #139's rejected pieces (Release Drafter, setuptools_scm,
+  run-number tag scheme, main-only release triggers, deletion of
+  `cli-wheel-clean-install`).
 
 ## [0.54.16] — 2026-05-14
 
|
@@ -169,6 +169,14 @@ from a stale post-cut commit (we've shipped that race before).
`Release`-pipeline, Devin Review) are advisory — green/red doesn't gate merge.
- **`enforce_admins: true`** in branch protection means `--admin` flag on
  `gh pr merge` does NOT bypass. Don't try; just fix the underlying block.
- **`lint-workflows.yml` is advisory.** Triggered on changes to
  `.github/workflows/**` or `scripts/ops/**.sh`. Runs `actionlint` on
  workflow YAMLs + `shellcheck --severity=warning` on freestanding ops
  scripts. The `actionlint` step has `continue-on-error: true` initially
  (pre-existing inventory has info-level findings); flip to fail-fast
  once the repo is actionlint-clean. The `shellcheck` step IS blocking at
  warning+ severity — info/style findings ride through, real bugs break
  CI.

### Recovery when something derails

@@ -201,6 +209,54 @@ within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for
per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins,
regardless of who).

**Auto-rollback on smoke failure.** On `main` pushes, after `:stable` is
published, the `smoke-test` job pulls the just-built image and runs
`scripts/ops/post-deploy-smoke-test.sh` inside a docker-compose stack. If
that job fails, the `rollback-on-smoke-fail` job calls the reusable
`rollback.yml` workflow (see below) which re-points `:stable` to the
previous known-good build, marks the failed image as `:deprecated-*`,
and opens a tracking issue labeled `bug`.

### `rollback.yml` — reusable + manual rollback

Two entry points:
- **`workflow_call`** from `release.yml`'s `rollback-on-smoke-fail` job
  (auto-rollback path above).
- **`workflow_dispatch`** for manual operator rollback when something
  breaks post-deploy that the auto smoke-test missed.

**Manual rollback** — flip `:stable` back to a previous good build:

```bash
gh workflow run rollback.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f failed_image_tag=stable-YYYY.MM.N
```

By default `target_image_tag` resolves by walking back through `stable-*`
git tags newest-first and picking the first that does NOT already carry a
`:deprecated-<stripped>` GHCR alias (i.e. wasn't previously
auto-rolled-back). That prevents cascading failures from re-pointing
`:stable` at a known-broken image. To force a specific target:

```bash
gh workflow run rollback.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f failed_image_tag=stable-2026.05.531 \
  -f target_image_tag=stable-2026.04.474
```
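
The default target resolution described above can be sketched in plain bash. This is a minimal illustration, not the workflow's actual step: `pick_rollback_target` is a hypothetical helper, the candidate tags and the existing `deprecated-*` aliases are passed in as arguments rather than read from git or GHCR, and it assumes `<stripped>` means the tag with its `stable-` prefix removed. It also skips the failed tag itself, since re-pointing `:stable` at the failed image would not be a rollback.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a rollback target from stable-* tags.
#   $1 = failed tag (e.g. stable-2026.05.531)
#   $2 = space-separated stable-* tags, newest first
#   $3 = space-separated deprecated aliases already present in the registry
# Assumption: the audit alias for stable-X is deprecated-X.
pick_rollback_target() {
  local failed="$1" candidates="$2" deprecated="$3"
  local tag alias
  for tag in $candidates; do
    [ "$tag" = "$failed" ] && continue      # never roll back onto the failed image
    alias="deprecated-${tag#stable-}"
    case " $deprecated " in
      *" $alias "*) continue ;;             # previously rolled back, skip it
    esac
    printf '%s\n' "$tag"
    return 0
  done
  echo "no eligible rollback target" >&2
  return 1
}

pick_rollback_target \
  "stable-2026.05.531" \
  "stable-2026.05.531 stable-2026.05.530 stable-2026.04.474" \
  "deprecated-2026.05.530"
# prints: stable-2026.04.474
```

In the example, `stable-2026.05.530` is passed over because a `deprecated-2026.05.530` alias already exists, so the walk continues to the next older tag. When no candidate survives, the helper returns non-zero instead of guessing.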

Notes:
- The workflow does NOT delete the failed git tag (CalVer immutability is
  preserved) — only the GHCR `:stable` alias is re-pointed and the failed
  image gains a `:deprecated-*` audit alias.
- Re-tag order is `:stable` recovery first, then `:deprecated-*` audit, so
  a mid-step interruption leaves production healthy with at-worst missing
  audit metadata.
- Concurrency: `cancel-in-progress: false` (overrides the caller workflow's
  cancellation policy) so a subsequent push to `main` won't kill a
  rollback mid-flight.

### `keboola-deploy.yml` — tag-triggered, explicit deploy only

Runs **only** on git tags matching `keboola-deploy-*`. Publishes:

@@ -222,6 +278,40 @@ Use this when the consumer (e.g. a customer dev VM) needs
pushes by other contributors. The infra repo pins
`image_tag = "keboola-deploy-latest"` on the relevant VM.

### `prune-dev-tags.yml` — weekly CalVer + GHCR housekeeping

Cron `0 4 * * 0` (Sundays 04:00 UTC) + `workflow_dispatch`. Prunes legacy
CalVer git tags (`dev-YYYY.MM.N`, `stable-YYYY.MM.N`) and the matching
GHCR image versions older than `KEEP_MONTHS` (default `1` → keep current +
previous month). Floating aliases (`:stable`, `:dev`, `*-latest`) are
never matched: they are git-tagless, and the GHCR pass explicitly skips
any version that shares a manifest with a floating alias to avoid
collateral deletion of `:stable` after a rollback re-tag.
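
As a worked example of the retention window, the sketch below computes which `YYYY.MM` values are kept for a given date. `keep_window` is a hypothetical function written for illustration; it takes the year and month (without a leading zero) as arguments instead of calling `date`, so the year wrap-around is easy to check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the YYYY.MM values inside the retention window.
#   $1 = year, $2 = month (1-12, no leading zero), $3 = KEEP_MONTHS
keep_window() {
  local year="$1" month="$2" keep_months="$3"
  local i y m
  for ((i = 0; i <= keep_months; i++)); do
    y=$year
    m=$((month - i))
    # Wrap into the previous year(s) when the month index drops below 1.
    while ((m < 1)); do
      m=$((m + 12))
      y=$((y - 1))
    done
    printf '%04d.%02d\n' "$y" "$m"
  done
}

keep_window 2026 5 1   # prints 2026.05 and 2026.04
keep_window 2026 1 2   # prints 2026.01, 2025.12 and 2025.11
```

With the default `KEEP_MONTHS=1`, a run in May 2026 keeps `2026.05` and `2026.04`, so a hypothetical tag such as `dev-2026.03.12` would fall outside the window and become a prune candidate.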

**Manual preview** (no deletions, lists what would be pruned):

```bash
gh workflow run prune-dev-tags.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f dry_run=true
```

**Force a wider window** (one-off; `keep_months=3` keeps the current month
plus the previous three, so it prunes less than the default):

```bash
gh workflow run prune-dev-tags.yml \
  --repo keboola/agnes-the-ai-analyst \
  -f keep_months=3
```

Scheduled (cron) runs always prune for real; `dry_run` is honored only on
manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE
failures and exits non-zero on any failure, so a refused remote push
(tag-protection rule, missing scope) or a GHCR API error turns the cron
run red instead of silently swallowing it. Local `git tag -d` is gated on
successful remote push, so a refused delete leaves the local tag in place
for retry on the next run.
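
The failure tracking described above follows a common shell pattern: record each failure in a flag, keep going, and report one non-zero status at the end. The sketch below shows the pattern in isolation. `delete_remote` and `prune_all` are hypothetical stand-ins (the real script calls `git push origin --delete` and `gh api -X DELETE`), and the `protected-*` rule exists only to simulate a refused delete.

```bash
#!/usr/bin/env bash
# Stand-in for a remote delete; refuses tags that start with "protected-".
delete_remote() {
  case "$1" in
    protected-*) return 1 ;;
    *) return 0 ;;
  esac
}

# Try every tag, remember failures, and return non-zero if any attempt failed.
prune_all() {
  local failed=0 tag
  for tag in "$@"; do
    if delete_remote "$tag"; then
      echo "deleted $tag"          # a local delete would happen only on this branch
    else
      echo "kept $tag for retry"   # remote refused, so local state stays untouched
      failed=1
    fi
  done
  return "$failed"
}

prune_all dev-2026.01.1 protected-dev-2026.01.2 dev-2026.01.3 \
  || echo "run would be marked failed"
```

One refused tag does not stop the remaining deletions, but the overall status is still non-zero, which is what turns a scheduled run red.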

### Module versioning

The customer-instance Terraform module under `infra/modules/customer-instance/`

@@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.54.16"
+version = "0.54.17"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"
174
scripts/ops/prune-dev-tags.sh
Executable file
@@ -0,0 +1,174 @@
#!/usr/bin/env bash
# Prune legacy CalVer dev/stable image identity from git + GHCR:
#
# Git tags + GHCR image versions of the form
#     dev-YYYY.MM.N       e.g. dev-2026.04.475
#     stable-YYYY.MM.N    e.g. stable-2026.04.474
# accumulate one per CI build. Retention: KEEP_MONTHS (default 1) keeps
# the current month + the previous KEEP_MONTHS months; older tags +
# images are pruned.
#
# Dry-run via PRUNE_DRY_RUN=1 (or workflow input) — lists what would be
# pruned without acting.
#
# Idempotent: re-running with no eligible tags exits 0.

set -euo pipefail

KEEP_MONTHS="${KEEP_MONTHS:-1}"
[[ "$KEEP_MONTHS" =~ ^[0-9]+$ ]] || { echo "KEEP_MONTHS must be a non-negative integer (got: '$KEEP_MONTHS')"; exit 1; }
DRY_RUN="${PRUNE_DRY_RUN:-0}"
REPO="${GITHUB_REPOSITORY:?GITHUB_REPOSITORY env var must be set (e.g. keboola/agnes-the-ai-analyst)}"

cd "$(git rev-parse --show-toplevel)"

# Compute the set of YYYY.MM strings to KEEP — walk back KEEP_MONTHS+1
# months from today.
TODAY_YEAR=$(date +%Y)
TODAY_MONTH=$(date +%m)
TODAY_MONTH_NUM=$((10#$TODAY_MONTH))  # strip leading zero for arithmetic

KEEP_YYYY_MM=()
for i in $(seq 0 "$KEEP_MONTHS"); do
  Y=$TODAY_YEAR
  M=$((TODAY_MONTH_NUM - i))
  while [ "$M" -lt 1 ]; do
    M=$((M + 12))
    Y=$((Y - 1))
  done
  KEEP_YYYY_MM+=("$(printf '%04d.%02d' "$Y" "$M")")
done

echo "Retention window (YYYY.MM): ${KEEP_YYYY_MM[*]}"

# Collect candidate tags — strictly `dev-YYYY.MM.N` / `stable-YYYY.MM.N`.
LEGACY_TAGS=$(git tag -l 'dev-*' 'stable-*' \
  | grep -E '^(dev|stable)-[0-9]{4}\.[0-9]{2}\.[0-9]+$' \
  || true)

# Filter: keep tags whose YYYY.MM is in the keep window; prune the rest.
TO_PRUNE=()
if [ -n "$LEGACY_TAGS" ]; then
  while IFS= read -r TAG; do
    [ -z "$TAG" ] && continue
    TAG_YM=$(echo "$TAG" | sed -E 's/^(dev|stable)-([0-9]{4}\.[0-9]{2})\.[0-9]+$/\2/')
    KEEP=0
    for KEEP_YM in "${KEEP_YYYY_MM[@]}"; do
      if [ "$TAG_YM" = "$KEEP_YM" ]; then KEEP=1; break; fi
    done
    if [ "$KEEP" = "0" ]; then
      TO_PRUNE+=("$TAG")
    fi
  done <<< "$LEGACY_TAGS"
fi

SECTION1_HAS_WORK=0

if [ -z "$LEGACY_TAGS" ]; then
  echo "No legacy CalVer tags found — nothing to prune."
elif [ "${#TO_PRUNE[@]}" -eq 0 ]; then
  echo "All legacy tags are within retention window — nothing to prune."
else
  SECTION1_HAS_WORK=1
  echo "Will prune ${#TO_PRUNE[@]} tags older than the retention window:"
  # Array slice instead of `printf … | head` — under `set -o pipefail`,
  # head closing the pipe early can SIGPIPE printf (exit 141) and abort
  # the script before any deletion runs. The slice avoids the pipeline.
  printf ' %s\n' "${TO_PRUNE[@]:0:20}"
  [ "${#TO_PRUNE[@]}" -gt 20 ] && echo " ... (and $((${#TO_PRUNE[@]} - 20)) more)"
fi

if [ "$SECTION1_HAS_WORK" = "1" ] && [ "$DRY_RUN" = "1" ]; then
  echo "(dry-run — no deletions)"
  SECTION1_HAS_WORK=0
fi

# Track failures so the workflow run turns red even when individual
# operations were swallowed by `|| ...` fallbacks. Stdout warnings alone
# are invisible on a green run, so a hard exit-1 at the end is the only
# reliable signal to operators.
PRUNE_FAILED=0

if [ "$SECTION1_HAS_WORK" = "1" ]; then
  # Fetch GHCR versions BEFORE any git-tag deletion — if the API call
  # fails (403 missing scope, 429 rate limit, transient 5xx), we abort
  # cleanly with no state change. Doing the irrecoverable git-tag delete
  # first risked orphan GHCR images: the next run rebuilds TO_PRUNE from
  # `git tag -l`, so without the local git tag the orphan image is never
  # enumerated again.
  TAG_TO_ID=""
  if [ -n "${GH_TOKEN:-}" ]; then
    ORG=$(echo "$REPO" | cut -d/ -f1)
    PKG_NAME=$(echo "$REPO" | cut -d/ -f2)
    echo "Fetching GHCR image versions for $ORG/$PKG_NAME ..."

    # One paginated fetch up-front, then per-tag lookups against the
    # cached result. Avoids O(N × pages) API calls on a multi-month
    # backlog (legacy CalVer tag counts run ~500/month per channel).
    # No `|| echo "[]"` fallback — let `set -e` propagate API failure
    # rather than silently turning every TAG into a no-op skip.
    VERSIONS_JSON=$(gh api \
      "/orgs/${ORG}/packages/container/${PKG_NAME}/versions" \
      --paginate)

    # CRITICAL: GHCR's DELETE-version drops the entire manifest, taking
    # EVERY tag on it (including `:stable`, `:dev`, `dev-<user>-latest`).
    # After a rollback re-tag, the previous-known-good version carries
    # both `:stable` and its CalVer tag — pruning that CalVer tag would
    # vaporize `:stable`. So skip any version that also carries a
    # floating alias. The jq filter applies that exclusion up-front.
    TAG_TO_ID=$(echo "$VERSIONS_JSON" | jq -r '
      .[]
      | select(
          (.metadata.container.tags | index("stable") // false | not) and
          (.metadata.container.tags | index("dev") // false | not) and
          ((.metadata.container.tags | map(endswith("-latest")) | any) | not)
        )
      | . as $v
      | .metadata.container.tags[] as $t
      | "\($t)\t\($v.id)"
    ')
  else
    echo "GH_TOKEN not set — GHCR image deletion will be skipped (git tags will still be pruned below)."
  fi

  # Delete git tags. Local delete is gated on successful remote push —
  # if the remote refuses (protected tag, missing contents:write,
  # transient failure), leaving the local tag in place means the next
  # run retries the same TAG cleanly. checkout@v6 re-fetches tags so a
  # successful local-only delete would just come back anyway.
  for TAG in "${TO_PRUNE[@]}"; do
    echo " deleting tag: $TAG"
    if git push origin --delete "$TAG"; then
      git tag -d "$TAG" 2>/dev/null || true
    else
      echo " (remote push failed — leaving local tag in place for retry; check tag-protection rules or contents:write scope)"
      PRUNE_FAILED=1
    fi
  done

  # Delete GHCR image versions using the up-front fetch.
  if [ -n "${GH_TOKEN:-}" ]; then
    echo "Deleting matching GHCR image versions ..."
    for TAG in "${TO_PRUNE[@]}"; do
      VERSION_ID=$(echo "$TAG_TO_ID" | awk -v t="$TAG" '$1==t {print $2; exit}')
      if [ -n "$VERSION_ID" ]; then
        echo " deleting GHCR image $TAG (version $VERSION_ID)"
        if ! gh api -X DELETE \
          "/orgs/${ORG}/packages/container/${PKG_NAME}/versions/${VERSION_ID}"; then
          echo " (DELETE failed — check packages:write scope, rate limits, or version already gone)"
          PRUNE_FAILED=1
        fi
      else
        echo " skipping GHCR image $TAG — no eligible version (already gone, or shares a manifest with :stable/:dev/*-latest)"
      fi
    done
  fi
fi

if [ "$PRUNE_FAILED" = "1" ]; then
  echo "::error::One or more prune operations failed — see warnings above"
  exit 1
fi

echo "Prune complete."
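
The floating-alias guard that the jq filter in the script applies can be restated in plain bash for a single image version. This is an illustrative sketch, not part of the script: `is_prunable_version` is a hypothetical helper that receives the tags attached to one GHCR version and succeeds only when none of them is `stable`, `dev`, or ends in `-latest`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if no tag on this version is a floating alias.
# Arguments: every tag attached to one image version.
is_prunable_version() {
  local tag
  for tag in "$@"; do
    case "$tag" in
      stable|dev|*-latest) return 1 ;;   # deleting the version would drop this alias too
    esac
  done
  return 0
}

# A version carrying only its CalVer tag can be deleted.
is_prunable_version "dev-2026.03.12" && echo "prunable"

# After a rollback re-tag the same version also carries :stable, so it is skipped.
is_prunable_version "stable-2026.04.474" "stable" || echo "skipped"
```

The check is per version rather than per tag because deleting a GHCR version removes every tag on that manifest, so one floating alias is enough to make the whole version off-limits.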