From 9f5adbce37950f5fb81366e26a23049109ec139b Mon Sep 17 00:00:00 2001
From: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>
Date: Fri, 15 May 2026 14:06:59 +0200
Subject: [PATCH] ci: consolidate release pipeline (salvageable subset of #139) (#314)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks —
  prevents an actor with workflow_dispatch privilege from injecting shell
  via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists (fresh
  repo, or aggressive pruning at month boundary). Fail loudly rather than
  re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing the pipe
  early SIGPIPEs printf (exit 141) and aborts the script before any
  deletion runs — exactly the multi-month-backlog scenario the script
  targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then build
  a tag→version-id map up-front. Closes two problems:
  1. O(N × pages) GHCR API calls collapse to one paginated listing —
     months of accumulated CalVer tags no longer risk tripping abuse
     detection.
  2. The new jq filter excludes any version that ALSO carries a floating
     alias (:stable, :dev, *-latest). GHCR DELETE-version drops the entire
     manifest, so pruning a CalVer tag that shares a manifest with :stable
     (e.g. after a rollback re-tag) would have vaporized :stable. Now it's
     skipped with a log line.

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path filter)
  were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback--, cancel-in-progress:
  false). The caller release.yml uses cancel-in-progress: true to avoid
  duplicate CalVer claims, but a second push to main mid-rollback would
  otherwise kill the workflow between the :stable recovery push and the
  :deprecated-* audit push, leaving :stable stuck on the broken image. A
  reusable workflow's own concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated- GHCR alias already exists (carries the mark of a prior
  failed rollback). The previous 'second-most-recent' heuristic could
  re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata rather
  than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write —
  and read as 'this line just documents a default', so a future cleanup PR
  could delete it. The line is load-bearing: workflow_call permissions are
  bounded by the caller's GITHUB_TOKEN scope, and removing it would
  silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a green
  run while every TAG fell through to the same 'no eligible version' skip
  message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag delete
  is irrecoverable (next run rebuilds TO_PRUNE from 'git tag -l', so an
  orphan GHCR image is never enumerated again). Fetching first means an
  API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote push,
  so a refused remote delete (tag-protection rule, missing contents:write)
  leaves the local tag in place for retry. The flag propagates to a final
  'exit 1' so the cron run turns red on any push or DELETE failure.

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to match
  the workflow's recursive 'scripts/ops/**.sh' path filter. The previous
  bare 'scripts/ops/*.sh' glob only matched top-level files; a future
  script under a subdirectory would have triggered the workflow but never
  been linted.

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and correct
(changelog discipline + release-cut belongs to the PR + run the full test
suite).
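
The SIGPIPE failure described for prune-dev-tags.sh can be reproduced in isolation. The following is a minimal sketch, not code from this patch: `preview` is a hypothetical helper, and the synthetic 100,000-tag backlog is an assumed size chosen only to exceed a typical pipe buffer.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the patch): print at most LIMIT entries,
# then a summary of how many were left out. There is no pipeline here, so
# nothing can receive SIGPIPE.
preview() {
  local limit=$1
  shift
  local -a items=("$@")
  [ "${#items[@]}" -eq 0 ] && return 0
  printf '  %s\n' "${items[@]:0:$limit}"
  if [ "${#items[@]}" -gt "$limit" ]; then
    echo "  ... (and $(( ${#items[@]} - limit )) more)"
  fi
}

# Synthetic backlog (assumed size): large enough to overflow the pipe buffer.
mapfile -t TO_PRUNE < <(seq -f 'dev-2025.01.%g' 1 100000)

# Fragile form, isolated in a subshell so its failure cannot abort this demo.
# head exits after 20 lines while printf is still writing.
if ( set -o pipefail; printf '  %s\n' "${TO_PRUNE[@]}" | head -20 >/dev/null ) 2>/dev/null; then
  echo "fragile pipeline: exit 0"
else
  echo "fragile pipeline: exit $?"
fi

# Robust form: slice the array instead of piping to head.
preview 20 "${TO_PRUNE[@]}"
```

On a typical Linux runner the fragile pipeline reports a non-zero status (141 where SIGPIPE keeps its default disposition), while the slice prints 20 entries plus the summary line. With a short tag list the pipeline usually survives, which is likely why the problem only appears on a multi-month backlog.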
---
 .github/workflows/deploy.yml         |  27 ----
 .github/workflows/lint-workflows.yml |  67 ++++++++++
 .github/workflows/prune-dev-tags.yml |  44 +++++++
 .github/workflows/release.yml        |  71 ++++-------
 .github/workflows/rollback.yml       | 179 +++++++++++++++++++++++++++
 .gitignore                           |   3 -
 CHANGELOG.md                         |  15 +++
 docs/RELEASING.md                    |  90 ++++++++++++++
 pyproject.toml                       |   2 +-
 scripts/ops/prune-dev-tags.sh        | 174 ++++++++++++++++++++++++++
 10 files changed, 593 insertions(+), 79 deletions(-)
 delete mode 100644 .github/workflows/deploy.yml
 create mode 100644 .github/workflows/lint-workflows.yml
 create mode 100644 .github/workflows/prune-dev-tags.yml
 create mode 100644 .github/workflows/rollback.yml
 create mode 100755 scripts/ops/prune-dev-tags.sh

diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
deleted file mode 100644
index ed1828c..0000000
--- a/.github/workflows/deploy.yml
+++ /dev/null
@@ -1,27 +0,0 @@
-# SUPERSEDED by release.yml — CalVer tagging with stable/dev channels.
-# Kept for manual trigger only. Automated builds use release.yml.
-name: Build & Push (legacy)
-
-on:
-  workflow_dispatch: {}
-
-jobs:
-  test:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v6
-
-      - uses: actions/setup-python@v6
-        with:
-          python-version: "3.13"
-
-      - name: Install uv
-        uses: astral-sh/setup-uv@v7
-
-      - name: Install dependencies
-        run: uv pip install --system ".[dev,server]"
-
-      - name: Run tests
-        run: pytest tests/ -v --tb=short
-        env:
-          TESTING: "1"
diff --git a/.github/workflows/lint-workflows.yml b/.github/workflows/lint-workflows.yml
new file mode 100644
index 0000000..c823bbd
--- /dev/null
+++ b/.github/workflows/lint-workflows.yml
@@ -0,0 +1,67 @@
+name: Lint workflows
+
+# Catches GitHub Actions / shellcheck issues in workflow YAMLs before they
+# break a real release. Runs on push/PR that touches anything under
+# .github/workflows/ and on manual workflow_dispatch. Keeps non-blocking
+# (warnings only) initially — flip to fail-fast when the existing inventory
+# is clean.
+
+on:
+  push:
+    branches: [main]
+    paths:
+      - ".github/workflows/**"
+      - "scripts/ops/**.sh"
+  pull_request:
+    paths:
+      - ".github/workflows/**"
+      - "scripts/ops/**.sh"
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+jobs:
+  actionlint:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Run actionlint
+        run: |
+          # Pin to a specific actionlint version for reproducibility.
+          # Updates: bump the version string + verify rules in CHANGELOG.
+          ACTIONLINT_VERSION="1.7.7"
+          curl -sSL \
+            "https://github.com/rhysd/actionlint/releases/download/v${ACTIONLINT_VERSION}/actionlint_${ACTIONLINT_VERSION}_linux_amd64.tar.gz" \
+            | tar xz actionlint
+          ./actionlint -color
+        # Continue-on-error initially: surface findings without blocking
+        # while the existing workflow inventory is being cleaned up. Flip
+        # to false (default) once the repo is actionlint-clean.
+        continue-on-error: true
+
+      - name: Run shellcheck on ops scripts
+        # actionlint above only walks `.github/workflows/**` + the shell
+        # snippets embedded inside their `run:` blocks; freestanding
+        # `scripts/ops/**/*.sh` files (which are also in this workflow's
+        # path filter via the `**.sh` glob) need their own pass.
+        # shellcheck is pre-installed on ubuntu-latest runners.
+        #
+        # `find` matches the recursive `**.sh` path filter above. A bare
+        # `scripts/ops/*.sh` glob would silently skip future scripts under
+        # subdirectories — the workflow would trigger on them (filter
+        # matches) but never lint them.
+        #
+        # `--severity=warning` blocks only on warning+ findings (actual
+        # bugs); info/style level passes silently. This lets the existing
+        # inventory's info-level findings (e.g. SC1091, SC2015 in
+        # agnes-auto-upgrade.sh / agnes-tls-rotate.sh) ride through while
+        # still catching real regressions in new scripts.
+        run: |
+          mapfile -t SCRIPTS < <(find scripts/ops -type f -name '*.sh' 2>/dev/null)
+          if [ "${#SCRIPTS[@]}" -gt 0 ]; then
+            shellcheck --severity=warning "${SCRIPTS[@]}"
+          else
+            echo "No scripts/ops/**/*.sh found — nothing to check."
+          fi
diff --git a/.github/workflows/prune-dev-tags.yml b/.github/workflows/prune-dev-tags.yml
new file mode 100644
index 0000000..24cccf9
--- /dev/null
+++ b/.github/workflows/prune-dev-tags.yml
@@ -0,0 +1,44 @@
+name: Prune dev tags
+
+# Weekly housekeeping: prune legacy CalVer git tags + GHCR images
+# (dev-YYYY.MM.N / stable-YYYY.MM.N) on a KEEP_MONTHS retention window
+# (current + previous month by default). Manual trigger supports a
+# dry-run and a KEEP_MONTHS override. Floating aliases (:stable, :dev,
+# *-latest) are git-tagless and never matched, so they are never pruned.
+# Scheduled runs always prune for real; use workflow_dispatch with
+# dry_run=true to preview.
+
+on:
+  schedule:
+    - cron: '0 4 * * 0'  # Sundays 04:00 UTC
+  workflow_dispatch:
+    inputs:
+      dry_run:
+        description: 'Dry-run only — list tags that would be pruned, do not delete'
+        type: boolean
+        default: true
+      keep_months:
+        description: 'Keep current month + this many previous months (e.g. 1 = 2 months total)'
+        type: string
+        default: '1'
+
+permissions:
+  contents: write  # delete git tags
+  packages: write  # delete GHCR image versions
+
+jobs:
+  prune:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          fetch-tags: true
+
+      - name: Run prune
+        env:
+          GITHUB_REPOSITORY: ${{ github.repository }}
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          KEEP_MONTHS: ${{ inputs.keep_months || '1' }}
+          PRUNE_DRY_RUN: ${{ inputs.dry_run && '1' || '0' }}
+        run: bash scripts/ops/prune-dev-tags.sh
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
index 72b744c..79fe0b0 100644
--- a/.github/workflows/release.yml
+++ b/.github/workflows/release.yml
@@ -24,10 +24,14 @@ on:
 permissions:
   contents: write
   packages: write
-  # `issues: write` lets the smoke-test job's rollback step open a
-  # GitHub issue alerting operators when an auto-rollback fires. Without
-  # this, the `gh issue create` call hits 403 and the `|| echo` fallback
-  # silently swallows it — operators see :stable revert with no alert.
+  # issues: write — explicitly granted at workflow scope so the
+  # rollback-on-smoke-fail job (which calls rollback.yml via workflow_call)
+  # can open a tracking issue when an auto-rollback fires. Reusable-
+  # workflow permissions are bounded by the caller's GITHUB_TOKEN scope,
+  # so removing this line would silently 403 rollback.yml's gh issue
+  # create step (the || echo fallback would swallow the failure, leaving
+  # :stable reverted with no operator alert). Keep in sync with the
+  # rollback-on-smoke-fail job-level permissions below.
   issues: write
 
 # When a developer pushes a brand-new branch with code changes, GitHub fires
@@ -208,12 +212,10 @@ jobs:
           fetch-depth: 0
           fetch-tags: true
 
-      # Required for the rollback step's `docker push` to GHCR. The
-      # `build-and-push` job logs in for itself; this job needs its own
-      # login since GitHub Actions tokens are scoped per-job. Without it,
-      # the rollback hits "unauthenticated: User cannot be authenticated
-      # with the token provided" and silently leaves :stable pointing at
-      # the broken image (real incident: PR #137 / 4ec5ff44).
+      # Required so `Start Agnes from built image` can pull the just-built
+      # private GHCR image. The `build-and-push` job logs in for itself;
+      # this job needs its own login since GitHub Actions tokens are scoped
+      # per-job.
       - name: Log in to GHCR
         uses: docker/login-action@v4
         with:
@@ -234,44 +236,6 @@ jobs:
 
       - name: Run smoke tests
         run: bash scripts/smoke-test.sh http://localhost:8000
-      - name: Automatic rollback on failure
-        if: failure()
-        env:
-          # Required for the `gh issue create` call below — without GH_TOKEN
-          # the gh CLI fails the auth check and the issue creation falls
-          # through the `|| echo` fallback, so an operator never sees the
-          # rollback alert.
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-        run: |
-          IMAGE_TAG="${{ needs.build-and-push.outputs.image_tag }}"
-          VERSION="${{ needs.build-and-push.outputs.version }}"
-          DEPRECATED_TAG="deprecated-${VERSION}"
-          REPO="ghcr.io/${{ github.repository }}"
-
-          echo "Smoke test failed — initiating rollback"
-
-          # Tag the current (failed) image as :deprecated-YYYY.MM.N
-          docker pull "${REPO}:${IMAGE_TAG}"
-          docker tag "${REPO}:${IMAGE_TAG}" "${REPO}:${DEPRECATED_TAG}"
-          docker push "${REPO}:${DEPRECATED_TAG}"
-          echo "Tagged failed image as ${REPO}:${DEPRECATED_TAG}"
-
-          # Revert :stable to the previous known-good image
-          PREV_TAG=$(git tag -l "stable-*" --sort=-version:refname | head -2 | tail -1)
-          if [ -n "$PREV_TAG" ]; then
-            docker pull "${REPO}:${PREV_TAG}"
-            docker tag "${REPO}:${PREV_TAG}" "${REPO}:stable"
-            docker push "${REPO}:stable"
-            echo "Reverted :stable to ${PREV_TAG}"
-          else
-            echo "WARNING: No previous stable tag found — cannot revert :stable automatically"
-          fi
-
-          # Create a GitHub issue alerting about the failure
-          ISSUE_TITLE="Smoke test failure — rollback to ${PREV_TAG:-unknown}"
-          ISSUE_BODY="## Automatic Rollback Report\n\nThe smoke test for image \`${IMAGE_TAG}\` failed.\n\n- **Failed image**: \`${REPO}:${IMAGE_TAG}\`\n- **Deprecated tag**: \`${REPO}:${DEPRECATED_TAG}\`\n- **Rolled back to**: \`${PREV_TAG:-N/A}\`\n- **Commit**: \`${{ github.sha }}\`\n- **Run**: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\n\nPlease investigate and fix before re-deploying."
-          gh issue create --title "$ISSUE_TITLE" --body "$(echo -e "$ISSUE_BODY")" --label "bug" || echo "Failed to create GitHub issue (gh CLI may not be available)"
-
       - name: Collect logs on failure
         if: failure()
         run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml logs > smoke-test-logs.txt
@@ -287,6 +251,17 @@ jobs:
         if: always()
         run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml down -v
 
+  rollback-on-smoke-fail:
+    needs: [build-and-push, smoke-test]
+    if: failure() && needs.smoke-test.result == 'failure'
+    uses: ./.github/workflows/rollback.yml
+    with:
+      failed_image_tag: ${{ needs.build-and-push.outputs.image_tag }}
+    permissions:
+      contents: read
+      packages: write
+      issues: write
+
   # Reproduces the deploy shape that broke agnes-development on 2026-04-29:
   # the production stack uses docker-compose.host-mount.yml to bind-mount /data
   # from the host PD instead of using a Docker named volume. Docker initializes
diff --git a/.github/workflows/rollback.yml b/.github/workflows/rollback.yml
new file mode 100644
index 0000000..a47414a
--- /dev/null
+++ b/.github/workflows/rollback.yml
@@ -0,0 +1,179 @@
+name: Rollback :stable
+
+# Re-tag :stable to a previous known-good build, deprecate the failing
+# image, and open a tracking issue. Callable from release.yml on
+# smoke-test failure (workflow_call) or manually by an operator
+# (workflow_dispatch) when something breaks post-deploy.
+
+on:
+  workflow_call:
+    inputs:
+      failed_image_tag:
+        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
+        type: string
+        required: true
+      target_image_tag:
+        description: 'Override the rollback target. Defaults to the second-most-recent stable-* tag.'
+        type: string
+        required: false
+  workflow_dispatch:
+    inputs:
+      failed_image_tag:
+        description: 'The image_tag that failed (e.g. stable-2026.05.531)'
+        type: string
+        required: true
+      target_image_tag:
+        description: 'Rollback target. Defaults to the second-most-recent stable-* tag.'
+        type: string
+        required: false
+
+# NOTE: This top-level block has dual semantics:
+#   - On `workflow_dispatch` (manual operator trigger): governs the
+#     GITHUB_TOKEN scope directly.
+#   - On `workflow_call` from release.yml: the caller's job-level
+#     `permissions:` (rollback-on-smoke-fail) governs, intersected with
+#     this block as a cap. Tightening this block lowers the cap on both
+#     entry points; tightening the caller affects only the workflow_call
+#     path. Keep both in sync if you adjust either side.
+permissions:
+  contents: read
+  packages: write
+  issues: write
+
+# Override the caller's `cancel-in-progress: true` concurrency policy
+# (release.yml groups by ref and cancels older runs to avoid duplicate
+# CalVer claims). A rollback mid-flight must NOT be cancelled — the
+# re-tag step has multiple `docker push`es; a cancellation between them
+# would leave :stable on the broken image. A reusable workflow's own
+# concurrency block overrides the inherited one.
+concurrency:
+  group: rollback-${{ github.repository }}-${{ inputs.failed_image_tag }}
+  cancel-in-progress: false
+
+jobs:
+  rollback:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          fetch-depth: 0
+          fetch-tags: true
+
+      # GHCR login moved BEFORE target resolution so the resolve step can
+      # use `docker manifest inspect` to skip known-broken candidates
+      # (versions that already carry a `:deprecated-*` alias from a prior
+      # rollback).
+      - name: Log in to GHCR
+        uses: docker/login-action@v4
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+
+      - name: Resolve target image
+        id: target
+        # Inputs are passed via env to keep them out of the shell-script
+        # source — `${{ ... }}` is textual substitution, so an attacker with
+        # workflow_dispatch privilege could otherwise close a quote and
+        # inject commands. Env-var expansion does not re-parse for command
+        # substitution, so it's safe.
+        env:
+          TARGET_INPUT: ${{ inputs.target_image_tag }}
+          FAILED: ${{ inputs.failed_image_tag }}
+          REPO_SLUG: ${{ github.repository }}
+        run: |
+          REPO="ghcr.io/${REPO_SLUG}"
+          if [ -n "$TARGET_INPUT" ]; then
+            TARGET="$TARGET_INPUT"
+          else
+            # Walk back through stable-* tags newest-first; skip any whose
+            # `:deprecated-` GHCR alias exists, because that
+            # marks a previously-failed release. The naive "second-most-
+            # recent" heuristic re-points :stable at known-broken images on
+            # cascading failures (rollback only pushes a deprecated alias,
+            # it does NOT delete the failed git tag — that would break
+            # CalVer immutability — so the failed tag stays in sort order
+            # on subsequent rollbacks).
+            TARGET=""
+            while IFS= read -r CANDIDATE; do
+              [ -z "$CANDIDATE" ] && continue
+              [ "$CANDIDATE" = "$FAILED" ] && continue
+              STRIPPED="${CANDIDATE#stable-}"
+              if docker manifest inspect "$REPO:deprecated-${STRIPPED}" > /dev/null 2>&1; then
+                echo " skipping $CANDIDATE (carries :deprecated-${STRIPPED} from a prior rollback)"
+                continue
+              fi
+              TARGET="$CANDIDATE"
+              break
+            done < <(git tag -l "stable-*" --sort=-version:refname)
+            if [ -z "$TARGET" ]; then
+              echo "::error::No known-good previous stable-* tag found — supply target_image_tag explicitly"
+              exit 1
+            fi
+          fi
+          # Defense in depth: even with the walk-back, refuse if the
+          # resolved target somehow matches FAILED (e.g. operator override
+          # via target_image_tag pointing at the failed build).
+          if [ "$TARGET" = "$FAILED" ]; then
+            echo "::error::Rollback target equals failed tag ($TARGET) — refusing to re-push broken image"
+            exit 1
+          fi
+          echo "target=$TARGET" >> "$GITHUB_OUTPUT"
+          echo "Rollback target: $TARGET"
+
+      - name: Re-tag :stable to target + mark failed image deprecated
+        env:
+          FAILED: ${{ inputs.failed_image_tag }}
+          TARGET: ${{ steps.target.outputs.target }}
+        run: |
+          REPO="ghcr.io/${{ github.repository }}"
+          if [[ "$FAILED" != stable-* ]]; then
+            echo "::warning::failed_image_tag '$FAILED' is not a stable-* tag — this workflow rolls back the :stable channel; the deprecated-* tag name may be non-standard."
+          fi
+          # Strip the channel prefix for a backward-compatible deprecated tag name
+          DEPRECATED="deprecated-${FAILED#stable-}"
+
+          # Order matters: push :stable recovery FIRST, then the
+          # :deprecated-* audit tag. If something interrupts mid-step
+          # (concurrency block above SHOULD prevent it, but defense in
+          # depth), the worst case is missing audit metadata — production
+          # is already healthy. The reverse order risked :stable stuck on
+          # the broken image.
+          docker pull "$REPO:$TARGET"
+          docker tag "$REPO:$TARGET" "$REPO:stable"
+          docker push "$REPO:stable"
+
+          docker pull "$REPO:$FAILED"
+          docker tag "$REPO:$FAILED" "$REPO:$DEPRECATED"
+          docker push "$REPO:$DEPRECATED"
+
+      - name: Open tracking issue
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          FAILED: ${{ inputs.failed_image_tag }}
+          TARGET: ${{ steps.target.outputs.target }}
+          REPO_SLUG: ${{ github.repository }}
+          EVENT: ${{ github.event_name }}
+          SERVER_URL: ${{ github.server_url }}
+          RUN_ID: ${{ github.run_id }}
+          SHA: ${{ github.sha }}
+        run: |
+          # Same channel-prefix strip as the re-tag step, so the issue body
+          # shows the deprecated tag name that was actually pushed.
+          DEPRECATED="deprecated-${FAILED#stable-}"
+          gh issue create \
+            --title "Rollback: :stable reverted from $FAILED to $TARGET" \
+            --body "$(cat <` GHCR alias (i.e. wasn't previously auto-rolled-
+back). That prevents cascading failures from re-pointing `:stable` at a
+known-broken image. To force a specific target:
+
+```bash
+gh workflow run rollback.yml \
+  --repo keboola/agnes-the-ai-analyst \
+  -f failed_image_tag=stable-2026.05.531 \
+  -f target_image_tag=stable-2026.04.474
+```
+
+Notes:
+- The workflow does NOT delete the failed git tag (CalVer immutability is
+  preserved) — only the GHCR `:stable` alias is re-pointed and the failed
+  image gains a `:deprecated-*` audit alias.
+- Re-tag order is `:stable` recovery first, then `:deprecated-*` audit, so
+  a mid-step interruption leaves production healthy with at-worst missing
+  audit metadata.
+- Concurrency: `cancel-in-progress: false` (overrides the caller workflow's
+  cancellation policy) so a subsequent push to `main` won't kill a
+  rollback mid-flight.
+
 ### `keboola-deploy.yml` — tag-triggered, explicit deploy only
 
 Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
@@ -222,6 +278,40 @@ Use this when the consumer (e.g. a customer dev VM) needs
 pushes by other contributors. The infra repo pins
 `image_tag = "keboola-deploy-latest"` on the relevant VM.
 
+### `prune-dev-tags.yml` — weekly CalVer + GHCR housekeeping
+
+Cron `0 4 * * 0` (Sundays 04:00 UTC) + `workflow_dispatch`. Prunes legacy
+CalVer git tags (`dev-YYYY.MM.N`, `stable-YYYY.MM.N`) and the matching
+GHCR image versions older than `KEEP_MONTHS` (default `1` → keep current
++ previous month). Floating aliases (`:stable`, `:dev`, `*-latest`) are
+never matched: they are git-tagless, and the GHCR pass explicitly skips
+any version that shares a manifest with a floating alias to avoid
+collateral deletion of `:stable` after a rollback re-tag.
+
+**Manual preview** (no deletions, lists what would be pruned):
+
+```bash
+gh workflow run prune-dev-tags.yml \
+  --repo keboola/agnes-the-ai-analyst \
+  -f dry_run=true
+```
+
+**Force a wider window** (one-off aggressive cleanup):
+
+```bash
+gh workflow run prune-dev-tags.yml \
+  --repo keboola/agnes-the-ai-analyst \
+  -f keep_months=3
+```
+
+Scheduled (cron) runs always prune for real; `dry_run` is honored only on
+manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE
+failures and exits non-zero on any failure, so a refused remote push (tag-
+protection rule, missing scope) or a GHCR API error turns the cron run
+red instead of silently swallowing it. Local `git tag -d` is gated on
+successful remote push, so a refused delete leaves the local tag in place
+for retry on the next run.
+
 ### Module versioning
 
 The customer-instance Terraform module under `infra/modules/customer-instance/`
diff --git a/pyproject.toml b/pyproject.toml
index 7de12ce..452289a 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.54.16"
+version = "0.54.17"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"
diff --git a/scripts/ops/prune-dev-tags.sh b/scripts/ops/prune-dev-tags.sh
new file mode 100755
index 0000000..a44245a
--- /dev/null
+++ b/scripts/ops/prune-dev-tags.sh
@@ -0,0 +1,174 @@
+#!/usr/bin/env bash
+# Prune legacy CalVer dev/stable image identity from git + GHCR:
+#
+# Git tags + GHCR image versions of the form
+#   dev-YYYY.MM.N      e.g. dev-2026.04.475
+#   stable-YYYY.MM.N   e.g. stable-2026.04.474
+# accumulate one per CI build. Retention: KEEP_MONTHS (default 1) keeps
+# the current month + the previous KEEP_MONTHS months; older tags +
+# images are pruned.
+#
+# Dry-run via PRUNE_DRY_RUN=1 (or workflow input) — lists what would be
+# pruned without acting.
+#
+# Idempotent: re-running with no eligible tags exits 0.
+
+set -euo pipefail
+
+KEEP_MONTHS="${KEEP_MONTHS:-1}"
+[[ "$KEEP_MONTHS" =~ ^[0-9]+$ ]] || { echo "KEEP_MONTHS must be a non-negative integer (got: '$KEEP_MONTHS')"; exit 1; }
+DRY_RUN="${PRUNE_DRY_RUN:-0}"
+REPO="${GITHUB_REPOSITORY:?GITHUB_REPOSITORY env var must be set (e.g. keboola/agnes-the-ai-analyst)}"
+
+cd "$(git rev-parse --show-toplevel)"
+
+# Compute the set of YYYY.MM strings to KEEP — walk back KEEP_MONTHS+1
+# months from today.
+TODAY_YEAR=$(date +%Y)
+TODAY_MONTH=$(date +%m)
+TODAY_MONTH_NUM=$((10#$TODAY_MONTH)) # strip leading zero for arithmetic
+
+KEEP_YYYY_MM=()
+for i in $(seq 0 "$KEEP_MONTHS"); do
+  Y=$TODAY_YEAR
+  M=$((TODAY_MONTH_NUM - i))
+  while [ "$M" -lt 1 ]; do
+    M=$((M + 12))
+    Y=$((Y - 1))
+  done
+  KEEP_YYYY_MM+=("$(printf '%04d.%02d' "$Y" "$M")")
+done
+
+echo "Retention window (YYYY.MM): ${KEEP_YYYY_MM[*]}"
+
+# Collect candidate tags — strictly `dev-YYYY.MM.N` / `stable-YYYY.MM.N`.
+LEGACY_TAGS=$(git tag -l 'dev-*' 'stable-*' \
+  | grep -E '^(dev|stable)-[0-9]{4}\.[0-9]{2}\.[0-9]+$' \
+  || true)
+
+# Filter: keep tags whose YYYY.MM is in the keep window; prune the rest.
+TO_PRUNE=()
+if [ -n "$LEGACY_TAGS" ]; then
+  while IFS= read -r TAG; do
+    [ -z "$TAG" ] && continue
+    TAG_YM=$(echo "$TAG" | sed -E 's/^(dev|stable)-([0-9]{4}\.[0-9]{2})\.[0-9]+$/\2/')
+    KEEP=0
+    for KEEP_YM in "${KEEP_YYYY_MM[@]}"; do
+      if [ "$TAG_YM" = "$KEEP_YM" ]; then KEEP=1; break; fi
+    done
+    if [ "$KEEP" = "0" ]; then
+      TO_PRUNE+=("$TAG")
+    fi
+  done <<< "$LEGACY_TAGS"
+fi
+
+SECTION1_HAS_WORK=0
+
+if [ -z "$LEGACY_TAGS" ]; then
+  echo "No legacy CalVer tags found — nothing to prune."
+elif [ "${#TO_PRUNE[@]}" -eq 0 ]; then
+  echo "All legacy tags are within retention window — nothing to prune."
+else
+  SECTION1_HAS_WORK=1
+  echo "Will prune ${#TO_PRUNE[@]} tags older than the retention window:"
+  # Array slice instead of `printf … | head` — under `set -o pipefail`,
+  # head closing the pipe early can SIGPIPE printf (exit 141) and abort
+  # the script before any deletion runs. The slice avoids the pipeline.
+  printf ' %s\n' "${TO_PRUNE[@]:0:20}"
+  [ "${#TO_PRUNE[@]}" -gt 20 ] && echo " ... (and $((${#TO_PRUNE[@]} - 20)) more)"
+fi
+
+if [ "$SECTION1_HAS_WORK" = "1" ] && [ "$DRY_RUN" = "1" ]; then
+  echo "(dry-run — no deletions)"
+  SECTION1_HAS_WORK=0
+fi
+
+# Track failures so the workflow run turns red even when individual
+# operations were swallowed by `|| ...` fallbacks. Stdout warnings alone
+# are invisible on a green run, so a hard exit-1 at the end is the only
+# reliable signal to operators.
+PRUNE_FAILED=0
+
+if [ "$SECTION1_HAS_WORK" = "1" ]; then
+  # Fetch GHCR versions BEFORE any git-tag deletion — if the API call
+  # fails (403 missing scope, 429 rate limit, transient 5xx), we abort
+  # cleanly with no state change. Doing the irrecoverable git-tag delete
+  # first risked orphan GHCR images: the next run rebuilds TO_PRUNE from
+  # `git tag -l`, so without the local git tag the orphan image is never
+  # enumerated again.
+  TAG_TO_ID=""
+  if [ -n "${GH_TOKEN:-}" ]; then
+    ORG=$(echo "$REPO" | cut -d/ -f1)
+    PKG_NAME=$(echo "$REPO" | cut -d/ -f2)
+    echo "Fetching GHCR image versions for $ORG/$PKG_NAME ..."
+
+    # One paginated fetch up-front, then per-tag lookups against the
+    # cached result. Avoids O(N × pages) API calls on a multi-month
+    # backlog (legacy CalVer tag counts run ~500/month per channel).
+    # No `|| echo "[]"` fallback — let `set -e` propagate API failure
+    # rather than silently turning every TAG into a no-op skip.
+    VERSIONS_JSON=$(gh api \
+      "/orgs/${ORG}/packages/container/${PKG_NAME}/versions" \
+      --paginate)
+
+    # CRITICAL: GHCR's DELETE-version drops the entire manifest, taking
+    # EVERY tag on it (including `:stable`, `:dev`, `dev--latest`).
+    # After a rollback re-tag, the previous-known-good version carries
+    # both `:stable` and its CalVer tag — pruning that CalVer tag would
+    # vaporize `:stable`. So skip any version that also carries a
+    # floating alias. The jq filter applies that exclusion up-front.
+    TAG_TO_ID=$(echo "$VERSIONS_JSON" | jq -r '
+      .[]
+      | select(
+          (.metadata.container.tags | index("stable") // false | not) and
+          (.metadata.container.tags | index("dev") // false | not) and
+          ((.metadata.container.tags | map(endswith("-latest")) | any) | not)
+        )
+      | . as $v
+      | .metadata.container.tags[] as $t
+      | "\($t)\t\($v.id)"
+    ')
+  else
+    echo "GH_TOKEN not set — GHCR image deletion will be skipped (git tags will still be pruned below)."
+  fi
+
+  # Delete git tags. Local delete is gated on successful remote push —
+  # if the remote refuses (protected tag, missing contents:write,
+  # transient failure), leaving the local tag in place means the next
+  # run retries the same TAG cleanly. checkout@v6 re-fetches tags so a
+  # successful local-only delete would just come back anyway.
+  for TAG in "${TO_PRUNE[@]}"; do
+    echo " deleting tag: $TAG"
+    if git push origin --delete "$TAG"; then
+      git tag -d "$TAG" 2>/dev/null || true
+    else
+      echo " (remote push failed — leaving local tag in place for retry; check tag-protection rules or contents:write scope)"
+      PRUNE_FAILED=1
+    fi
+  done
+
+  # Delete GHCR image versions using the up-front fetch.
+  if [ -n "${GH_TOKEN:-}" ]; then
+    echo "Deleting matching GHCR image versions ..."
+    for TAG in "${TO_PRUNE[@]}"; do
+      VERSION_ID=$(echo "$TAG_TO_ID" | awk -v t="$TAG" '$1==t {print $2; exit}')
+      if [ -n "$VERSION_ID" ]; then
+        echo " deleting GHCR image $TAG (version $VERSION_ID)"
+        if ! gh api -X DELETE \
+          "/orgs/${ORG}/packages/container/${PKG_NAME}/versions/${VERSION_ID}"; then
+          echo " (DELETE failed — check packages:write scope, rate limits, or version already gone)"
+          PRUNE_FAILED=1
+        fi
+      else
+        echo " skipping GHCR image $TAG — no eligible version (already gone, or shares a manifest with :stable/:dev/*-latest)"
+      fi
+    done
+  fi
+fi
+
+if [ "$PRUNE_FAILED" = "1" ]; then
+  echo "::error::One or more prune operations failed — see warnings above"
+  exit 1
+fi
+
+echo "Prune complete."
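
The retention-window arithmetic in prune-dev-tags.sh reads the current date, which makes the year wrap hard to check by hand. As a worked example under stated assumptions, the sketch below restates the same loop as a function that takes the date as arguments. `keep_window` and its three parameters are hypothetical names for illustration and are not part of the patch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# keep_window YEAR MONTH KEEP_MONTHS  (hypothetical helper, illustration only)
# Prints the YYYY.MM values the prune script would retain for that date:
# the given month plus KEEP_MONTHS earlier months.
keep_window() {
  local year=$1
  local month=$((10#$2))   # force base 10 so "08" and "09" are not read as octal
  local keep=$3
  local i y m
  for ((i = 0; i <= keep; i++)); do
    y=$year
    m=$((month - i))
    # Borrow from the year whenever the month underflows past January.
    while [ "$m" -lt 1 ]; do
      m=$((m + 12))
      y=$((y - 1))
    done
    printf '%04d.%02d\n' "$y" "$m"
  done
}

# January 2026 with KEEP_MONTHS=2 crosses the year boundary and prints:
#   2026.01
#   2025.12
#   2025.11
keep_window 2026 01 2
```

Passing the date in rather than calling `date` is only a convenience for checking edge cases such as January; the script itself derives year and month from the runner's clock.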