ci: consolidate release pipeline (salvageable subset of #139) (#314)

* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.
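
A minimal sketch of the env: point above, runnable locally without
Actions. The payload and the helper names (unsafe_output, safe_output)
are hypothetical; `bash -c` on a pre-built string stands in for the
runner pasting ${{ }} text into the script source before bash parses it:

```bash
#!/usr/bin/env bash
# Hypothetical value an actor could type into a workflow_dispatch input.
PAYLOAD='stable-1"; echo INJECTED; : "'

# Textual splicing: the input becomes part of the script source, so the
# payload's quote ends the string early and the rest runs as commands.
spliced_script="FAILED=\"${PAYLOAD}\"; printf '%s\n' \"\$FAILED\""
unsafe_output() { bash -c "$spliced_script"; }

# env: passing: the script source is fixed and the payload arrives as data.
safe_output() { FAILED="$PAYLOAD" bash -c 'printf "%s\n" "$FAILED"'; }

unsafe_output   # runs the embedded echo
safe_output     # prints the payload back unchanged
```

Only the env-based variant treats the payload as an opaque string; the
spliced variant executes the embedded echo.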

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.
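
The preview fix can be sketched as a small function. preview_tags is a
hypothetical helper name used only for illustration (the script inlines
the same logic); the tag names in the call are made up:

```bash
#!/usr/bin/env bash
set -euo pipefail

# preview_tags LIMIT TAG...: print at most LIMIT tags, then a summary of
# how many were left out. No pipeline is involved, so there is no reader
# that can close early and SIGPIPE the writer under pipefail.
preview_tags() {
  local limit=$1
  shift
  local tags=("$@")
  [ "${#tags[@]}" -eq 0 ] && return 0
  printf '  %s\n' "${tags[@]:0:$limit}"
  if [ "${#tags[@]}" -gt "$limit" ]; then
    echo "  ... (and $(( ${#tags[@]} - limit )) more)"
  fi
}

preview_tags 2 dev-2026.01.1 dev-2026.01.2 dev-2026.01.3
```

The slice is bounded by the array length, so a LIMIT larger than the
number of tags simply prints all of them.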

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).
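
The walk-back can be sketched in isolation. Assumptions for the sketch:
has_deprecated_alias is a hypothetical stand-in for the workflow's
`docker manifest inspect` probe, DEPRECATED_VERSIONS is a made-up
variable that lists versions already marked deprecated, and the
candidate list is passed in newest-first instead of read from `git tag`:

```bash
#!/usr/bin/env bash
# Stand-in for `docker manifest inspect "$REPO:deprecated-$1"`: succeeds
# when the stripped version appears in DEPRECATED_VERSIONS.
has_deprecated_alias() {
  local v
  for v in ${DEPRECATED_VERSIONS:-}; do
    [ "$v" = "$1" ] && return 0
  done
  return 1
}

# resolve_target FAILED CANDIDATE...: candidates must be newest-first.
# Prints the first candidate that is neither the failed tag nor already
# marked deprecated; returns 1 when none qualifies.
resolve_target() {
  local failed=$1 candidate
  shift
  for candidate in "$@"; do
    [ "$candidate" = "$failed" ] && continue
    has_deprecated_alias "${candidate#stable-}" && continue
    echo "$candidate"
    return 0
  done
  return 1
}

DEPRECATED_VERSIONS="2026.05.530"
resolve_target stable-2026.05.531 \
  stable-2026.05.531 stable-2026.05.530 stable-2026.05.529
# prints: stable-2026.05.529
```

With the old 'second-most-recent' heuristic the same input would have
picked stable-2026.05.530, the image a prior rollback already marked
as broken.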

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.
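
A sketch of the failure-flag pattern described in the last point.
remote_delete is a hypothetical stand-in for `git push origin --delete`
that refuses any tag starting with "protected-"; prune_all is an
illustrative name, not a function in the script:

```bash
#!/usr/bin/env bash
# Stand-in for the remote delete: fails for "protected-*" tags.
remote_delete() { [[ "$1" != protected-* ]]; }

# prune_all TAG...: attempt every tag, remember any failure, and report
# it through the return status only after the whole loop has run.
prune_all() {
  local failed=0 tag
  for tag in "$@"; do
    if remote_delete "$tag"; then
      echo "deleted $tag"        # the real script runs `git tag -d` here
    else
      echo "kept $tag for retry" # local tag stays in place
      failed=1
    fi
  done
  return "$failed"
}

prune_all dev-1 protected-dev-2 dev-3 || echo "run would turn red"
```

One refused delete does not stop the remaining tags from being
processed, but it still makes the overall status non-zero.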

lint-workflows.yml:
- shellcheck step now uses "find scripts/ops -type f -name '*.sh'" to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.
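
The difference between the two matchers can be checked in a scratch
directory. The file names below are made up for the demo:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build a throwaway tree with one top-level and one nested script.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/scripts/ops/sub"
touch "$demo_dir/scripts/ops/top.sh" "$demo_dir/scripts/ops/sub/nested.sh"

# A bare glob only sees the top level.
glob_matches=("$demo_dir"/scripts/ops/*.sh)

# find recurses, like the workflow's `scripts/ops/**.sh` path filter.
mapfile -t find_matches < <(find "$demo_dir/scripts/ops" -type f -name '*.sh' | sort)

echo "glob: ${#glob_matches[@]} file(s), find: ${#find_matches[@]} file(s)"
rm -rf "$demo_dir"
```

The glob reports one file and find reports two, which is the gap the
lint step used to have.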

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
ZdenekSrotyr 2026-05-15 14:06:59 +02:00 committed by GitHub
parent 7907b8082e
commit 9f5adbce37
10 changed files with 593 additions and 79 deletions


@ -1,27 +0,0 @@
# SUPERSEDED by release.yml — CalVer tagging with stable/dev channels.
# Kept for manual trigger only. Automated builds use release.yml.
name: Build & Push (legacy)
on:
workflow_dispatch: {}
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- uses: actions/setup-python@v6
with:
python-version: "3.13"
- name: Install uv
uses: astral-sh/setup-uv@v7
- name: Install dependencies
run: uv pip install --system ".[dev,server]"
- name: Run tests
run: pytest tests/ -v --tb=short
env:
TESTING: "1"

67
.github/workflows/lint-workflows.yml vendored Normal file

@ -0,0 +1,67 @@
name: Lint workflows
# Catches GitHub Actions / shellcheck issues in workflow YAMLs and ops
# scripts before they break a real release. Runs on push/PR that touches
# .github/workflows/** or scripts/ops/**.sh, and on manual
# workflow_dispatch. The actionlint step is non-blocking (continue-on-error)
# initially; flip it to fail-fast once the existing inventory is clean.
# The shellcheck step blocks on warning+ findings.
on:
push:
branches: [main]
paths:
- ".github/workflows/**"
- "scripts/ops/**.sh"
pull_request:
paths:
- ".github/workflows/**"
- "scripts/ops/**.sh"
workflow_dispatch:
permissions:
contents: read
jobs:
actionlint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
- name: Run actionlint
run: |
# Pin to a specific actionlint version for reproducibility.
# Updates: bump the version string + verify rules in CHANGELOG.
ACTIONLINT_VERSION="1.7.7"
curl -sSL \
"https://github.com/rhysd/actionlint/releases/download/v${ACTIONLINT_VERSION}/actionlint_${ACTIONLINT_VERSION}_linux_amd64.tar.gz" \
| tar xz actionlint
./actionlint -color
# Continue-on-error initially: surface findings without blocking
# while the existing workflow inventory is being cleaned up. Flip
# to false (default) once the repo is actionlint-clean.
continue-on-error: true
- name: Run shellcheck on ops scripts
# actionlint above only walks `.github/workflows/**` + the shell
# snippets embedded inside their `run:` blocks; freestanding
# `scripts/ops/**/*.sh` files (which are also in this workflow's
# path filter via the `**.sh` glob) need their own pass.
# shellcheck is pre-installed on ubuntu-latest runners.
#
# `find` matches the recursive `**.sh` path filter above. A bare
# `scripts/ops/*.sh` glob would silently skip future scripts under
# subdirectories — the workflow would trigger on them (filter
# matches) but never lint them.
#
# `--severity=warning` blocks only on warning+ findings (actual
# bugs); info/style level passes silently. This lets the existing
# inventory's info-level findings (e.g. SC1091, SC2015 in
# agnes-auto-upgrade.sh / agnes-tls-rotate.sh) ride through while
# still catching real regressions in new scripts.
run: |
mapfile -t SCRIPTS < <(find scripts/ops -type f -name '*.sh' 2>/dev/null)
if [ "${#SCRIPTS[@]}" -gt 0 ]; then
shellcheck --severity=warning "${SCRIPTS[@]}"
else
echo "No scripts/ops/**/*.sh found — nothing to check."
fi

44
.github/workflows/prune-dev-tags.yml vendored Normal file

@ -0,0 +1,44 @@
name: Prune dev tags
# Weekly housekeeping: prune legacy CalVer git tags + GHCR images
# (dev-YYYY.MM.N / stable-YYYY.MM.N) on a KEEP_MONTHS retention window
# (current + previous month by default). Manual trigger supports a
# dry-run and a KEEP_MONTHS override. Floating aliases (:stable, :dev,
# *-latest) are git-tagless and never matched, so they are never pruned.
# Scheduled runs always prune for real; use workflow_dispatch with
# dry_run=true to preview.
on:
schedule:
- cron: '0 4 * * 0' # Sundays 04:00 UTC
workflow_dispatch:
inputs:
dry_run:
description: 'Dry-run only — list tags that would be pruned, do not delete'
type: boolean
default: true
keep_months:
description: 'Keep current month + this many previous months (e.g. 1 = 2 months total)'
type: string
default: '1'
permissions:
contents: write # delete git tags
packages: write # delete GHCR image versions
jobs:
prune:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
with:
fetch-depth: 0
fetch-tags: true
- name: Run prune
env:
GITHUB_REPOSITORY: ${{ github.repository }}
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
KEEP_MONTHS: ${{ inputs.keep_months || '1' }}
PRUNE_DRY_RUN: ${{ inputs.dry_run && '1' || '0' }}
run: bash scripts/ops/prune-dev-tags.sh


@ -24,10 +24,14 @@ on:
permissions:
contents: write
packages: write
# `issues: write` lets the smoke-test job's rollback step open a
# GitHub issue alerting operators when an auto-rollback fires. Without
# this, the `gh issue create` call hits 403 and the `|| echo` fallback
# silently swallows it — operators see :stable revert with no alert.
# issues: write — explicitly granted at workflow scope so the
# rollback-on-smoke-fail job (which calls rollback.yml via workflow_call)
# can open a tracking issue when an auto-rollback fires. Reusable-
# workflow permissions are bounded by the caller's GITHUB_TOKEN scope,
# so removing this line would silently 403 rollback.yml's gh issue
# create step (the || echo fallback would swallow the failure, leaving
# :stable reverted with no operator alert). Keep in sync with the
# rollback-on-smoke-fail job-level permissions below.
issues: write
# When a developer pushes a brand-new branch with code changes, GitHub fires
@ -208,12 +212,10 @@ jobs:
fetch-depth: 0
fetch-tags: true
# Required for the rollback step's `docker push` to GHCR. The
# `build-and-push` job logs in for itself; this job needs its own
# login since GitHub Actions tokens are scoped per-job. Without it,
# the rollback hits "unauthenticated: User cannot be authenticated
# with the token provided" and silently leaves :stable pointing at
# the broken image (real incident: PR #137 / 4ec5ff44).
# Required so `Start Agnes from built image` can pull the just-built
# private GHCR image. The `build-and-push` job logs in for itself;
# this job needs its own login since GitHub Actions tokens are scoped
# per-job.
- name: Log in to GHCR
uses: docker/login-action@v4
with:
@ -234,44 +236,6 @@ jobs:
- name: Run smoke tests
run: bash scripts/smoke-test.sh http://localhost:8000
- name: Automatic rollback on failure
if: failure()
env:
# Required for the `gh issue create` call below — without GH_TOKEN
# the gh CLI fails the auth check and the issue creation falls
# through the `|| echo` fallback, so an operator never sees the
# rollback alert.
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
IMAGE_TAG="${{ needs.build-and-push.outputs.image_tag }}"
VERSION="${{ needs.build-and-push.outputs.version }}"
DEPRECATED_TAG="deprecated-${VERSION}"
REPO="ghcr.io/${{ github.repository }}"
echo "Smoke test failed — initiating rollback"
# Tag the current (failed) image as :deprecated-YYYY.MM.N
docker pull "${REPO}:${IMAGE_TAG}"
docker tag "${REPO}:${IMAGE_TAG}" "${REPO}:${DEPRECATED_TAG}"
docker push "${REPO}:${DEPRECATED_TAG}"
echo "Tagged failed image as ${REPO}:${DEPRECATED_TAG}"
# Revert :stable to the previous known-good image
PREV_TAG=$(git tag -l "stable-*" --sort=-version:refname | head -2 | tail -1)
if [ -n "$PREV_TAG" ]; then
docker pull "${REPO}:${PREV_TAG}"
docker tag "${REPO}:${PREV_TAG}" "${REPO}:stable"
docker push "${REPO}:stable"
echo "Reverted :stable to ${PREV_TAG}"
else
echo "WARNING: No previous stable tag found — cannot revert :stable automatically"
fi
# Create a GitHub issue alerting about the failure
ISSUE_TITLE="Smoke test failure — rollback to ${PREV_TAG:-unknown}"
ISSUE_BODY="## Automatic Rollback Report\n\nThe smoke test for image \`${IMAGE_TAG}\` failed.\n\n- **Failed image**: \`${REPO}:${IMAGE_TAG}\`\n- **Deprecated tag**: \`${REPO}:${DEPRECATED_TAG}\`\n- **Rolled back to**: \`${PREV_TAG:-N/A}\`\n- **Commit**: \`${{ github.sha }}\`\n- **Run**: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}\n\nPlease investigate and fix before re-deploying."
gh issue create --title "$ISSUE_TITLE" --body "$(echo -e "$ISSUE_BODY")" --label "bug" || echo "Failed to create GitHub issue (gh CLI may not be available)"
- name: Collect logs on failure
if: failure()
run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml logs > smoke-test-logs.txt
@ -287,6 +251,17 @@ jobs:
if: always()
run: docker compose -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.ci.yml down -v
rollback-on-smoke-fail:
needs: [build-and-push, smoke-test]
if: failure() && needs.smoke-test.result == 'failure'
uses: ./.github/workflows/rollback.yml
with:
failed_image_tag: ${{ needs.build-and-push.outputs.image_tag }}
permissions:
contents: read
packages: write
issues: write
# Reproduces the deploy shape that broke agnes-development on 2026-04-29:
# the production stack uses docker-compose.host-mount.yml to bind-mount /data
# from the host PD instead of using a Docker named volume. Docker initializes

179
.github/workflows/rollback.yml vendored Normal file

@ -0,0 +1,179 @@
name: Rollback :stable
# Re-tag :stable to a previous known-good build, deprecate the failing
# image, and open a tracking issue. Callable from release.yml on
# smoke-test failure (workflow_call) or manually by an operator
# (workflow_dispatch) when something breaks post-deploy.
on:
workflow_call:
inputs:
failed_image_tag:
description: 'The image_tag that failed (e.g. stable-2026.05.531)'
type: string
required: true
target_image_tag:
description: 'Override the rollback target. Defaults to the newest stable-* tag that is neither the failed tag nor already marked :deprecated-*.'
type: string
required: false
workflow_dispatch:
inputs:
failed_image_tag:
description: 'The image_tag that failed (e.g. stable-2026.05.531)'
type: string
required: true
target_image_tag:
description: 'Rollback target. Defaults to the newest stable-* tag that is neither the failed tag nor already marked :deprecated-*.'
type: string
required: false
# NOTE: This top-level block has dual semantics:
# - On `workflow_dispatch` (manual operator trigger): governs the
# GITHUB_TOKEN scope directly.
# - On `workflow_call` from release.yml: the caller's job-level
# `permissions:` (rollback-on-smoke-fail) governs, intersected with
# this block as a cap. Tightening this block lowers the cap on both
# entry points; tightening the caller affects only the workflow_call
# path. Keep both in sync if you adjust either side.
permissions:
contents: read
packages: write
issues: write
# Override the caller's `cancel-in-progress: true` concurrency policy
# (release.yml groups by ref and cancels older runs to avoid duplicate
# CalVer claims). A rollback mid-flight must NOT be cancelled — the
# re-tag step has multiple `docker push`es; a cancellation between them
# would leave :stable on the broken image. A reusable workflow's own
# concurrency block overrides the inherited one.
concurrency:
group: rollback-${{ github.repository }}-${{ inputs.failed_image_tag }}
cancel-in-progress: false
jobs:
rollback:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v6
with:
fetch-depth: 0
fetch-tags: true
# GHCR login moved BEFORE target resolution so the resolve step can
# use `docker manifest inspect` to skip known-broken candidates
# (versions that already carry a `:deprecated-*` alias from a prior
# rollback).
- name: Log in to GHCR
uses: docker/login-action@v4
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Resolve target image
id: target
# Inputs are passed via env to keep them out of the shell-script
# source — `${{ ... }}` is textual substitution, so an attacker with
# workflow_dispatch privilege could otherwise close a quote and
# inject commands. Env-var expansion does not re-parse for command
# substitution, so it's safe.
env:
TARGET_INPUT: ${{ inputs.target_image_tag }}
FAILED: ${{ inputs.failed_image_tag }}
REPO_SLUG: ${{ github.repository }}
run: |
REPO="ghcr.io/${REPO_SLUG}"
if [ -n "$TARGET_INPUT" ]; then
TARGET="$TARGET_INPUT"
else
# Walk back through stable-* tags newest-first; skip any whose
# `:deprecated-<stripped>` GHCR alias exists, because that
# marks a previously-failed release. The naive "second-most-
# recent" heuristic re-points :stable at known-broken images on
# cascading failures (rollback only pushes a deprecated alias,
# it does NOT delete the failed git tag — that would break
# CalVer immutability — so the failed tag stays in sort order
# on subsequent rollbacks).
TARGET=""
while IFS= read -r CANDIDATE; do
[ -z "$CANDIDATE" ] && continue
[ "$CANDIDATE" = "$FAILED" ] && continue
STRIPPED="${CANDIDATE#stable-}"
if docker manifest inspect "$REPO:deprecated-${STRIPPED}" > /dev/null 2>&1; then
echo " skipping $CANDIDATE (carries :deprecated-${STRIPPED} from a prior rollback)"
continue
fi
TARGET="$CANDIDATE"
break
done < <(git tag -l "stable-*" --sort=-version:refname)
if [ -z "$TARGET" ]; then
echo "::error::No known-good previous stable-* tag found — supply target_image_tag explicitly"
exit 1
fi
fi
# Defense in depth: even with the walk-back, refuse if the
# resolved target somehow matches FAILED (e.g. operator override
# via target_image_tag pointing at the failed build).
if [ "$TARGET" = "$FAILED" ]; then
echo "::error::Rollback target equals failed tag ($TARGET) — refusing to re-push broken image"
exit 1
fi
echo "target=$TARGET" >> "$GITHUB_OUTPUT"
echo "Rollback target: $TARGET"
- name: Re-tag :stable to target + mark failed image deprecated
env:
FAILED: ${{ inputs.failed_image_tag }}
TARGET: ${{ steps.target.outputs.target }}
run: |
REPO="ghcr.io/${{ github.repository }}"
if [[ "$FAILED" != stable-* ]]; then
echo "::warning::failed_image_tag '$FAILED' is not a stable-* tag — this workflow rolls back the :stable channel; the deprecated-* tag name may be non-standard."
fi
# Strip the channel prefix for a backward-compatible deprecated tag name
DEPRECATED="deprecated-${FAILED#stable-}"
# Order matters: push :stable recovery FIRST, then the
# :deprecated-* audit tag. If something interrupts mid-step
# (concurrency block above SHOULD prevent it, but defense in
# depth), the worst case is missing audit metadata — production
# is already healthy. The reverse order risked :stable stuck on
# the broken image.
docker pull "$REPO:$TARGET"
docker tag "$REPO:$TARGET" "$REPO:stable"
docker push "$REPO:stable"
docker pull "$REPO:$FAILED"
docker tag "$REPO:$FAILED" "$REPO:$DEPRECATED"
docker push "$REPO:$DEPRECATED"
- name: Open tracking issue
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
FAILED: ${{ inputs.failed_image_tag }}
TARGET: ${{ steps.target.outputs.target }}
REPO_SLUG: ${{ github.repository }}
EVENT: ${{ github.event_name }}
SERVER_URL: ${{ github.server_url }}
RUN_ID: ${{ github.run_id }}
SHA: ${{ github.sha }}
run: |
# Same channel-prefix strip as the re-tag step, so the issue body
# shows the deprecated tag name that was actually pushed.
DEPRECATED="deprecated-${FAILED#stable-}"
gh issue create \
--title "Rollback: :stable reverted from $FAILED to $TARGET" \
--body "$(cat <<EOF
## Rollback report
- Failed image: \`ghcr.io/${REPO_SLUG}:${FAILED}\`
- Commit: \`${SHA}\`
- Deprecated tag: \`ghcr.io/${REPO_SLUG}:${DEPRECATED}\`
- Rolled back to: \`ghcr.io/${REPO_SLUG}:${TARGET}\`
- Triggered by: ${EVENT}
- Run: ${SERVER_URL}/${REPO_SLUG}/actions/runs/${RUN_ID}
Investigate before re-deploying.
EOF
)" \
--label "bug" || echo "::warning::Failed to open rollback tracking issue — check gh auth / labels"

3
.gitignore vendored

@ -119,9 +119,6 @@ config/data_description.md
# Instance-specific data description (generated per-instance)
docs/data_description.md
# Actual deploy workflow (created from .example, may contain secrets in comments)
.github/workflows/deploy.yml
# Project-specific: Data directory
# Downloaded source data - never commit
data/


@ -10,6 +10,8 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
## [0.54.17] — 2026-05-15
### Changed
- `agnes refresh-marketplace --check` (the SessionStart-hook detector
that fires on every Claude Code session start in every workspace)
@ -34,6 +36,19 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
### Internal
- CI test suite sharded for speed. The `test` job in `.github/workflows/ci.yml` is now a `test-shard` matrix — 4 parallel jobs via `pytest-split`, balanced by a committed `.test_durations` file — aggregated into a single `test` status check so branch protection needs no change. The duplicate full-suite `test` job in `release.yml` is removed (it re-ran the same ~10 min suite a second time on every push to main/feature branches); `release.yml` is now image-build only, with the advisory ruff/mypy steps moved to a lean `lint` job in `ci.yml`. Net: ~10 min → ~3 min wall-clock per push, and the suite runs once instead of twice. Adds `pytest-split` to the `dev` extra.
- CI/release workflow polish (the still-salvageable subset of the
abandoned PR #139, after #311 obsoleted the test-job refactor):
`rollback.yml` extracts the `release.yml` smoke-test rollback into a
reusable + manually dispatchable workflow, with a warning guard on
non-`stable-*` `workflow_dispatch` inputs. `prune-dev-tags.yml` adds
weekly housekeeping (Sundays 04:00 UTC) of legacy CalVer git tags +
GHCR images outside a `KEEP_MONTHS` retention window; floating
aliases are git-tagless and never matched. `lint-workflows.yml` runs
`actionlint` (non-blocking initially) and `shellcheck --severity=warning`
on `.github/workflows/**` + `scripts/ops/**.sh` changes. The superseded
`deploy.yml` stub is removed.
Excludes #139's rejected pieces (Release Drafter, setuptools_scm,
run-number tag scheme, main-only release triggers, deletion of
`cli-wheel-clean-install`).
## [0.54.16] — 2026-05-14


@ -169,6 +169,14 @@ from a stale post-cut commit (we've shipped that race before).
`Release`-pipeline, Devin Review) are advisory — green/red doesn't gate merge.
- **`enforce_admins: true`** in branch protection means `--admin` flag on
`gh pr merge` does NOT bypass. Don't try; just fix the underlying block.
- **`lint-workflows.yml` is advisory.** Triggered on changes to
`.github/workflows/**` or `scripts/ops/**.sh`. Runs `actionlint` on
workflow YAMLs + `shellcheck --severity=warning` on freestanding ops
scripts. The `actionlint` step has `continue-on-error: true` initially
(the existing workflow inventory is still being cleaned up); flip to fail-fast
once the repo is actionlint-clean. The `shellcheck` step IS blocking at
warning+ severity — info/style findings ride through, real bugs break
CI.
### Recovery when something derails
@ -201,6 +209,54 @@ within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for
per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins,
regardless of who).
**Auto-rollback on smoke failure.** On `main` pushes, after `:stable` is
published, the `smoke-test` job pulls the just-built image and runs
`scripts/smoke-test.sh` against a docker-compose stack. If
that job fails, the `rollback-on-smoke-fail` job calls the reusable
`rollback.yml` workflow (see below) which re-points `:stable` to the
previous known-good build, marks the failed image as `:deprecated-*`,
and opens a tracking issue labeled `bug`.
### `rollback.yml` — reusable + manual rollback
Two entry points:
- **`workflow_call`** from `release.yml`'s `rollback-on-smoke-fail` job
(auto-rollback path above).
- **`workflow_dispatch`** for manual operator rollback when something
breaks post-deploy that the auto smoke-test missed.
**Manual rollback** — flip `:stable` back to a previous good build:
```bash
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-YYYY.MM.N
```
By default `target_image_tag` resolves by walking back through `stable-*`
git tags newest-first and picking the first that does NOT already carry a
`:deprecated-<stripped>` GHCR alias (i.e. wasn't previously auto-rolled-
back). That prevents cascading failures from re-pointing `:stable` at a
known-broken image. To force a specific target:
```bash
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-2026.05.531 \
-f target_image_tag=stable-2026.04.474
```
Notes:
- The workflow does NOT delete the failed git tag (CalVer immutability is
preserved) — only the GHCR `:stable` alias is re-pointed and the failed
image gains a `:deprecated-*` audit alias.
- Re-tag order is `:stable` recovery first, then `:deprecated-*` audit, so
a mid-step interruption leaves production healthy with at-worst missing
audit metadata.
- Concurrency: `cancel-in-progress: false` (overrides the caller workflow's
cancellation policy) so a subsequent push to `main` won't kill a
rollback mid-flight.
### `keboola-deploy.yml` — tag-triggered, explicit deploy only
Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
@ -222,6 +278,40 @@ Use this when the consumer (e.g. a customer dev VM) needs
pushes by other contributors. The infra repo pins
`image_tag = "keboola-deploy-latest"` on the relevant VM.
### `prune-dev-tags.yml` — weekly CalVer + GHCR housekeeping
Cron `0 4 * * 0` (Sundays 04:00 UTC) + `workflow_dispatch`. Prunes legacy
CalVer git tags (`dev-YYYY.MM.N`, `stable-YYYY.MM.N`) and the matching
GHCR image versions older than `KEEP_MONTHS` (default `1` → keep current
+ previous month). Floating aliases (`:stable`, `:dev`, `*-latest`) are
never matched: they are git-tagless, and the GHCR pass explicitly skips
any version that shares a manifest with a floating alias to avoid
collateral deletion of `:stable` after a rollback re-tag.
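
The retention arithmetic can be sketched as a standalone function.
`retention_window` is a hypothetical name used for illustration (the
script computes the same list inline from `date`); passing the year and
month as arguments makes the year wrap-around easy to check:

```bash
#!/usr/bin/env bash
# retention_window YEAR MONTH KEEP_MONTHS: print the YYYY.MM values that
# are kept, i.e. the given month plus KEEP_MONTHS previous months.
retention_window() {
  local year=$1 month=$((10#$2)) keep=$3 i y m
  for ((i = 0; i <= keep; i++)); do
    y=$year
    m=$((month - i))
    # Borrow from the year when walking back past January.
    while [ "$m" -lt 1 ]; do
      m=$((m + 12))
      y=$((y - 1))
    done
    printf '%04d.%02d\n' "$y" "$m"
  done
}

retention_window 2026 01 1
# prints:
# 2026.01
# 2025.12
```

Any tag whose `YYYY.MM` part is not in the printed list is a prune
candidate, so `KEEP_MONTHS=0` keeps only the current month.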
**Manual preview** (no deletions, lists what would be pruned):
```bash
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f dry_run=true
```
**Force a wider retention window** (keeps more months; manual dispatch defaults to `dry_run=true`, so add `-f dry_run=false` to actually delete):
```bash
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f keep_months=3
```
Scheduled (cron) runs always prune for real; `dry_run` is honored only on
manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE
failures and exits non-zero on any failure, so a refused remote push (tag-
protection rule, missing scope) or a GHCR API error turns the cron run
red instead of silently swallowing it. Local `git tag -d` is gated on
successful remote push, so a refused delete leaves the local tag in place
for retry on the next run.
### Module versioning
The customer-instance Terraform module under `infra/modules/customer-instance/`


@ -1,6 +1,6 @@
[project]
name = "agnes-the-ai-analyst"
version = "0.54.16"
version = "0.54.17"
description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14"
license = "MIT"

174
scripts/ops/prune-dev-tags.sh Executable file

@ -0,0 +1,174 @@
#!/usr/bin/env bash
# Prune legacy CalVer dev/stable image identity from git + GHCR:
#
# Git tags + GHCR image versions of the form
# dev-YYYY.MM.N e.g. dev-2026.04.475
# stable-YYYY.MM.N e.g. stable-2026.04.474
# accumulate one per CI build. Retention: KEEP_MONTHS (default 1) keeps
# the current month + the previous KEEP_MONTHS months; older tags +
# images are pruned.
#
# Dry-run via PRUNE_DRY_RUN=1 (or workflow input) — lists what would be
# pruned without acting.
#
# Idempotent: re-running with no eligible tags exits 0.
set -euo pipefail
KEEP_MONTHS="${KEEP_MONTHS:-1}"
[[ "$KEEP_MONTHS" =~ ^[0-9]+$ ]] || { echo "KEEP_MONTHS must be a non-negative integer (got: '$KEEP_MONTHS')"; exit 1; }
DRY_RUN="${PRUNE_DRY_RUN:-0}"
REPO="${GITHUB_REPOSITORY:?GITHUB_REPOSITORY env var must be set (e.g. keboola/agnes-the-ai-analyst)}"
cd "$(git rev-parse --show-toplevel)"
# Compute the set of YYYY.MM strings to KEEP — walk back KEEP_MONTHS+1
# months from today.
TODAY_YEAR=$(date +%Y)
TODAY_MONTH=$(date +%m)
TODAY_MONTH_NUM=$((10#$TODAY_MONTH)) # strip leading zero for arithmetic
KEEP_YYYY_MM=()
for i in $(seq 0 "$KEEP_MONTHS"); do
Y=$TODAY_YEAR
M=$((TODAY_MONTH_NUM - i))
while [ "$M" -lt 1 ]; do
M=$((M + 12))
Y=$((Y - 1))
done
KEEP_YYYY_MM+=("$(printf '%04d.%02d' "$Y" "$M")")
done
echo "Retention window (YYYY.MM): ${KEEP_YYYY_MM[*]}"
# Collect candidate tags — strictly `dev-YYYY.MM.N` / `stable-YYYY.MM.N`.
LEGACY_TAGS=$(git tag -l 'dev-*' 'stable-*' \
| grep -E '^(dev|stable)-[0-9]{4}\.[0-9]{2}\.[0-9]+$' \
|| true)
# Filter: keep tags whose YYYY.MM is in the keep window; prune the rest.
TO_PRUNE=()
if [ -n "$LEGACY_TAGS" ]; then
while IFS= read -r TAG; do
[ -z "$TAG" ] && continue
TAG_YM=$(echo "$TAG" | sed -E 's/^(dev|stable)-([0-9]{4}\.[0-9]{2})\.[0-9]+$/\2/')
KEEP=0
for KEEP_YM in "${KEEP_YYYY_MM[@]}"; do
if [ "$TAG_YM" = "$KEEP_YM" ]; then KEEP=1; break; fi
done
if [ "$KEEP" = "0" ]; then
TO_PRUNE+=("$TAG")
fi
done <<< "$LEGACY_TAGS"
fi
SECTION1_HAS_WORK=0
if [ -z "$LEGACY_TAGS" ]; then
echo "No legacy CalVer tags found — nothing to prune."
elif [ "${#TO_PRUNE[@]}" -eq 0 ]; then
echo "All legacy tags are within retention window — nothing to prune."
else
SECTION1_HAS_WORK=1
echo "Will prune ${#TO_PRUNE[@]} tags older than the retention window:"
# Array slice instead of `printf … | head` — under `set -o pipefail`,
# head closing the pipe early can SIGPIPE printf (exit 141) and abort
# the script before any deletion runs. The slice avoids the pipeline.
printf ' %s\n' "${TO_PRUNE[@]:0:20}"
[ "${#TO_PRUNE[@]}" -gt 20 ] && echo " ... (and $((${#TO_PRUNE[@]} - 20)) more)"
fi
if [ "$SECTION1_HAS_WORK" = "1" ] && [ "$DRY_RUN" = "1" ]; then
echo "(dry-run — no deletions)"
SECTION1_HAS_WORK=0
fi
# Track failures so the workflow run turns red even when individual
# operations were swallowed by `|| ...` fallbacks. Stdout warnings alone
# are invisible on a green run, so a hard exit-1 at the end is the only
# reliable signal to operators.
PRUNE_FAILED=0
if [ "$SECTION1_HAS_WORK" = "1" ]; then
# Fetch GHCR versions BEFORE any git-tag deletion — if the API call
# fails (403 missing scope, 429 rate limit, transient 5xx), we abort
# cleanly with no state change. Doing the irrecoverable git-tag delete
# first risked orphan GHCR images: the next run rebuilds TO_PRUNE from
# `git tag -l`, so without the local git tag the orphan image is never
# enumerated again.
TAG_TO_ID=""
if [ -n "${GH_TOKEN:-}" ]; then
ORG=$(echo "$REPO" | cut -d/ -f1)
PKG_NAME=$(echo "$REPO" | cut -d/ -f2)
echo "Fetching GHCR image versions for $ORG/$PKG_NAME ..."
# One paginated fetch up-front, then per-tag lookups against the
# cached result. Avoids O(N × pages) API calls on a multi-month
# backlog (legacy CalVer tag counts run ~500/month per channel).
# No `|| echo "[]"` fallback — let `set -e` propagate API failure
# rather than silently turning every TAG into a no-op skip.
VERSIONS_JSON=$(gh api \
"/orgs/${ORG}/packages/container/${PKG_NAME}/versions" \
--paginate)
# CRITICAL: GHCR's DELETE-version drops the entire manifest, taking
# EVERY tag on it (including `:stable`, `:dev`, `dev-<user>-latest`).
# After a rollback re-tag, the previous-known-good version carries
# both `:stable` and its CalVer tag — pruning that CalVer tag would
# vaporize `:stable`. So skip any version that also carries a
# floating alias. The jq filter applies that exclusion up-front.
TAG_TO_ID=$(echo "$VERSIONS_JSON" | jq -r '
.[]
| select(
(.metadata.container.tags | index("stable") // false | not) and
(.metadata.container.tags | index("dev") // false | not) and
((.metadata.container.tags | map(endswith("-latest")) | any) | not)
)
| . as $v
| .metadata.container.tags[] as $t
| "\($t)\t\($v.id)"
')
else
echo "GH_TOKEN not set — GHCR image deletion will be skipped (git tags will still be pruned below)."
fi
# Delete git tags. Local delete is gated on successful remote push —
# if the remote refuses (protected tag, missing contents:write,
# transient failure), leaving the local tag in place means the next
# run retries the same TAG cleanly. checkout@v6 re-fetches tags so a
# successful local-only delete would just come back anyway.
for TAG in "${TO_PRUNE[@]}"; do
echo " deleting tag: $TAG"
if git push origin --delete "$TAG"; then
git tag -d "$TAG" 2>/dev/null || true
else
echo " (remote push failed — leaving local tag in place for retry; check tag-protection rules or contents:write scope)"
PRUNE_FAILED=1
fi
done
# Delete GHCR image versions using the up-front fetch.
if [ -n "${GH_TOKEN:-}" ]; then
echo "Deleting matching GHCR image versions ..."
for TAG in "${TO_PRUNE[@]}"; do
VERSION_ID=$(echo "$TAG_TO_ID" | awk -v t="$TAG" '$1==t {print $2; exit}')
if [ -n "$VERSION_ID" ]; then
echo " deleting GHCR image $TAG (version $VERSION_ID)"
if ! gh api -X DELETE \
"/orgs/${ORG}/packages/container/${PKG_NAME}/versions/${VERSION_ID}"; then
echo " (DELETE failed — check packages:write scope, rate limits, or version already gone)"
PRUNE_FAILED=1
fi
else
echo " skipping GHCR image $TAG — no eligible version (already gone, or shares a manifest with :stable/:dev/*-latest)"
fi
done
fi
fi
if [ "$PRUNE_FAILED" = "1" ]; then
echo "::error::One or more prune operations failed — see warnings above"
exit 1
fi
echo "Prune complete."