* ci: add actionlint workflow lint, drop superseded deploy.yml stub
* ci: extract rollback into reusable rollback.yml, wire into release smoke-test
* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup
* release: 0.54.17 — CI/release workflow consolidation
* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag
* fix(ci): rollback.yml + prune-dev-tags.sh review findings
rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
through env: instead of textual ${{ }} splicing into bash run blocks
— prevents an actor with workflow_dispatch privilege from injecting
shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
(fresh repo, or aggressive pruning at month boundary). Fail loudly
rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
inherited across workflow_call, so on-call no longer has to navigate
rollback run → caller-workflow breadcrumb → failing commit.
prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
the pipe early SIGPIPEs printf (exit 141) and aborts the script
before any deletion runs — exactly the multi-month-backlog scenario
the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
build a tag→version-id map up-front. Closes two problems:
1. O(N × pages) GHCR API calls collapse to one paginated listing
— months of accumulated CalVer tags no longer risk tripping
abuse detection.
2. The new jq filter excludes any version that ALSO carries a
floating alias (:stable, :dev, *-latest). GHCR DELETE-version
drops the entire manifest, so pruning a CalVer tag that shares
a manifest with :stable (e.g. after a rollback re-tag) would
have vaporized :stable. Now it's skipped with a log line.
lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
.github/workflows/ and the shell embedded in their run: blocks, so
freestanding scripts/ops/*.sh (which are in the workflow's path
filter) were never actually validated despite triggering CI.
* fix(ci): shellcheck --severity=warning to skip pre-existing info findings
The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.
* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation
rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
cancel-in-progress: false). The caller release.yml uses
cancel-in-progress: true to avoid duplicate CalVer claims, but a
second push to main mid-rollback would otherwise kill the workflow
between the :stable recovery push and the :deprecated-* audit push,
leaving :stable stuck on the broken image. A reusable workflow's own
concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
:deprecated-<stripped> GHCR alias already exists (carries the mark of
a prior failed rollback). The previous 'second-most-recent' heuristic
could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
:deprecated-* audit tag. Defense in depth — even if the concurrency
block somehow misfires, the worst case is missing audit metadata
rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
(workflow_dispatch grants directly; workflow_call acts as a cap
intersected with the caller's job-level permissions).
release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
was factually wrong — GITHUB_TOKEN's default for issues is never write
— and read as 'this line just documents a default', so a future
cleanup PR could delete it. The line is load-bearing: workflow_call
permissions are bounded by the caller's GITHUB_TOKEN scope, and
removing it would silently 403 rollback.yml's gh issue create step.
prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
fallback turned every API failure (403 missing scope, 429 rate limit,
transient 5xx) into a silent no-op with exit 0 — operators saw a
green run while every TAG fell through to the same 'no eligible
version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
-l', so an orphan GHCR image is never enumerated again). Fetching
first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
unconditional — local 'git tag -d' is gated on successful remote
push, so a refused remote delete (tag-protection rule, missing
contents:write) leaves the local tag in place for retry. The flag
propagates to a final 'exit 1' so the cron run turns red on any
push or DELETE failure.
lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name "*.sh"' to
match the workflow's recursive 'scripts/ops/**.sh' path filter. The
previous bare 'scripts/ops/*.sh' glob only matched top-level files;
a future script under a subdirectory would have triggered the
workflow but never been linted.
* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml
Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
points, walk-back target resolution, immutability + concurrency
guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
semantics, floating-alias safety, dry_run preview, failure-propagation
exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
shellcheck (--severity=warning blocking) advisory checks
CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
Releasing & deploying
The full release process for Agnes. CLAUDE.md carries the short version; this doc is the operational reference. Read it linearly the first few times — once internalized, the order matters less, but the non-obvious gotchas never go away.
Changelog discipline — non-negotiable
Every PR that adds, removes, or changes user-visible behavior MUST update
CHANGELOG.md in the same PR. No exceptions, no follow-ups, no "I'll do it
after merge". User-visible = anything an operator, end-user, or downstream
integrator can observe: CLI flags / output / exit codes, REST endpoints /
payloads / status codes, web UI, instance.yaml schema, env vars,
extract.duckdb contract, Docker / compose / Caddyfile knobs, default
behaviors, breaking changes, security fixes.
How:
- Add a bullet under the topmost ## [Unreleased] heading (create one if missing; it sits above the latest released version).
- Group by ### Added / ### Changed / ### Fixed / ### Removed / ### Internal (Keep-a-Changelog sections).
- Mark breaking changes with **BREAKING** at the start of the bullet; operators grep for that string before bumping the pin.
- Reference the relevant doc/runbook if one exists (e.g. see docs/auth-groups.md), don't restate it.
- Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under ### Internal. Still log them, just keep them terse.
Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.
Release-cut belongs to the PR — non-negotiable
The version bump + CHANGELOG rename + new empty [Unreleased] are the LAST
commit on the PR that earned the version. Never a standalone follow-up PR.
When a PR lands the only [Unreleased] content (or is the last in a queue of
in-flight feature PRs), the release-cut MUST ship as part of the same merge.
Standalone release-cut PRs add review-overhead PRs to history with no behavior
change of their own and pollute git log with bookkeeping commits separated
from the work that earned them.
Mandatory checklist before approving / enabling auto-merge on ANY PR:
- Stop. Will this PR land alone in [Unreleased] (no other in-flight PRs queued behind it)?
- If yes, the release-cut is REQUIRED in the same PR before merge. BEFORE pushing the final commit:
  - Bump pyproject.toml to X.Y.Z
  - Rename ## [Unreleased] → ## [X.Y.Z] — YYYY-MM-DD, add a new empty ## [Unreleased] on top
  - Either squash these into the consolidation commit OR add as a separate release: X.Y.Z commit on the same branch
- THEN push, approve, enable auto-merge.
- After auto-merge fires: tag vX.Y.Z against the merge commit + create a GitHub Release. Done: one PR, one merge, one release.
Failure mode to avoid: enabling auto-merge on the feature PR thinking "I'll add the release-cut after." Auto-merge fires faster than the second commit lands. The window closes; the only fix is a standalone release-cut PR — exactly what this rule prohibits.
Acceptable standalone release-cut (rare): only when [Unreleased]
accumulated bullets from MULTIPLE already-merged PRs AND no further
behavior-change PR is queued — i.e. the cut is the only outstanding work and
there's no PR to attach it to.
Release workflow — concrete recipe
Happy path (8 steps)
# 1. Branch from a fresh checkout. iCloud Drive worktrees randomly hang
# on git operations — use a fresh shallow clone in /tmp instead.
cd /tmp && git clone --depth 50 --branch main \
https://github.com/keboola/agnes-the-ai-analyst.git agnes-<topic>
cd agnes-<topic> && git checkout -b zs/<branch-name>
# 2. Make the change + tests. Run the AREA pytest while iterating
# (e.g. `pytest tests/test_X.py -p no:xdist -q`).
# 3. Add a CHANGELOG bullet under [Unreleased].
# Group: Added | Changed | Fixed | Removed | Internal
# Mark BREAKING with **BREAKING** prefix.
# 4. Commit the change(s). Multiple logical commits OK; release-cut
# will be a SEPARATE last commit (next step). DO NOT bundle the
# release-cut into the same commit as the change — it pollutes
# the SHA that auto-close keywords reference and makes revert
# targeted at the change-only difficult.
# 5. Run the full pytest suite locally:
# `pytest tests/ -p no:xdist -q` (or `-n auto` if xdist works).
# Pre-existing fails (e.g. test_readers_in_pre_init_dir under
# subprocess timeout) are OK to ignore; verify by reverting your
# diff and reproducing on bare main.
# 6. Release-cut commit (LAST commit on the PR per the rule above):
# - Bump pyproject.toml: version = "X.Y.Z"
# - Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`
# - Add a fresh empty `## [Unreleased]` line above
# Commit message: `release: X.Y.Z — <one-line summary>`
# 7. Push branch + open PR + enable auto-merge SQUASH:
# git push -u origin HEAD
# gh pr create --repo keboola/agnes-the-ai-analyst \
# --head <branch> --title "<...>" --body "<...>"
# gh pr merge <N> --repo keboola/agnes-the-ai-analyst \
# --squash --auto --delete-branch
# 8. After auto-merge fires (poll or `Monitor`):
# git fetch origin --tags
# git tag vX.Y.Z <merge-sha>
# git push origin vX.Y.Z
# gh release create vX.Y.Z --repo keboola/agnes-the-ai-analyst \
# --title "vX.Y.Z — <...>" --notes "<copy-paste from CHANGELOG>"
Picking the next version
pyproject.toml's current version is the next-release target (post-cut
from the previous release). Pre-1.0 we patch-bump for everything that doesn't
break operator-facing APIs:
- instance.yaml schema additions, new env vars, new endpoints → patch (e.g. 0.54.3 → 0.54.4)
- New CLI subcommands, BREAKING removals, schema migrations → still patch within the current 0.5x cycle (no minor bumps cut today)
- The CHANGELOG **BREAKING** marker is what operators grep for; the version number is secondary
Always check git tag -l "v0.X*" before naming — if v0.54.0 is already
tagged, the next one is v0.54.1, even if pyproject.toml still says 0.54.0
from a stale post-cut commit (we've shipped that race before).
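The tag check above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the project's tooling: next_patch_tag is a hypothetical function name, and it assumes tags shaped like vMAJOR.MINOR.PATCH arriving on stdin, one per line, as git tag -l prints them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# next_patch_tag PREFIX (hypothetical helper)
# Reads existing tag names on stdin and prints the next free patch tag
# under PREFIX (e.g. PREFIX=v0.54 -> v0.54.N).
next_patch_tag() {
  local prefix="$1" tag patch max=-1
  while IFS= read -r tag; do
    # Only consider tags that start with "PREFIX."
    case "$tag" in
      "$prefix".*) patch="${tag#"$prefix".}" ;;
      *) continue ;;
    esac
    # Ignore suffixes that are not a plain non-negative integer.
    case "$patch" in
      ''|*[!0-9]*) continue ;;
    esac
    # 10# forces base 10 so a suffix like "08" is not read as octal.
    if [ "$((10#$patch))" -gt "$max" ]; then
      max="$((10#$patch))"
    fi
  done
  printf '%s.%d\n' "$prefix" "$((max + 1))"
}
```

A possible invocation, assuming the current cycle is 0.54: git tag -l "v0.54.*" | next_patch_tag v0.54. The result is derived from the tags that actually exist rather than from pyproject.toml, which is the point of the check.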
Authoring expectations on the PR
- Self-PRs (you're both author and reviewer): GitHub forbids self-approve. If branch protection requires N approving reviews (we don't today: required_approving_review_count = 0), you need someone else to approve. With our current 0-review setup, self-PRs can still merge automatically once required CI passes.
- Other people's PRs you're taking over: dismiss any prior CHANGES_REQUESTED reviews (yours or someone else's) before auto-merge can fire. gh pr review <N> --approve --body "..." after pushing your fixes.
- Devin Review: not a required check today; runs in parallel and posts a comment. Don't wait on it for merge unless the human reviewer explicitly asks.
CI quirks you WILL hit
- gh pr checks glosses CANCELLED as fail. When you force-push (rebase, amend), GitHub auto-cancels the in-flight Release workflow run on the older SHA. Those cancelled jobs show up as "fail" in the PR's check summary and tab forever, even after newer runs succeed. Look at the conclusion column, not just the count. Rule of thumb: if the same check name appears with both pass and fail rows, the fail row is from an older auto-cancelled SHA. Verify with gh api repos/keboola/agnes-the-ai-analyst/commits/<sha>/check-runs; the raw API distinguishes cancelled from failure truthfully.
- Branch protection's "strict" mode caches cancelled test as blocking even after newer test runs succeed. Symptom: mergeable_state: blocked despite all required checks green on the latest SHA. Fix: re-run the cancelled Release workflow run (gh run rerun <run-id>); once its test job lands as success, the block clears. We've hit this on PRs #273, #281, #285, #286.
- Required checks (per branch protection): test + docker-build only. Other workflows (cli-wheel-clean-install, build-and-push, Release-pipeline, Devin Review) are advisory; green/red doesn't gate merge.
- enforce_admins: true in branch protection means --admin flag on gh pr merge does NOT bypass. Don't try; just fix the underlying block.
- lint-workflows.yml is advisory. Triggered on changes to .github/workflows/** or scripts/ops/**.sh. Runs actionlint on workflow YAMLs + shellcheck --severity=warning on freestanding ops scripts. The actionlint step has continue-on-error: true initially (pre-existing inventory has info-level findings); flip to fail-fast once the repo is actionlint-clean. The shellcheck step IS blocking at warning+ severity: info/style findings ride through, real bugs break CI.
Recovery when something derails
- Force-pushed and lost auto-merge? GitHub usually preserves auto-merge across force-pushes for the same PR; if it cleared, just re-run gh pr merge <N> --squash --auto --delete-branch.
- Release-cut commit forgot to land? That's the failure mode the "Release-cut belongs to the PR" rule prevents. If it happens anyway: open a follow-on PR with ONLY the release-cut commit, ship it, and write up why in your post-mortem comment.
- Wrong version number tagged? git tag -d vX.Y.Z && git push --delete origin vX.Y.Z then re-tag against the right SHA. Update the GitHub Release if you already created it.
Deploy workflows
Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.
release.yml — auto-build on every push
Runs on every push to every branch.
- Push to main → :stable, :stable-YYYY.MM.N (CalVer).
- Push to non-main <prefix>/<branch> → :dev, :dev-YYYY.MM.N, :dev-<branch-slug>, and (when prefix isn't a Git Flow convention) :dev-<prefix>-latest alias.
VMs that pin to a floating tag (:dev, :dev-<prefix>-latest) auto-upgrade
within ~5 min via the cron in agnes-auto-upgrade.sh. Convenient for
per-developer dev VMs; footgun for shared dev VMs (last pusher wins,
regardless of who).
Auto-rollback on smoke failure. On main pushes, after :stable is
published, the smoke-test job pulls the just-built image and runs
scripts/ops/post-deploy-smoke-test.sh inside a docker-compose stack. If
that job fails, the rollback-on-smoke-fail job calls the reusable
rollback.yml workflow (see below) which re-points :stable to the
previous known-good build, marks the failed image as :deprecated-*,
and opens a tracking issue labeled bug.
rollback.yml — reusable + manual rollback
Two entry points:
- workflow_call from release.yml's rollback-on-smoke-fail job (auto-rollback path above).
- workflow_dispatch for manual operator rollback when something breaks post-deploy that the auto smoke-test missed.
Manual rollback — flip :stable back to a previous good build:
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-YYYY.MM.N
By default target_image_tag resolves by walking back through stable-*
git tags newest-first and picking the first that does NOT already carry a
:deprecated-<stripped> GHCR alias (i.e. wasn't previously auto-rolled-
back). That prevents cascading failures from re-pointing :stable at a
known-broken image. To force a specific target:
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-2026.05.531 \
-f target_image_tag=stable-2026.04.474
Notes:
- The workflow does NOT delete the failed git tag (CalVer immutability is preserved); only the GHCR :stable alias is re-pointed and the failed image gains a :deprecated-* audit alias.
- Re-tag order is :stable recovery first, then :deprecated-* audit, so a mid-step interruption leaves production healthy with at-worst missing audit metadata.
- Concurrency: cancel-in-progress: false (overrides the caller workflow's cancellation policy) so a subsequent push to main won't kill a rollback mid-flight.
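The walk-back target resolution described in this subsection can be sketched roughly as follows. This is an illustration under assumptions, not the workflow's actual step: resolve_target and is_deprecated are hypothetical names, the registry probe is replaced by an offline list (the commit notes say the real workflow probes with docker manifest inspect), and "stripped" is assumed to mean the tag with its stable- prefix removed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical offline stand-in for the registry probe: a space-separated
# list of :deprecated-* aliases that already exist.
DEPRECATED_TAGS="${DEPRECATED_TAGS:-}"

# is_deprecated STRIPPED: succeeds if deprecated-STRIPPED is in the list.
is_deprecated() {
  local stripped="$1" alias
  for alias in $DEPRECATED_TAGS; do
    if [ "$alias" = "deprecated-$stripped" ]; then
      return 0
    fi
  done
  return 1
}

# resolve_target FAILED_TAG: reads stable-* tags on stdin, newest first,
# and prints the first tag that is neither the failed tag itself nor
# already marked deprecated. Fails loudly if nothing qualifies.
resolve_target() {
  local failed="$1" tag
  while IFS= read -r tag; do
    if [ "$tag" = "$failed" ]; then
      continue
    fi
    if is_deprecated "${tag#stable-}"; then
      continue
    fi
    printf '%s\n' "$tag"
    return 0
  done
  echo "no eligible rollback target found" >&2
  return 1
}
```

Returning non-zero when no candidate survives mirrors the "fail loudly rather than re-push the broken image" guard from the commit notes: a run with a single stable-* tag stops instead of re-pointing :stable at the failed image.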
keboola-deploy.yml — tag-triggered, explicit deploy only
Runs only on git tags matching keboola-deploy-*. Publishes:
- :keboola-deploy-<git-tag-suffix> (immutable, tied to the exact commit)
- :keboola-deploy-latest (floating alias the consumer pins to)
Operator workflow:
git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM
Use this when the consumer (e.g. a customer dev VM) needs
deploy-when-I-decide semantics — no surprise rollouts from upstream branch
pushes by other contributors. The infra repo pins
image_tag = "keboola-deploy-latest" on the relevant VM.
prune-dev-tags.yml — weekly CalVer + GHCR housekeeping
Cron 0 4 * * 0 (Sundays 04:00 UTC) + workflow_dispatch. Prunes legacy
CalVer git tags (dev-YYYY.MM.N, stable-YYYY.MM.N) and the matching
GHCR image versions older than KEEP_MONTHS (default 1 → keep current + previous month). Floating aliases (:stable, :dev, *-latest) are never matched: they are git-tagless, and the GHCR pass explicitly skips any version that shares a manifest with a floating alias to avoid collateral deletion of :stable after a rollback re-tag.
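To make the retention rule concrete, here is a minimal sketch of how a tag could be classified. It is an assumption-laden illustration, not the script's real logic: is_prunable is a hypothetical name, the current month is passed in as YYYY.MM, and the boundary is taken from the "default 1 keeps current plus previous month" wording.

```shell
#!/usr/bin/env bash
set -euo pipefail

# is_prunable TAG NOW KEEP_MONTHS (hypothetical helper)
# TAG:  a CalVer tag such as dev-2026.03.12 or stable-2026.04.474
# NOW:  the current month as YYYY.MM
# Succeeds when TAG's month is older than the retention window.
# Anything that is not a CalVer tag (e.g. a floating alias) never matches.
is_prunable() {
  local tag="$1" now="$2" keep="$3" tag_idx now_idx
  [[ "$tag" =~ ^(dev|stable)-([0-9]{4})\.([0-9]{2})\.[0-9]+$ ]] || return 1
  # Convert year and month to a single month counter; 10# avoids octal
  # parsing of values like "08".
  tag_idx=$(( 10#${BASH_REMATCH[2]} * 12 + 10#${BASH_REMATCH[3]} ))
  now_idx=$(( 10#${now%.*} * 12 + 10#${now#*.} ))
  (( tag_idx < now_idx - keep ))
}
```

With KEEP_MONTHS=1 and a current month of 2026.05, a dev-2026.03.N tag falls outside the window while stable-2026.04.N is kept. Counting months rather than comparing strings keeps the rule correct across a year boundary.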
Manual preview (no deletions, lists what would be pruned):
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f dry_run=true
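The commit notes for this release explain that the preview listing uses an array slice instead of piping printf into head, because under set -o pipefail the early pipe close can abort the script with SIGPIPE. A minimal sketch of that pattern follows; preview_tags is a hypothetical name and the output layout is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# preview_tags LIMIT TAG... (hypothetical helper)
# Prints at most LIMIT tags, then a summary line if more remain.
preview_tags() {
  local limit="$1"
  shift
  local -a to_prune=("$@")
  local total="${#to_prune[@]}"
  if [ "$total" -gt 0 ]; then
    # Array slice instead of 'printf ... | head': there is no pipe,
    # so nothing can be killed by SIGPIPE under pipefail.
    printf '  %s\n' "${to_prune[@]:0:limit}"
  fi
  if [ "$total" -gt "$limit" ]; then
    printf '  ... and %d more\n' "$((total - limit))"
  fi
}
```

Called as preview_tags 20 "${TO_PRUNE[@]}", this prints the first 20 entries regardless of how long the backlog is.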
Use a wider retention window for a one-off run (a larger keep_months keeps more months, so it prunes less than the default):
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f keep_months=3
Scheduled (cron) runs always prune for real; dry_run is honored only on
manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE
failures and exits non-zero on any failure, so a refused remote push (tag-
protection rule, missing scope) or a GHCR API error turns the cron run
red instead of silently swallowing it. Local git tag -d is gated on
successful remote push, so a refused delete leaves the local tag in place
for retry on the next run.
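The failure-propagation behavior above can be sketched as follows. This is a simplified illustration, not the script itself: remote_delete and local_delete are hypothetical stand-ins for git push --delete and git tag -d so the sketch runs offline, and REFUSED_TAG simulates one refused remote delete. Only the PRUNE_FAILED flag name comes from the commit notes.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical stand-ins. A real script would call
# 'git push --delete origin TAG' and 'git tag -d TAG' here.
REFUSED_TAG="${REFUSED_TAG:-}"
remote_delete() { [ "$1" != "$REFUSED_TAG" ]; }
local_delete()  { DELETED_LOCAL+=("$1"); }

# prune_tags TAG...: deletes each tag remotely, then locally.
# The local delete is gated on the remote delete succeeding, and any
# failure is remembered so the whole run ends non-zero.
prune_tags() {
  local tag
  PRUNE_FAILED=0
  DELETED_LOCAL=()
  for tag in "$@"; do
    if remote_delete "$tag"; then
      local_delete "$tag"
    else
      echo "remote delete refused for $tag; keeping local tag for retry" >&2
      PRUNE_FAILED=1
    fi
  done
  return "$PRUNE_FAILED"
}
```

The loop keeps going after a failure so one protected tag does not block the rest of the backlog, while the final status still turns the run red.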
Module versioning
The customer-instance Terraform module under infra/modules/customer-instance/
is published as infra-vMAJOR.MINOR.PATCH git tags (separate from app CalVer
tags). Bump on any module-API change; downstream infra repos pin to the tag in
their source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y".
After merging a module change to main:
git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z
Replacing a VM after a startup-script change
The module sets lifecycle { ignore_changes = [metadata_startup_script] } on
google_compute_instance.vm so a normal terraform apply doesn't churn running
VMs. To propagate a startup-script update, trigger the consumer's apply workflow
manually with the VM resource address — typical workflow_dispatch input is
recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'.
Appendix: CHANGELOG entry skeleton
Copy this when adding to ## [Unreleased] in CHANGELOG.md. Drop the sections
you don't need; keep the Keep-a-Changelog order.
### Added
- New feature description.
### Changed
- Change description. **BREAKING** prefix + migration steps if operator-facing.
### Fixed
- Bug fix description.
### Removed
- **BREAKING** removed feature — what replaces it.
### Internal
- Refactors, test additions, dependency bumps with no behavior change.
At release-cut time ## [Unreleased] is renamed to ## [X.Y.Z] — YYYY-MM-DD
and a fresh empty ## [Unreleased] is added on top. CI publishes the matching
stable-YYYY.MM.N image tag for the merge commit (see Deploy workflows above).
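The rename step can be sketched as a small helper. This is a hedged sketch rather than project tooling: cut_release is a hypothetical name, the new heading is passed in verbatim by the caller in the format described above, and the file is assumed to contain a line that is exactly ## [Unreleased].

```shell
#!/usr/bin/env bash
set -euo pipefail

# cut_release FILE HEADING (hypothetical helper)
# Keeps a fresh empty "## [Unreleased]" on top and inserts HEADING right
# below it, so the existing bullets now sit under the released version.
cut_release() {
  local file="$1" heading="$2" tmp
  if ! grep -q '^## \[Unreleased\]$' "$file"; then
    echo "no [Unreleased] heading in $file" >&2
    return 1
  fi
  tmp="$(mktemp)"
  # Only the first matching line is rewritten; everything else is copied.
  awk -v h="$heading" '
    !seen && $0 == "## [Unreleased]" { print; print ""; print h; seen = 1; next }
    { print }
  ' "$file" > "$tmp"
  mv "$tmp" "$file"
}
```

The version bump in pyproject.toml is left out on purpose; this only covers the CHANGELOG half of the release-cut commit.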