agnes-the-ai-analyst/docs/RELEASING.md
ZdenekSrotyr 9f5adbce37
ci: consolidate release pipeline (salvageable subset of #139) (#314)
* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
2026-05-15 14:06:59 +02:00


# Releasing & deploying
The full release process for Agnes. CLAUDE.md carries the short version; this
doc is the operational reference. Read it linearly the first few times — once
internalized, the order matters less, but the **non-obvious gotchas never go
away**.
## Changelog discipline — non-negotiable
**Every PR that adds, removes, or changes user-visible behavior MUST update
`CHANGELOG.md` in the same PR.** No exceptions, no follow-ups, no "I'll do it
after merge". User-visible = anything an operator, end-user, or downstream
integrator can observe: CLI flags / output / exit codes, REST endpoints /
payloads / status codes, web UI, `instance.yaml` schema, env vars,
`extract.duckdb` contract, Docker / compose / Caddyfile knobs, default
behaviors, breaking changes, security fixes.
**How:**
- Add a bullet under the topmost `## [Unreleased]` heading (create one if
missing — it sits above the latest released version).
- Group by `### Added` / `### Changed` / `### Fixed` / `### Removed` /
`### Internal` (Keep-a-Changelog sections).
- Mark breaking changes with `**BREAKING**` at the start of the bullet —
operators grep for that string before bumping the pin.
- Reference the relevant doc/runbook if one exists (e.g.
`see docs/auth-groups.md`), don't restate it.
- Internal-only changes (refactors, test additions, dependency bumps without
behavior change) go under `### Internal` — still log them, just keep them
terse.
Reviewers should bounce PRs that touch user-visible behavior without a
changelog update — same way they'd bounce a PR with no test changes for new
logic.
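The rule above can be checked mechanically before review. The following is a minimal sketch, not part of the repo's tooling: `unreleased_bullets` and `check_changelog` are hypothetical helper names, and the sketch assumes bullets start with `- ` at the beginning of a line.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the bullets that sit under the topmost "## [Unreleased]" heading.
# Collection stops at the next "## [" heading (the latest released version).
unreleased_bullets() {
  local changelog="$1"
  awk '
    /^## \[Unreleased\]/ { in_section = 1; next }
    /^## \[/             { in_section = 0 }
    in_section && /^- /  { print }
  ' "$changelog"
}

# Hypothetical pre-merge check: fail when [Unreleased] carries no bullets.
check_changelog() {
  local changelog="$1"
  if [ -z "$(unreleased_bullets "$changelog")" ]; then
    echo "no bullets under [Unreleased] in ${changelog}" >&2
    return 1
  fi
}
```

Running `unreleased_bullets CHANGELOG.md | grep -F '**BREAKING**'` then shows the same breaking-change bullets an operator would grep for.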
## Release-cut belongs to the PR — non-negotiable
**The version bump + CHANGELOG rename + new empty `[Unreleased]` are the LAST
commit on the PR that earned the version. Never a standalone follow-up PR.**
When a PR lands the only `[Unreleased]` content (or is the last in a queue of
in-flight feature PRs), the release-cut MUST ship as part of the same merge.
Standalone release-cut PRs add review overhead, carry no behavior change of
their own, and pollute `git log` with bookkeeping commits separated from the
work that earned them.
**Mandatory checklist before approving / enabling auto-merge on ANY PR:**
1. **Stop.** Will this PR land alone in `[Unreleased]` (no other in-flight PRs
queued behind it)?
2. **If yes**, the release-cut is REQUIRED in the same PR before merge. BEFORE
   pushing the final commit:
   - Bump `pyproject.toml` to `X.Y.Z`
   - Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`, add a new empty
     `## [Unreleased]` on top
   - Either squash these into the consolidation commit OR add as a separate
     `release: X.Y.Z` commit on the same branch
3. **THEN** push, approve, enable auto-merge.
4. After auto-merge fires: tag `vX.Y.Z` against the merge commit + create a
GitHub Release. Done — one PR, one merge, one release.
**Failure mode to avoid:** enabling auto-merge on the feature PR thinking "I'll
add the release-cut after." Auto-merge fires faster than the second commit
lands. The window closes; the only fix is a standalone release-cut PR — exactly
what this rule prohibits.
**Acceptable standalone release-cut** (rare): only when `[Unreleased]`
accumulated bullets from MULTIPLE already-merged PRs AND no further
behavior-change PR is queued — i.e. the cut is the only outstanding work and
there's no PR to attach it to.
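The two file edits in the release-cut can be sketched as small filters. This is an illustration under assumptions, not the project's actual tooling: `cut_release` and `bump_pyproject` are hypothetical names, the version line in `pyproject.toml` is assumed to start with `version = "`, and both functions print the rewritten file to stdout instead of editing in place.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Rename the topmost "## [Unreleased]" heading to the released version and
# put a fresh, empty "## [Unreleased]" heading above it.
cut_release() {
  local changelog="$1" version="$2" date="$3"
  awk -v version="$version" -v date="$date" '
    !renamed && /^## \[Unreleased\]/ {
      print "## [Unreleased]"
      print ""
      print "## [" version "] — " date
      renamed = 1
      next
    }
    { print }
  ' "$changelog"
}

# Replace the first line that starts with: version = "
bump_pyproject() {
  local pyproject="$1" version="$2"
  awk -v version="$version" '
    !bumped && /^version = "/ {
      print "version = \"" version "\""
      bumped = 1
      next
    }
    { print }
  ' "$pyproject"
}
```

A typical use would be `cut_release CHANGELOG.md X.Y.Z YYYY-MM-DD > CHANGELOG.new && mv CHANGELOG.new CHANGELOG.md`, followed by the same pattern for `pyproject.toml`, then the `release: X.Y.Z` commit.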
## Release workflow — concrete recipe
### Happy path (8 steps)
```bash
# 1. Branch from a fresh checkout. iCloud Drive worktrees randomly hang
# on git operations — use a fresh shallow clone in /tmp instead.
cd /tmp && git clone --depth 50 --branch main \
https://github.com/keboola/agnes-the-ai-analyst.git agnes-<topic>
cd agnes-<topic> && git checkout -b zs/<branch-name>
# 2. Make the change + tests. Run the AREA pytest while iterating
# (e.g. `pytest tests/test_X.py -p no:xdist -q`).
# 3. Add a CHANGELOG bullet under [Unreleased].
# Group: Added | Changed | Fixed | Removed | Internal
# Mark BREAKING with **BREAKING** prefix.
# 4. Commit the change(s). Multiple logical commits OK; release-cut
# will be a SEPARATE last commit (next step). DO NOT bundle the
# release-cut into the same commit as the change — it pollutes
# the SHA that auto-close keywords reference and makes it hard to
# revert only the change.
# 5. Run the full pytest suite locally:
# `pytest tests/ -p no:xdist -q` (or `-n auto` if xdist works).
# Pre-existing fails (e.g. test_readers_in_pre_init_dir under
# subprocess timeout) are OK to ignore; verify by reverting your
# diff and reproducing on bare main.
# 6. Release-cut commit (LAST commit on the PR per the rule above):
# - Bump pyproject.toml: version = "X.Y.Z"
# - Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`
# - Add a fresh empty `## [Unreleased]` line above
# Commit message: `release: X.Y.Z — <one-line summary>`
# 7. Push branch + open PR + enable auto-merge SQUASH:
# git push -u origin HEAD
# gh pr create --repo keboola/agnes-the-ai-analyst \
# --head <branch> --title "<...>" --body "<...>"
# gh pr merge <N> --repo keboola/agnes-the-ai-analyst \
# --squash --auto --delete-branch
# 8. After auto-merge fires (poll or `Monitor`):
# git fetch origin --tags
# git tag vX.Y.Z <merge-sha>
# git push origin vX.Y.Z
# gh release create vX.Y.Z --repo keboola/agnes-the-ai-analyst \
# --title "vX.Y.Z — <...>" --notes "<copy-paste from CHANGELOG>"
```
### Picking the next version
`pyproject.toml`'s current `version` is the **next-release target** (post-cut
from the previous release). Pre-1.0 we patch-bump for everything that doesn't
break operator-facing APIs:
- `instance.yaml` schema additions, new env vars, new endpoints → patch (e.g.
0.54.3 → 0.54.4)
- New CLI subcommands, BREAKING removals, schema migrations → still patch within
the current 0.5x cycle (no minor bumps cut today)
- The CHANGELOG `**BREAKING**` marker is what operators grep for; the version
number is secondary
Always check `git tag -l "v0.X*"` before naming — if `v0.54.0` is already
tagged, the next one is `v0.54.1`, even if `pyproject.toml` still says `0.54.0`
from a stale post-cut commit (we've shipped that race before).
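The tag check above can be turned into a small helper. This is a sketch only: `next_patch_tag` is a hypothetical name, it assumes tags of the form `vMAJOR.MINOR.PATCH`, and it reads the tag list from stdin so the logic can be tried without a repository.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Read tags on stdin and print the next free patch tag for a series such as
# "0.54". Tags outside the series or with a non-numeric patch are ignored.
next_patch_tag() {
  local series="$1"
  awk -v prefix="v${series}." '
    BEGIN { max = -1 }
    index($0, prefix) == 1 {
      patch = substr($0, length(prefix) + 1)
      if (patch ~ /^[0-9]+$/ && patch + 0 > max) max = patch + 0
    }
    END { print prefix (max + 1) }
  '
}
```

In a checkout this would be fed from git, for example `git tag -l "v0.54*" | next_patch_tag 0.54`. The result comes from the tags alone, so a stale `pyproject.toml` version cannot mislead it.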
### Authoring expectations on the PR
- **Self-PRs** (you're both author and reviewer): GitHub forbids self-approval.
If branch protection required N approving reviews (it doesn't today;
`required_approving_review_count = 0`), you would need someone else to approve.
With our current 0-review setup, self-PRs can still merge automatically once
required CI passes.
- **Other people's PRs you're taking over**: dismiss any prior
CHANGES_REQUESTED reviews (yours or someone else's) before auto-merge can
fire. After pushing your fixes, run `gh pr review <N> --approve --body "..."`.
- **Devin Review**: not a required check today; runs in parallel and posts a
comment. Don't wait on it for merge unless the human reviewer explicitly asks.
### CI quirks you WILL hit
- **`gh pr checks` reports CANCELLED as `fail`.** When you force-push (rebase,
amend), GitHub auto-cancels the in-flight `Release` workflow run on the older
SHA. Those cancelled jobs stay listed as "fail" in the PR's check summary and
checks tab, even after newer runs succeed. **Look at the conclusion column, not
just the count.** Rule of thumb: if the same check name appears with both
`pass` and `fail` rows, the `fail` row is from an older auto-cancelled SHA.
Verify with `gh api repos/keboola/agnes-the-ai-analyst/commits/<sha>/check-runs`
— the raw API distinguishes `cancelled` from `failure` truthfully.
- **Branch protection's "strict" mode caches cancelled `test` as blocking** even
after newer `test` runs succeed. Symptom: `mergeable_state: blocked` despite
all required checks green on the latest SHA. Fix: re-run the cancelled
`Release` workflow run (`gh run rerun <run-id>`); once its `test` job lands as
success, the block clears. We've hit this on PRs #273, #281, #285, #286.
- **Required checks** (per branch protection): `test` + `docker-build` only.
Other workflows (`cli-wheel-clean-install`, `build-and-push`,
`Release`-pipeline, Devin Review) are advisory — green/red doesn't gate merge.
- **`enforce_admins: true`** in branch protection means `--admin` flag on
`gh pr merge` does NOT bypass. Don't try; just fix the underlying block.
- **`lint-workflows.yml` is advisory.** Triggered on changes to
`.github/workflows/**` or `scripts/ops/**.sh`. Runs `actionlint` on
workflow YAMLs + `shellcheck --severity=warning` on freestanding ops
scripts. The `actionlint` step has `continue-on-error: true` initially
(pre-existing inventory has info-level findings); flip to fail-fast
once the repo is actionlint-clean. The `shellcheck` step IS blocking at
warning+ severity — info/style findings ride through, real bugs break
CI.
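For the cancelled-versus-failed distinction in the first quirk above, the raw check-runs response can be reduced to the rows that matter. This is a minimal sketch: `fetch_check_runs` and `real_failures` are hypothetical helper names, and the fetch step assumes an authenticated `gh` CLI.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fetch the check runs of one commit as TSV: name, status, conclusion.
# Needs network access and an authenticated gh CLI.
fetch_check_runs() {
  local repo="$1" sha="$2"
  gh api "repos/${repo}/commits/${sha}/check-runs" \
    --jq '.check_runs[] | [.name, .status, (.conclusion // "none")] | @tsv'
}

# Read that TSV on stdin and print only the names of checks that really
# failed. Rows with conclusion "cancelled" are dropped.
real_failures() {
  awk -F '\t' '$3 == "failure" { print $1 }'
}
```

Usage would look like `fetch_check_runs keboola/agnes-the-ai-analyst <sha> | real_failures`. Empty output means every "fail" row for that SHA was a cancellation, not a failure.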
### Recovery when something derails
- **Force-pushed and lost auto-merge?** GitHub *usually* preserves auto-merge
across force-pushes for the same PR; if it cleared, just re-run
`gh pr merge <N> --squash --auto --delete-branch`.
- **Release-cut commit forgot to land?** That's the failure mode the
"Release-cut belongs to the PR" rule prevents. If it happens anyway: open a
follow-on PR with ONLY the release-cut commit, ship it, and write up why in
your post-mortem comment.
- **Wrong version number tagged?** `git tag -d vX.Y.Z && git push --delete
origin vX.Y.Z` then re-tag against the right SHA. Update the GitHub Release if
you already created it.
## Deploy workflows
Two separate release.yml-style workflows produce GHCR images. Pick the one that
matches what you're shipping.
### `release.yml` — auto-build on every push
Runs on **every** push to **every** branch.
- Push to `main` → `:stable`, `:stable-YYYY.MM.N` (CalVer).
- Push to non-main `<prefix>/<branch>` → `:dev`, `:dev-YYYY.MM.N`,
`:dev-<branch-slug>`, and (when prefix isn't a Git Flow convention)
`:dev-<prefix>-latest` alias.
VMs that pin to a floating tag (`:dev`, `:dev-<prefix>-latest`) auto-upgrade
within ~5 min via the cron in `agnes-auto-upgrade.sh`. Convenient for
per-developer dev VMs; **footgun for shared dev VMs** (last pusher wins,
regardless of who).
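The branch-to-tag mapping listed above can be sketched as a pure function. Several details here are assumptions and not taken from `release.yml`: the slug rule (lowercase, every other character replaced by `-`), the set of prefixes treated as Git Flow conventions, and the function name `image_tags_for_branch`. The CalVer value is passed in; the sketch does not compute it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the image tags a push to the given branch would publish.
image_tags_for_branch() {
  local branch="$1" calver="$2" slug prefix
  if [ "$branch" = "main" ]; then
    printf '%s\n' "stable" "stable-${calver}"
    return 0
  fi
  # ASSUMED slug rule: lowercase, non-alphanumerics become "-".
  slug="$(printf '%s' "$branch" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9' '-')"
  printf '%s\n' "dev" "dev-${calver}" "dev-${slug}"
  case "$branch" in
    */*) prefix="${branch%%/*}" ;;
    *)   return 0 ;;  # no prefix, so no per-prefix alias
  esac
  # ASSUMED list of Git Flow prefixes that get no floating alias.
  case "$prefix" in
    feature|bugfix|hotfix|release|support) ;;
    *) printf '%s\n' "dev-${prefix}-latest" ;;
  esac
}
```

For example, `image_tags_for_branch zs/my-fix YYYY.MM.N` lists the three `dev` tags plus `dev-zs-latest`, which is the floating alias a per-developer VM would pin to.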
**Auto-rollback on smoke failure.** On `main` pushes, after `:stable` is
published, the `smoke-test` job pulls the just-built image and runs
`scripts/ops/post-deploy-smoke-test.sh` inside a docker-compose stack. If
that job fails, the `rollback-on-smoke-fail` job calls the reusable
`rollback.yml` workflow (see below) which re-points `:stable` to the
previous known-good build, marks the failed image as `:deprecated-*`,
and opens a tracking issue labeled `bug`.
### `rollback.yml` — reusable + manual rollback
Two entry points:
- **`workflow_call`** from `release.yml`'s `rollback-on-smoke-fail` job
(auto-rollback path above).
- **`workflow_dispatch`** for manual operator rollback when something
breaks post-deploy that the auto smoke-test missed.
**Manual rollback** — flip `:stable` back to a previous good build:
```bash
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-YYYY.MM.N
```
By default `target_image_tag` resolves by walking back through `stable-*`
git tags newest-first and picking the first that does NOT already carry a
`:deprecated-<stripped>` GHCR alias (i.e. wasn't previously auto-rolled-back).
That prevents cascading failures from re-pointing `:stable` at a
known-broken image. To force a specific target:
```bash
gh workflow run rollback.yml \
--repo keboola/agnes-the-ai-analyst \
-f failed_image_tag=stable-2026.05.531 \
-f target_image_tag=stable-2026.04.474
```
Notes:
- The workflow does NOT delete the failed git tag (CalVer immutability is
preserved) — only the GHCR `:stable` alias is re-pointed and the failed
image gains a `:deprecated-*` audit alias.
- Re-tag order is `:stable` recovery first, then `:deprecated-*` audit, so
a mid-step interruption leaves production healthy with at-worst missing
audit metadata.
- Concurrency: `cancel-in-progress: false` (overrides the caller workflow's
cancellation policy) so a subsequent push to `main` won't kill a
rollback mid-flight.
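The walk-back target resolution described above can be sketched offline. This is not the code in `rollback.yml`: `resolve_rollback_target` and `has_deprecated_alias` are hypothetical names, the registry probe is stubbed with a variable (the real workflow would query GHCR), and `<stripped>` is assumed to mean the tag with its `stable-` prefix removed.

```bash
#!/usr/bin/env bash
set -euo pipefail

# STUB: space-separated list standing in for aliases that exist in GHCR.
DEPRECATED_ALIASES="${DEPRECATED_ALIASES:-}"

# True when the tag already carries a :deprecated-<stripped> alias.
has_deprecated_alias() {
  local stripped="${1#stable-}"
  case " ${DEPRECATED_ALIASES} " in
    *" deprecated-${stripped} "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read stable-* tags on stdin and print the newest one that is neither the
# failed tag nor already marked deprecated. Fails when none is eligible.
resolve_rollback_target() {
  local failed="$1" sorted tag
  sorted="$(sort -rV)"
  while IFS= read -r tag; do
    [ -n "$tag" ] || continue
    [ "$tag" = "$failed" ] && continue
    has_deprecated_alias "$tag" && continue
    printf '%s\n' "$tag"
    return 0
  done <<< "$sorted"
  echo "no eligible rollback target for ${failed}" >&2
  return 1
}
```

Failing loudly when no candidate remains matters: with a single `stable-*` tag, the only other option would be re-pointing `:stable` at the image that just failed.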
### `keboola-deploy.yml` — tag-triggered, explicit deploy only
Runs **only** on git tags matching `keboola-deploy-*`. Publishes:
- `:keboola-deploy-<git-tag-suffix>` — immutable, tied to the exact commit
- `:keboola-deploy-latest` — floating alias the consumer pins to
**Operator workflow:**
```bash
git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM
```
Use this when the consumer (e.g. a customer dev VM) needs
**deploy-when-I-decide** semantics — no surprise rollouts from upstream branch
pushes by other contributors. The infra repo pins
`image_tag = "keboola-deploy-latest"` on the relevant VM.
### `prune-dev-tags.yml` — weekly CalVer + GHCR housekeeping
Cron `0 4 * * 0` (Sundays 04:00 UTC) + `workflow_dispatch`. Prunes legacy
CalVer git tags (`dev-YYYY.MM.N`, `stable-YYYY.MM.N`) and the matching
GHCR image versions older than `KEEP_MONTHS` (default `1` → keep the current
and previous month). Floating aliases (`:stable`, `:dev`, `*-latest`) are
never matched: they are git-tagless, and the GHCR pass explicitly skips
any version that shares a manifest with a floating alias to avoid
collateral deletion of `:stable` after a rollback re-tag.
**Manual preview** (no deletions, lists what would be pruned):
```bash
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f dry_run=true
```
**Override the retention window** (one-off run; `keep_months=3` keeps more history than the default `1`):
```bash
gh workflow run prune-dev-tags.yml \
--repo keboola/agnes-the-ai-analyst \
-f keep_months=3
```
Scheduled (cron) runs always prune for real; `dry_run` is honored only on
manual dispatch. The script tracks per-tag remote-push / GHCR-DELETE
failures and exits non-zero on any failure, so a refused remote push
(tag-protection rule, missing scope) or a GHCR API error turns the cron run
red instead of silently swallowing it. Local `git tag -d` is gated on
successful remote push, so a refused delete leaves the local tag in place
for retry on the next run.
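The `KEEP_MONTHS` retention rule can be illustrated with a small predicate. This is a sketch of the stated semantics (default `1` keeps the current and previous month), not the logic in `prune-dev-tags.sh`: `is_prunable` is a hypothetical name, and the current month is passed in as `YYYY.MM` so the function stays deterministic.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeed when a CalVer tag (dev-YYYY.MM.N or stable-YYYY.MM.N) is older
# than the retention window. Anything else, floating aliases included,
# never matches the pattern and is therefore never prunable.
is_prunable() {
  local tag="$1" now_ym="$2" keep_months="${3:-1}"
  local re='^(dev|stable)-([0-9]{4})\.([0-9]{1,2})\.[0-9]+$'
  [[ "$tag" =~ $re ]] || return 1
  # 10# forces base 10 so months like "08" are not parsed as octal.
  local tag_idx=$(( 10#${BASH_REMATCH[2]} * 12 + 10#${BASH_REMATCH[3]} ))
  local now_idx=$(( 10#${now_ym%%.*} * 12 + 10#${now_ym##*.} ))
  (( now_idx - tag_idx > keep_months ))
}
```

Counting months as `year * 12 + month` keeps the comparison correct across a year boundary, for example a December tag evaluated in February.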
### Module versioning
The customer-instance Terraform module under `infra/modules/customer-instance/`
is published as `infra-vMAJOR.MINOR.PATCH` git tags (separate from app CalVer
tags). Bump on any module-API change; downstream infra repos pin to the tag in
their `source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y"`.
After merging a module change to `main`:
```bash
git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z
```
### Replacing a VM after a startup-script change
The module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on
`google_compute_instance.vm` so normal `terraform apply` doesn't churn running
VMs. To propagate a startup-script update, trigger the consumer's apply workflow
manually with the VM resource address — typical workflow_dispatch input is
`recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'`.
## Appendix: CHANGELOG entry skeleton
Copy this when adding to `## [Unreleased]` in `CHANGELOG.md`. Drop the sections
you don't need; keep the Keep-a-Changelog order.
```markdown
### Added
- New feature description.
### Changed
- Change description. **BREAKING** prefix + migration steps if operator-facing.
### Fixed
- Bug fix description.
### Removed
- **BREAKING** removed feature — what replaces it.
### Internal
- Refactors, test additions, dependency bumps with no behavior change.
```
At release-cut time `## [Unreleased]` is renamed to `## [X.Y.Z] — YYYY-MM-DD`
and a fresh empty `## [Unreleased]` is added on top. CI publishes the matching
`stable-YYYY.MM.N` image tag for the merge commit (see Deploy workflows above).