agnes-the-ai-analyst/docs/RELEASING.md
ZdenekSrotyr a48524509a
docs: consolidate and de-clutter the documentation tree (#306)
CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.
2026-05-14 18:54:22 +00:00

Releasing & deploying

The full release process for Agnes. CLAUDE.md carries the short version; this doc is the operational reference. Read it linearly the first few times — once internalized, the order matters less, but the non-obvious gotchas never go away.

Changelog discipline — non-negotiable

Every PR that adds, removes, or changes user-visible behavior MUST update CHANGELOG.md in the same PR. No exceptions, no follow-ups, no "I'll do it after merge". User-visible = anything an operator, end-user, or downstream integrator can observe: CLI flags / output / exit codes, REST endpoints / payloads / status codes, web UI, instance.yaml schema, env vars, extract.duckdb contract, Docker / compose / Caddyfile knobs, default behaviors, breaking changes, security fixes.

How:

  • Add a bullet under the topmost ## [Unreleased] heading (create one if missing — it sits above the latest released version).
  • Group by ### Added / ### Changed / ### Fixed / ### Removed / ### Internal (Keep-a-Changelog sections).
  • Mark breaking changes with **BREAKING** at the start of the bullet — operators grep for that string before bumping the pin.
  • Reference the relevant doc/runbook if one exists (e.g. see docs/auth-groups.md), don't restate it.
  • Internal-only changes (refactors, test additions, dependency bumps without behavior change) go under ### Internal — still log them, just keep them terse.

Reviewers should bounce PRs that touch user-visible behavior without a changelog update — same way they'd bounce a PR with no test changes for new logic.
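
As an illustration of the "operators grep for that string" convention, here is a minimal sketch that lists the **BREAKING** bullets under the topmost ## [Unreleased] heading. The function names and the sample changelog text are hypothetical, and the sketch assumes bullets start with "- " and release headings start with "## ", as in the appendix skeleton.

```python
def unreleased_bullets(changelog: str) -> list[str]:
    """Return the bullet texts under the topmost '## [Unreleased]' heading."""
    bullets = []
    in_section = False
    for line in changelog.splitlines():
        if line.startswith("## "):
            # Any level-2 heading after [Unreleased] closes the section.
            if in_section:
                break
            in_section = line.strip() == "## [Unreleased]"
            continue
        if in_section and line.lstrip().startswith("- "):
            bullets.append(line.strip()[2:])
    return bullets


def breaking_bullets(changelog: str) -> list[str]:
    """Keep only bullets that start with the **BREAKING** marker."""
    return [b for b in unreleased_bullets(changelog) if b.startswith("**BREAKING**")]


# Hypothetical changelog content, for illustration only.
SAMPLE = """# Changelog

## [Unreleased]

### Added
- New example flag.

### Removed
- **BREAKING** removed the example endpoint.

## [0.54.3] — 2026-05-01

### Fixed
- **BREAKING** older entry that is already released.
"""

if __name__ == "__main__":
    for bullet in breaking_bullets(SAMPLE):
        print(bullet)
```

A reviewer or operator could run something like this against CHANGELOG.md before bumping a pin. A plain grep for **BREAKING** works just as well when the whole file is of interest.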

Release-cut belongs to the PR — non-negotiable

The version bump + CHANGELOG rename + new empty [Unreleased] are the LAST commit on the PR that earned the version. Never a standalone follow-up PR.

When a PR lands the only [Unreleased] content (or is the last in a queue of in-flight feature PRs), the release-cut MUST ship as part of the same merge. Standalone release-cut PRs add review overhead with no behavior change of their own, and they pollute git log with bookkeeping commits separated from the work that earned them.

Mandatory checklist before approving / enabling auto-merge on ANY PR:

  1. Stop. Will this PR land alone in [Unreleased] (no other in-flight PRs queued behind it)?
  2. If yes, the release-cut is REQUIRED in the same PR before merge. BEFORE pushing the final commit:
    • Bump pyproject.toml to X.Y.Z
    • Rename ## [Unreleased] → ## [X.Y.Z] — YYYY-MM-DD, add a new empty ## [Unreleased] on top
    • Either squash these into the consolidation commit OR add as a separate release: X.Y.Z commit on the same branch
  3. THEN push, approve, enable auto-merge.
  4. After auto-merge fires: tag vX.Y.Z against the merge commit + create a GitHub Release. Done — one PR, one merge, one release.

Failure mode to avoid: enabling auto-merge on the feature PR thinking "I'll add the release-cut after." Auto-merge fires faster than the second commit lands. The window closes; the only fix is a standalone release-cut PR — exactly what this rule prohibits.

Acceptable standalone release-cut (rare): only when [Unreleased] accumulated bullets from MULTIPLE already-merged PRs AND no further behavior-change PR is queued — i.e. the cut is the only outstanding work and there's no PR to attach it to.
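
The two mechanical edits in the checklist (bump pyproject.toml, rename the [Unreleased] heading and add a fresh empty one on top) can be sketched as pure text transforms. This is a minimal sketch, not the project's tooling: the function names are hypothetical, and it assumes pyproject.toml has a single top-of-line version = "..." entry and that the changelog heading is exactly ## [Unreleased].

```python
import re
from datetime import date


def cut_changelog(changelog: str, version: str, today: date) -> str:
    """Rename '## [Unreleased]' to a dated release heading and
    put a fresh empty '## [Unreleased]' above it."""
    heading = "## [Unreleased]"
    lines = changelog.splitlines()
    for idx, line in enumerate(lines):
        if line.strip() == heading:
            break
    else:
        raise ValueError("no '## [Unreleased]' heading found")
    # The separator matches the heading format used in this document.
    released = f"## [{version}] — {today.isoformat()}"
    lines[idx:idx + 1] = [heading, "", released]
    return "\n".join(lines) + "\n"


def bump_pyproject(pyproject: str, version: str) -> str:
    """Replace the first line-leading version = "..." entry (assumed layout)."""
    new_text, count = re.subn(
        r'(?m)^version\s*=\s*"[^"]*"',
        lambda _match: f'version = "{version}"',
        pyproject,
        count=1,
    )
    if count != 1:
        raise ValueError("no version line found in pyproject.toml text")
    return new_text


if __name__ == "__main__":
    before = "# Changelog\n\n## [Unreleased]\n\n### Fixed\n- A fix.\n"
    print(cut_changelog(before, "0.54.4", date(2026, 5, 14)))
```

Keeping both steps as functions over strings makes them easy to test before they touch the working tree. Writing the results back to CHANGELOG.md and pyproject.toml is left to the caller.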

Release workflow — concrete recipe

Happy path (8 steps)

# 1. Branch from a fresh checkout. iCloud Drive worktrees randomly hang
#    on git operations — use a fresh shallow clone in /tmp instead.
cd /tmp && git clone --depth 50 --branch main \
  https://github.com/keboola/agnes-the-ai-analyst.git agnes-<topic>
cd agnes-<topic> && git checkout -b zs/<branch-name>

# 2. Make the change + tests. Run the AREA pytest while iterating
#    (e.g. `pytest tests/test_X.py -p no:xdist -q`).

# 3. Add a CHANGELOG bullet under [Unreleased].
#    Group: Added | Changed | Fixed | Removed | Internal
#    Mark BREAKING with **BREAKING** prefix.

# 4. Commit the change(s). Multiple logical commits OK; release-cut
#    will be a SEPARATE last commit (next step). DO NOT bundle the
#    release-cut into the same commit as the change — it pollutes
#    the SHA that auto-close keywords reference and makes revert
#    targeted at the change-only difficult.

# 5. Run the full pytest suite locally:
#    `pytest tests/ -p no:xdist -q` (or `-n auto` if xdist works).
#    Pre-existing fails (e.g. test_readers_in_pre_init_dir under
#    subprocess timeout) are OK to ignore; verify by reverting your
#    diff and reproducing on bare main.

# 6. Release-cut commit (LAST commit on the PR per the rule above):
#    - Bump pyproject.toml: version = "X.Y.Z"
#    - Rename `## [Unreleased]` → `## [X.Y.Z] — YYYY-MM-DD`
#    - Add a fresh empty `## [Unreleased]` line above
#    Commit message: `release: X.Y.Z — <one-line summary>`

# 7. Push branch + open PR + enable auto-merge SQUASH:
#    git push -u origin HEAD
#    gh pr create --repo keboola/agnes-the-ai-analyst \
#      --head <branch> --title "<...>" --body "<...>"
#    gh pr merge <N> --repo keboola/agnes-the-ai-analyst \
#      --squash --auto --delete-branch

# 8. After auto-merge fires (poll or `Monitor`):
#    git fetch origin --tags
#    git tag vX.Y.Z <merge-sha>
#    git push origin vX.Y.Z
#    gh release create vX.Y.Z --repo keboola/agnes-the-ai-analyst \
#      --title "vX.Y.Z — <...>" --notes "<copy-paste from CHANGELOG>"

Picking the next version

pyproject.toml's current version is the next-release target (post-cut from the previous release). Pre-1.0, every release cut today is a patch bump, including ones that break operator-facing APIs:

  • instance.yaml schema additions, new env vars, new endpoints → patch (e.g. 0.54.3 → 0.54.4)
  • New CLI subcommands, BREAKING removals, schema migrations → still patch within the current 0.5x cycle (no minor bumps cut today)
  • The CHANGELOG **BREAKING** marker is what operators grep for; the version number is secondary

Always check git tag -l "v0.X*" before naming — if v0.54.0 is already tagged, the next one is v0.54.1, even if pyproject.toml still says 0.54.0 from a stale post-cut commit (we've shipped that race before).
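
The tag check above can be sketched as a small helper. This is an illustrative sketch under stated assumptions: next_version and existing_tags are hypothetical names, tags are assumed to have the exact form vX.Y.Z, and the rule implemented is "if the candidate from pyproject.toml is already tagged (or is older than the newest tag in the same X.Y series), take the newest patch plus one".

```python
import re
import subprocess

_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def next_version(pyproject_version: str, tags: list[str]) -> str:
    """Return the version to cut, skipping past patches that are already tagged."""
    major, minor, patch = (int(part) for part in pyproject_version.split("."))
    same_series = []
    for tag in tags:
        match = _TAG.match(tag.strip())
        if not match:
            continue  # ignores non-app tags such as infra-v* or keboola-deploy-*
        t_major, t_minor, t_patch = (int(part) for part in match.groups())
        if (t_major, t_minor) == (major, minor):
            same_series.append(t_patch)
    if same_series and max(same_series) >= patch:
        patch = max(same_series) + 1
    return f"{major}.{minor}.{patch}"


def existing_tags(pattern: str = "v0.*") -> list[str]:
    """List local tags matching the pattern via `git tag -l`."""
    result = subprocess.run(
        ["git", "tag", "-l", pattern],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    print(next_version("0.54.0", ["v0.54.0"]))
```

Run git fetch origin --tags first, otherwise existing_tags() only sees tags that already exist in the local clone.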

Authoring expectations on the PR

  • Self-PRs (you're both author and reviewer): GitHub forbids self-approve. If branch protection requires N approving reviews (we don't today — required_approving_review_count = 0), you need someone else to approve. With our current 0-review setup, self-PRs can still merge automatically once required CI passes.
  • Other people's PRs you're taking over: dismiss any prior CHANGES_REQUESTED reviews (yours or someone else's), because auto-merge cannot fire while one is outstanding. After pushing your fixes, approve with gh pr review <N> --approve --body "...".
  • Devin Review: not a required check today; runs in parallel and posts a comment. Don't wait on it for merge unless the human reviewer explicitly asks.

CI quirks you WILL hit

  • gh pr checks reports CANCELLED as fail. When you force-push (rebase, amend), GitHub auto-cancels the in-flight Release workflow run on the older SHA. Those cancelled jobs show up as "fail" in the PR's check summary and checks tab forever, even after newer runs succeed. Look at the conclusion column, not just the count. Rule of thumb: if the same check name appears with both pass and fail rows, the fail row is from an older auto-cancelled SHA. Verify with gh api repos/keboola/agnes-the-ai-analyst/commits/<sha>/check-runs; the raw API distinguishes cancelled from failure truthfully.
  • Branch protection's "strict" mode caches cancelled test as blocking even after newer test runs succeed. Symptom: mergeable_state: blocked despite all required checks green on the latest SHA. Fix: re-run the cancelled Release workflow run (gh run rerun <run-id>); once its test job lands as success, the block clears. We've hit this on PRs #273, #281, #285, #286.
  • Required checks (per branch protection): test + docker-build only. Other workflows (cli-wheel-clean-install, build-and-push, Release-pipeline, Devin Review) are advisory — green/red doesn't gate merge.
  • enforce_admins: true in branch protection means the --admin flag on gh pr merge does NOT bypass the block. Don't try; just fix the underlying block.
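
To make the cancelled-versus-failed distinction concrete, here is a minimal sketch that summarizes check runs for the required checks. It assumes the input is the check_runs array from the check-runs API response, using its name, head_sha and conclusion fields, and that the check-run names match the required checks named above (test, docker-build). The function names are hypothetical.

```python
REQUIRED_CHECKS = ("test", "docker-build")


def gating_status(check_runs: list[dict], head_sha: str) -> dict[str, str]:
    """Map each required check to its conclusion on the given SHA.

    'missing' means no run was found; 'pending' means a run has no conclusion yet.
    A success on the SHA wins over any other run of the same check (e.g. a rerun).
    """
    status = {name: "missing" for name in REQUIRED_CHECKS}
    for run in check_runs:
        name = run.get("name")
        if name not in status or run.get("head_sha") != head_sha:
            continue
        if status[name] != "success":
            status[name] = run.get("conclusion") or "pending"
    return status


def cancelled_on_old_shas(check_runs: list[dict], head_sha: str) -> list[str]:
    """Names of checks that were cancelled on a SHA other than the latest one."""
    return sorted({
        run["name"]
        for run in check_runs
        if run.get("conclusion") == "cancelled" and run.get("head_sha") != head_sha
    })


if __name__ == "__main__":
    # Hypothetical data: an older SHA whose run was auto-cancelled by a force-push.
    runs = [
        {"name": "test", "head_sha": "aaa", "conclusion": "cancelled"},
        {"name": "test", "head_sha": "bbb", "conclusion": "success"},
        {"name": "docker-build", "head_sha": "bbb", "conclusion": "success"},
    ]
    print(gating_status(runs, "bbb"))
    print(cancelled_on_old_shas(runs, "bbb"))
```

The commits/<sha>/check-runs endpoint returns runs for one commit at a time, so comparing SHAs means merging the arrays from two calls before passing them in.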

Recovery when something derails

  • Force-pushed and lost auto-merge? GitHub usually preserves auto-merge across force-pushes for the same PR; if it cleared, just re-run gh pr merge <N> --squash --auto --delete-branch.
  • Release-cut commit forgot to land? That's the failure mode the "Release-cut belongs to the PR" rule prevents. If it happens anyway: open a follow-on PR with ONLY the release-cut commit, ship it, and write up why in your post-mortem comment.
  • Wrong version number tagged? git tag -d vX.Y.Z && git push --delete origin vX.Y.Z then re-tag against the right SHA. Update the GitHub Release if you already created it.

Deploy workflows

Two separate release.yml-style workflows produce GHCR images. Pick the one that matches what you're shipping.

release.yml — auto-build on every push

Runs on every push to every branch.

  • Push to main → :stable, :stable-YYYY.MM.N (CalVer).
  • Push to non-main <prefix>/<branch> → :dev, :dev-YYYY.MM.N, :dev-<branch-slug>, and (when prefix isn't a Git Flow convention) :dev-<prefix>-latest alias.

VMs that pin to a floating tag (:dev, :dev-<prefix>-latest) auto-upgrade within ~5 min via the cron in agnes-auto-upgrade.sh. Convenient for per-developer dev VMs; footgun for shared dev VMs (last pusher wins, regardless of who).
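
The tag mapping above can be sketched as a function from branch name to image tags. Several details here are assumptions, not taken from release.yml: the slug rule (lowercase, runs of non-alphanumerics collapsed to "-"), whether the slug includes the prefix, the set of prefixes treated as Git Flow conventions, and the CalVer string being supplied by the caller. Treat it as a reading aid and check the workflow file for the real rules.

```python
import re

# Assumed set of Git Flow prefixes; the workflow's actual list may differ.
GIT_FLOW_PREFIXES = {"feature", "bugfix", "hotfix", "release", "support"}


def slugify(name: str) -> str:
    """Assumed slug rule: lowercase, non-alphanumeric runs become a single '-'."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def image_tags(branch: str, calver: str) -> list[str]:
    """Return the image tags a push to `branch` would publish, per the list above."""
    if branch == "main":
        return ["stable", f"stable-{calver}"]
    tags = ["dev", f"dev-{calver}", f"dev-{slugify(branch)}"]
    prefix, separator, _rest = branch.partition("/")
    if separator and prefix not in GIT_FLOW_PREFIXES:
        # Per-developer floating alias, e.g. for a personal dev VM.
        tags.append(f"dev-{slugify(prefix)}-latest")
    return tags


if __name__ == "__main__":
    print(image_tags("zs/fix-login", "2026.05.3"))
```

The sketch also shows why :dev is risky on shared VMs: every non-main branch publishes it, while :dev-<prefix>-latest is only moved by branches under that one prefix.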

keboola-deploy.yml — tag-triggered, explicit deploy only

Runs only on git tags matching keboola-deploy-*. Publishes:

  • :keboola-deploy-<git-tag-suffix> — immutable, tied to the exact commit
  • :keboola-deploy-latest — floating alias the consumer pins to

Operator workflow:

git checkout <commit-or-branch>
git tag keboola-deploy-<descriptive-name>
git push origin keboola-deploy-<descriptive-name>
# → workflow builds + publishes both tags
# → VM cron picks up :keboola-deploy-latest within ~5 min
# → manual cron trigger (skip the wait): sudo /usr/local/bin/agnes-auto-upgrade.sh on the VM

Use this when the consumer (e.g. a customer dev VM) needs deploy-when-I-decide semantics — no surprise rollouts from upstream branch pushes by other contributors. The infra repo pins image_tag = "keboola-deploy-latest" on the relevant VM.
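
For comparison with release.yml, the tag-triggered mapping can be sketched as follows. The function name is hypothetical. Rejecting the suffix "latest" is a precaution added in this sketch so an immutable tag cannot collide with the floating alias, and is not a documented rule of the workflow.

```python
TAG_PREFIX = "keboola-deploy-"


def deploy_image_tags(git_tag: str) -> list[str]:
    """Return the image tags published for a git tag, or [] if it does not
    match the keboola-deploy-* trigger pattern."""
    if not git_tag.startswith(TAG_PREFIX) or git_tag == TAG_PREFIX:
        return []
    suffix = git_tag[len(TAG_PREFIX):]
    if suffix == "latest":
        # Precaution of this sketch: the immutable tag would shadow the alias.
        raise ValueError("suffix 'latest' collides with the floating alias")
    return [f"{TAG_PREFIX}{suffix}", f"{TAG_PREFIX}latest"]


if __name__ == "__main__":
    # Hypothetical tag name, for illustration only.
    print(deploy_image_tags("keboola-deploy-2026-05-14-hotfix"))
```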

Module versioning

The customer-instance Terraform module under infra/modules/customer-instance/ is published as infra-vMAJOR.MINOR.PATCH git tags (separate from app CalVer tags). Bump on any module-API change; downstream infra repos pin to the tag in their source = "github.com/keboola/agnes-the-ai-analyst//infra/modules/customer-instance?ref=infra-v1.X.Y".

After merging a module change to main:

git tag infra-vX.Y.Z origin/main
git push origin infra-vX.Y.Z

Replacing a VM after a startup-script change

The module sets lifecycle { ignore_changes = [metadata_startup_script] } on google_compute_instance.vm so a normal terraform apply doesn't churn running VMs. To propagate a startup-script update, trigger the consumer's apply workflow manually with the VM resource address. A typical workflow_dispatch input is recreate_targets='module.agnes.google_compute_instance.vm["<vm-name>"]'.

Appendix: CHANGELOG entry skeleton

Copy this when adding to ## [Unreleased] in CHANGELOG.md. Drop the sections you don't need; keep the Keep-a-Changelog order.

### Added
- New feature description.

### Changed
- Change description. **BREAKING** prefix + migration steps if operator-facing.

### Fixed
- Bug fix description.

### Removed
- **BREAKING** removed feature — what replaces it.

### Internal
- Refactors, test additions, dependency bumps with no behavior change.

At release-cut time ## [Unreleased] is renamed to ## [X.Y.Z] — YYYY-MM-DD and a fresh empty ## [Unreleased] is added on top. CI publishes the matching stable-YYYY.MM.N image tag for the merge commit (see Deploy workflows above).