AI-Cognitive-Leap/agnes-the-ai-analyst

Commit graph

Author

SHA1

Message

Date

ZdenekSrotyr

9f5adbce37

ci: consolidate release pipeline (salvageable subset of #139) (#314)

* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.
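The injection risk in the first rollback.yml item can be reproduced outside GitHub Actions. The following is a minimal sketch, not the workflow's actual code: the `__INPUT__` placeholder, the payload, and the `FAILED_IMAGE_TAG` variable name are illustrative assumptions. It contrasts splicing a value into script text before bash parses it (the effect of textual `${{ }}` expansion in a run block) with handing the same value over as an environment variable.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical hostile input: closes the quote and smuggles in a second command.
payload='x"; echo INJECTED; echo "'

# Unsafe: the value becomes part of the script text before bash parses it,
# which mirrors textual ${{ }} splicing into a run: block.
template='echo "tag=__INPUT__"'
unsafe_out="$(bash -c "${template//__INPUT__/$payload}")"

# Safe: the value travels through the environment and is expanded as data only.
# FAILED_IMAGE_TAG is an assumed variable name for this sketch.
safe_out="$(FAILED_IMAGE_TAG="$payload" bash -c 'echo "tag=$FAILED_IMAGE_TAG"')"

printf 'unsafe:\n%s\nsafe:\n%s\n' "$unsafe_out" "$safe_out"
```

In the unsafe case the smuggled `echo INJECTED` runs as its own command; in the safe case the whole payload stays inside the printed string.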

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.
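The pipefail failure described in the first prune-dev-tags.sh item can be sketched in isolation. This assumes bash 4 or newer (for `mapfile`) and uses a synthetic 200000-entry backlog with made-up `tag-N` names, chosen only so the output is larger than the pipe buffer; the real script's contents are not reproduced here.

```shell
#!/usr/bin/env bash
# errexit is left off so the sketch can record the failing status and continue.
set -uo pipefail

# Hypothetical backlog, large enough to overflow the pipe buffer.
mapfile -t TO_PRUNE < <(seq -f 'tag-%g' 1 200000)

# Old shape: head exits after 20 lines, the writer is cut off mid-stream,
# and pipefail surfaces the writer's failure as the pipeline status.
printf '%s\n' "${TO_PRUNE[@]}" | head -20 >/dev/null
pipe_status=$?

# New shape: slice the array. There is no pipe, so nothing can be cut off.
preview=("${TO_PRUNE[@]:0:20}")
printf '%s\n' "${preview[@]}" >/dev/null
slice_status=$?

echo "pipeline status: $pipe_status, slice status: $slice_status, preview size: ${#preview[@]}"
```

With a short list the writer usually finishes before `head` closes the pipe, which is why the failure only shows up once a backlog has accumulated. Under `set -e` the nonzero pipeline status would abort the script at that line.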

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).
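The walk-back target resolution, together with the earlier TARGET == FAILED guard, might look roughly like the sketch below. Everything named here is an assumption for illustration: the tag values, the `DEPRECATED` list, and the `is_deprecated` and `resolve_target` helpers are hypothetical, and a local array stands in for the 'docker manifest inspect' probe against GHCR. It also assumes "stripped" means the tag without its `stable-` prefix.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the GHCR probe: versions listed here are treated
# as already carrying a :deprecated-<stripped> alias.
DEPRECATED=("2026.05.3")

is_deprecated() {
  local stripped="$1" d
  for d in "${DEPRECATED[@]}"; do
    [[ "$d" == "$stripped" ]] && return 0
  done
  return 1
}

# Print the newest stable-* tag that is neither the failed tag nor already
# marked deprecated; return 1 if no candidate remains.
resolve_target() {
  local failed="$1"; shift
  local tag
  for tag in "$@"; do                       # tags are expected newest-first
    [[ "$tag" == "$failed" ]] && continue   # never roll back onto the failed image
    is_deprecated "${tag#stable-}" && continue
    printf '%s\n' "$tag"
    return 0
  done
  return 1
}

target="$(resolve_target stable-2026.05.4 \
  stable-2026.05.4 stable-2026.05.3 stable-2026.05.2)"
echo "rollback target: $target"
```

When the failed tag is the only stable-* tag, `resolve_target` returns 1, so a caller running under `set -e` stops instead of re-pushing the broken image.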

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.
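The PRUNE_FAILED pattern from the last item can be sketched with stubs in place of git. `remote_delete` and `local_delete` are hypothetical stand-ins for 'git push --delete' and 'git tag -d', and the tag names are invented; only the control flow is the point.

```shell
#!/usr/bin/env bash
set -uo pipefail

PRUNE_FAILED=0
deleted_locally=()

# Hypothetical stub for 'git push --delete origin <tag>'; it refuses one tag
# to imitate a tag-protection rule.
remote_delete() { [[ "$1" != "protected-tag" ]]; }

# Hypothetical stub for 'git tag -d <tag>'.
local_delete() { deleted_locally+=("$1"); }

for tag in old-tag-a protected-tag old-tag-b; do
  if remote_delete "$tag"; then
    local_delete "$tag"            # only reached after the remote delete succeeded
  else
    echo "remote delete refused for $tag; keeping local tag for retry" >&2
    PRUNE_FAILED=1                 # remember the failure, keep processing
  fi
done

# The commit describes a final 'exit 1' when the flag is set; this sketch only
# records the status so it can be pasted into an interactive shell.
final_status=$PRUNE_FAILED
echo "deleted locally: ${deleted_locally[*]} (final status: $final_status)"
```

The loop keeps going after a failure so one refused tag does not block the rest, and the run still ends red.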

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.
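The gap between the bare glob and find is easy to see on a throwaway tree. This sketch builds a temporary directory with invented file names (`top.sh`, `nested/deep.sh`) and assumes bash 4 or newer for `mapfile`; note that the `-name` pattern is quoted so the shell does not expand it before find sees it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway tree with one top-level script and one nested script.
tmp="$(mktemp -d)"
mkdir -p "$tmp/scripts/ops/nested"
touch "$tmp/scripts/ops/top.sh" "$tmp/scripts/ops/nested/deep.sh"

# Bare glob: matches top-level files only.
glob_matches=("$tmp"/scripts/ops/*.sh)

# find: descends into subdirectories as well.
mapfile -t find_matches < <(find "$tmp/scripts/ops" -type f -name '*.sh' | sort)

echo "glob: ${#glob_matches[@]} file(s), find: ${#find_matches[@]} file(s)"

rm -rf "$tmp"
```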

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).

2026-05-15 14:06:59 +02:00

ZdenekSrotyr

a48524509a

docs: consolidate and de-clutter the documentation tree (#306)

CLAUDE.md rewritten (708 -> ~320 lines): four overlapping release
sections collapsed to one, stale v1->v35 schema history dropped (it
lives in CHANGELOG), marketplace endpoint internals and verbose
process sections moved out or tightened.

New focused docs:
- docs/RELEASING.md - release process, deploy workflows, CI quirks
  (RELEASE_TEMPLATE.md folded in as an appendix)
- docs/marketplace.md - marketplace ingestion + re-serving internals
- docs/README.md - documentation index by audience, linked from
  README.md and CLAUDE.md

Archived under docs/archive/: docs/superpowers/ (52 historical
planning artifacts), HACKATHON.md, pd-ps-comments.md,
security-audit-2026-04.md, future/NOTIFICATIONS.md.

Removed the docs/auto-install.md stub. Fixed dangling links in
connectors/jira/README.md and dev_docs/README.md, repointed
code/doc references to archived paths.

2026-05-14 18:54:22 +00:00

2 commits