Commit graph

21 commits

Author SHA1 Message Date
ZdenekSrotyr
9f5adbce37
ci: consolidate release pipeline (salvageable subset of #139) (#314)
* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name *.sh' to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
2026-05-15 14:06:59 +02:00
ZdenekSrotyr
a1c7849b3e
ci: shard test suite + drop duplicate test run (#311)
The `test` job in ci.yml becomes a 4-way `test-shard` matrix (pytest-split,
balanced by a committed .test_durations), aggregated into a single `test`
status check so branch protection is unchanged.

release.yml's duplicate full-suite `test` job is removed — it re-ran the
same ~10 min suite a second time on every push to main/feature branches.
release.yml is now image-build only; the advisory ruff/mypy steps move to
a lean `lint` job in ci.yml.

Net: ~10 min -> ~3 min wall-clock per push, and the suite runs once
instead of twice.
2026-05-14 20:18:21 +00:00
ZdenekSrotyr
103669dafd
fix(cli-install): move kbcstorage to [server] extra so wheel installs cleanly (P0 onboarding hotfix → 0.53.4) (#272)
* fix(cli-install): move kbcstorage to [server] extra so wheel installs cleanly

The 0.53.3 wheel served at /cli/wheel/ is unsatisfiable on a clean machine:
analyst runs `uv tool install <wheel-url>` per the published /setup
instructions and the resolver immediately fails with

    Because kbcstorage<=0.9.5 depends on urllib3<2.0.0 and
    agnes-the-ai-analyst==0.53.3 depends on kbcstorage>=0.9.0 and
    urllib3>=2.7.0, we can conclude that agnes-the-ai-analyst==0.53.3
    cannot be used.

The `[tool.uv] override-dependencies = ["urllib3>=2.7.0"]` in pyproject.toml
masked the conflict in workspace contexts (Dockerfile + dev install) but
does NOT propagate to the wheel — wheel METADATA is plain PEP 621
Requires-Dist, and a fresh resolver context (uv tool install <wheel-url>)
never sees the override. Every existing test passed because the dev venv
already has kbcstorage 0.9.5 + urllib3 2.7.0 coexisting under workspace
overrides; the break only surfaces on the next analyst's first install.

Fix: kbcstorage moved out of [project] dependencies into
[project.optional-dependencies].server, since it is server-side only
(connectors/keboola/client.py is the sole import site, called from admin
endpoints, server connectors, and integration tests — never from the CLI
install path). Server install picks it up via Dockerfile's
`uv pip install --system --no-cache ".[server]"`. CI installs `.[dev,server]`
so workspace tests still cover the kbcstorage path. Analyst CLI wheel
METADATA now lists `kbcstorage>=0.9.0; extra == 'server'` (gated) and
`uv tool install <wheel>` resolves cleanly.

Verified end-to-end:
- Built wheel locally; inspected METADATA — kbcstorage line is now `; extra == 'server'`.
- `docker run --rm python:3.13-slim` + `uv tool install <wheel>`: agnes 0.53.4 installs, `agnes --version` works, `agnes catalog --help` renders, kbcstorage absent from CLI venv, urllib3 = 2.7.0.
- Same container with `.[server]` install path: kbcstorage present, urllib3 = 2.7.0 (override applies in workspace context).
- Full pytest suite green locally (4157 passed, 25 skipped).

* release: 0.53.4 — analyst CLI install hotfix (urllib3/kbcstorage resolver conflict)

Patch bump shipping the [server] extra split + new clean-install CI lane.
No DB migration; no API change; no operator-facing config change.
Operator side (Dockerfile path) auto-picks `.[server]` so the production
image gains kbcstorage transparently. Analyst onboarding (uv tool install
<wheel>) starts working again.
2026-05-12 17:09:44 +00:00
minasarustamyan
9de679c714
System plugins (schema v39) + marketplace UX polish + drop legacy pages (#241)
* System plugin tier with mark/unmark fanout (schema v39)

Adds a mandatory plugin tier so admins can pin a small set of curated
plugins into every user's stack from day one. Marking a plugin via the
new toggle on /admin/marketplaces materializes resource_grants for every
group and user_plugin_optouts subscriptions for every user, so the
existing resolver pulls the plugin into every served set without a new
filter layer. Hooks on user-create (Google OAuth, magic-link, admin
POST, scheduler) and group-create propagate the same materialization to
new principals. UI locks: /admin/access disables the checkbox with a
SYSTEM pill; /marketplace cards swap the "In stack" green pill for an
amber "Required" badge with shield icon; the plugin detail install
button reads "Required by your org"; /my-ai-stack toggle is disabled.
Bypass paths return 409 (DELETE /api/admin/grants for system grants,
PUT /api/my-stack/curated/.../{enabled:false}, DELETE
/api/marketplace/curated/.../install). Unmark only flips the flag —
materialized rows persist so admins curate cleanup at their leisure
through the now-unlocked /admin/access checkboxes.

* Marketplace UX polish + drop legacy /store and /my-ai-stack pages

Two-part cleanup post-v39:

(1) Page deletion. /store and /my-ai-stack were already replaced by
/marketplace?tab=flea and /marketplace?tab=my respectively, but the
standalone routes lingered. Hard delete in dev mode — no redirects,
stale bookmarks 404. The /store/new upload wizard, the flea
detail/edit pages, the admin queue, and all /api/store/* +
/api/my-stack endpoints (CLI consumers) stay. Internal hardcoded
hrefs in the upload wizard's Cancel button and the advanced-setup
page repointed to the marketplace tabs.

(2) Detail-page install button rework. The single button that morphed
between "+ Add to my stack" and "✓ In your stack" did not
communicate uninstall affordance. The installed state now renders an
inline white status label *before* a separate red-bordered
"✕ Remove from stack" button on the same row, both at identical
height to avoid layout shift. System plugins keep their locked amber
"✓ Required by your org" pill (no Remove button — API refuses 409).
The post-action hint panel now fires on remove too with the title
flipped to "✓ Removed from your stack" — Claude Code needs the same
/update-agnes-plugins refresh either way.

Also: /admin/marketplaces Details modal "Mark as system" toggle
redesigned. The button was near-invisible (matched neutral row
metadata). It's now a balanced amber-toned chip with shield icon
and a structured confirm modal replacing the native confirm() dialog
that summarizes fanout consequences before commit.

* Move stack-hint inside hero with glass-on-gradient styling

The post-action hint card ("✓ Added to your stack" with the
/update-agnes-plugins recipe) used to live below the hero in
panel-what (gray card on white page body). Clicking add/remove
inserted/removed it between the hero and content, shifting the
panels below — a noticeable scroll jump.

The hint is now anchored inside the hero's top-right corner alongside
the install/remove buttons, both as flex children of an absolutely
positioned .actions container. The card uses a translucent
white-on-glass treatment that adopts the hero's kind color (blue for
plugin, green for skill, purple for agent) without per-kind branching.
Hero is always tall enough (160px photo) to contain the action+hint
stack without overflow, so toggling the hint visibility doesn't grow
the hero or shift body content.

The hero-head grid reserves a third 300px column for the absolute
actions overlay so meta gets the proper 1fr free space instead of
being squeezed by a padding-right hack. Responsive breakpoint at
1100px reflows the actions stack below hero-head when the viewport
isn't wide enough to keep meta + actions side-by-side comfortably.

* Add optional -DataPath bind mount to run-local-dev.ps1

When the operator wants to inspect DuckDB files (system.duckdb, extracts,
marketplaces, store/, …) directly from Windows Explorer, the named volume
inside the Docker Desktop WSL VM isn't reachable. The new -DataPath param
generates a transient compose override that rebinds /data on app, scheduler,
extract (and Caddy's /srv:ro mirror) to a Windows host folder.

Fully additive — when -DataPath is omitted everything behaves exactly as
before: no override file is generated, $composeFiles array is unchanged,
finally cleanup is a no-op. Existing positional invocations
(.\run-local-dev.ps1 up | down | logs) keep binding to $Action because
$DataPath is a named-only parameter with no Position attribute.

The override is written via [System.IO.File]::WriteAllText so the YAML is
BOM-less across PS 5.1 / 7+ — Compose rejects BOM-prefixed YAML on Windows.
The override file is unique per PID and removed in the script's finally
block so concurrent invocations and crashes don't leak files.

* factor mark_system fanout into UserCuratedSubscriptionsRepository

The endpoint imported UserCuratedSubscriptionsRepository, ignored it
(noqa: F841), then duplicated the user-side fanout SQL inline. Adds
fanout_system_for_plugin() symmetric to the existing
fanout_system_for_user() and routes mark_plugin_system through it —
removes the dead import + 14 lines of inline SQL, returns the same
`affected_users` delta count, no behavior change.

* drop customer-specific path from .ps1 example

Per CLAUDE.md vendor-agnostic OSS rule: replaced
C:\\Business\\Groupon\\Agnes\\agnes-data with the generic
C:\\Users\\<you>\\agnes-data placeholder so the docstring
example reads cleanly on any reviewer's box.

* release: 0.48.0 + parallelize Release-workflow pytest

Cuts the release shipped via #228 #230 #231 #232 #233 #234 #236 #237 #238
#239 #240 plus this PR (#241). Major changes:

- System plugin tier (schema v39) — admins mark a plugin mandatory; fans
  out RBAC grants + subscriptions to every existing user/group plus
  hooks for new principals
- BREAKING: removed standalone /store + /my-ai-stack page routes
  (replaced by /marketplace?tab=flea + /marketplace?tab=my)
- Setup-prompt + bootstrap recovery fixes (#240)
- DuckDB CHECKPOINT-on-shutdown + 60s compose grace (#235)
- Marketplace + flea-market UX polish, agnes-metadata.json enrichment

Bonus: switch release.yml test step to `-n auto` (matches ci.yml).
Single-threaded was 15-20 min and frequently the bottleneck on PR
mergeability — now ~6 min.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-10 19:15:41 +00:00
ZdenekSrotyr
b5178fe942
fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140)
Three CI fixes triggered by the failed PR #137 deploy:

1. scripts/smoke-test.sh: assertion 8 was hitting /api/admin/tables (renamed to /api/admin/registry long ago). The 404 was treated as deployment regression and triggered the auto-rollback. Same stale URL also fixed in CLAUDE.md, README.md, dev_docs/server.md.

2. .github/workflows/release.yml smoke-test job: added Log in to GHCR step. The auto-rollback's docker push :stable was failing with 'unauthenticated' because the smoke-test job had no GHCR login of its own — leaving :stable pointing at the broken image.

3. Rollback step gained GH_TOKEN env, AND the workflow's permissions block gained issues:write. Both were needed for gh issue create to actually create the alert issue (was silently swallowed by the || echo fallback).

Manual cleanup outside this PR: :stable currently points at the broken PR #137 image — needs manual retag back to stable-2026.04.505.
2026-04-30 09:42:27 +02:00
dependabot[bot]
7012966482
chore(deps): bump actions/checkout from 5 to 6 (#125)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>
2026-04-29 09:58:48 +02:00
dependabot[bot]
62a5b8540a
chore(deps): bump actions/upload-artifact from 4 to 7 (#123)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 09:38:38 +02:00
ZdenekSrotyr
61f6b8d2d5
feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)
Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-29 09:18:55 +02:00
ZdenekSrotyr
33b318e491
ci(release): build dev image on branch creation from main (#118)
Fixes the per-developer dev-VM workflow: paths-ignore on push skipped same-SHA branch creates, build-and-push.if was main+dispatch-only. Add create: trigger filtered to branch refs, broaden build-and-push.if, add concurrency group keyed on github.ref with cancel-in-progress to dedupe create+push collisions on new branches with code changes.
2026-04-29 08:15:30 +02:00
ZdenekSrotyr
5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() to
   prevent FD leaks until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00
Petr Simecek
1bbbe58ea0
release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43)
* fix(cli): versioned wheel URL in setup instructions; drop broken /cli/agnes.whl alias (#36)

* fix(cli): inline PEP 427 wheel filename in setup instructions

`uv tool install <server>/cli/agnes.whl` fails with

    error: The wheel filename "agnes.whl" is invalid: Must have a version

because uv validates the filename in the URL path *before* fetching — so
the server-side Content-Disposition header (which has the real versioned
filename) is never consulted, and an HTTP redirect does not help either:
uv resolves the filename from the initial URL.

Fix the root cause by inlining the real PEP 427 filename into the setup
snippet the dashboard copies to the clipboard. The wheel filename is
resolved server-side via `_find_wheel()` and substituted into the lines
returned from `setup_instructions.resolve_lines()`, so both the read-only
HTML preview and the JS clipboard renderer get byte-identical output.

Also added `/cli/wheel/{filename}` to serve wheels at their PEP 427 path,
and kept `/cli/agnes.whl` as a 302 redirect for manual/legacy callers —
though that redirect alone is NOT sufficient for `uv tool install` (uv
validates before following redirects) and is there only as defense-in-depth.

Verified locally:
- `uv tool install <server>/cli/wheel/agnes_the_ai_analyst-2.0.0-py3-none-any.whl` succeeds
- `/install` HTML now renders the versioned URL; `/cli/agnes.whl` no longer appears in the rendered snippet

* fix(cli): remove /cli/agnes.whl alias entirely — it only confused users

The bareword alias was never actually usable:

- `uv tool install <server>/cli/agnes.whl` fails at filename validation
  before any HTTP fetch, so neither the Content-Disposition header nor a
  302 redirect rescued it.
- The 302-to-versioned-path fallback left a visibly "working" URL in
  browser / curl -L contexts, which is exactly how the original bug got
  reported in the first place ("the URL loads, why doesn't install work?").

Remove the endpoint and scrub all remaining references. The only CLI wheel
URL is now `/cli/wheel/{filename}` with the real PEP 427 filename, which
the setup-instructions template already generates server-side.

Existing tests that referenced /cli/agnes.whl become negative tests
("must not appear") so we don't regress.

* feat(cli): --version flag; sync --dry-run + progress indicator (#38)

* feat(cli): add --version / -V flag

Prints `da <version>` from package metadata (importlib.metadata). Falls
back to "unknown" when the package is not installed (e.g. running from a
source checkout without `uv pip install -e .`), instead of crashing.

Eager typer callback, so `da --version` exits before subcommand
resolution and does not require any auth/config.

* feat(cli): da sync --dry-run + X/N progress indicator

--dry-run reports what would be downloaded/uploaded without hitting the
API or writing local state. Supports the full flag set (--table, --json,
--upload-only); JSON shape is {"dry_run": true, "would_download": [...],
"summary": {...}}.

Progress bar now shows "[X/N] Downloading <table>..." with a Rich
BarColumn + TaskProgressColumn + TimeElapsedColumn instead of a bare
spinner — makes long syncs visible.

* feat(cli): durable sync + server gzip + auto-update check (#41)

* fix(sync): atomic writes + manifest hash verification + retry on transient errors

Three durability hooks around stream_download and the sync command:

1. Atomic writes. stream_download now streams into `<target>.tmp` and
   calls os.replace() on success, so the real target file never exists
   in a half-written state. On failure the tmp is unlinked — no cleanup
   leftovers, no guard needed at read time.

2. Retry with backoff. Transient errors (ConnectError, ReadError,
   WriteError, RemoteProtocolError, TimeoutException, 5xx) are retried
   up to 3× with 0.3s / 1s / 3s backoff. 4xx (auth, 404) surfaces
   immediately — retrying those is pointless.

3. Manifest-hash verification. After download, sync.py computes MD5 of
   the target (same 8KiB chunking as app/api/sync.py:_file_hash) and
   compares against `server_tables[tid]["hash"]`. Mismatch ⇒ unlink,
   record error, skip state commit. The PAR1 structural check survives
   as a fallback for legacy manifests without a hash.

Also makes _rebuild_duckdb_views tolerant: single broken parquet is
skipped with a stderr warning instead of killing the whole rebuild.

Supersedes #40 — this commit is a strict super-set (hash check + PAR1
fallback + atomic write + retry). #40 can be closed without merging.

* perf(server): enable GZipMiddleware for JSON / HTML responses

GZipMiddleware at minimum_size=1024 shaves bandwidth on manifest-style
JSON endpoints (/api/sync/manifest, /api/version, …) and the /install
HTML preview. Parquet file downloads are already columnar-compressed so
the middleware sees limited benefit there — but it doesn't hurt, httpx
on the client side decompresses transparently.

Placed after session middleware so gzip wraps the session-Set-Cookie
response too, and before CORSMiddleware so compression is applied to
both cross-origin and same-origin responses.

* feat(cli): auto-check for newer CLI version on startup

Server side
- GET /cli/latest returns {version, wheel_filename, download_url_path}
  for whatever wheel is currently in AGNES_CLI_DIST_DIR. Public,
  cacheable, no secrets — consumed by the CLI auto-update probe.

Client side
- New cli/update_check.py: reads /cli/latest with a 3s timeout, caches
  the result in $DA_CONFIG_DIR/update_check.json for 24h. Cache is
  invalidated when the installed version changes (e.g. after a fresh
  `uv tool install`) so stale "you're behind" warnings don't linger.
- Root typer callback fires the probe before subcommand dispatch; any
  failure is swallowed so a bad network never blocks a working command.
- Outdated → one-line stderr warning:
    [update] da 2.0.0 is out of date — latest on this server is 2.1.0.
    Upgrade: uv tool install --force <server>/cli/wheel/<…>.whl
- Disable with DA_NO_UPDATE_CHECK=1.

* fix(pr-review): None-guard the upgrade line + skip gzip on parquet paths

Two follow-ups from Devin review on #41.

1. format_outdated_notice(UpdateInfo(download_url=None)) emitted literal
   "uv tool install --force None" — copy-pasting that fails. Drop the
   upgrade snippet when the URL is absent and keep only the version line.

2. GZipMiddleware compressed everything over 1024 bytes, including the
   parquet FileResponses served by /api/data/{tid}/download,
   /cli/wheel/{name}, and /cli/download. Parquet is already columnar-
   compressed — gzip there is pure CPU + latency with no size win, and
   /api/data bodies can reach hundreds of MB. Wrap GZipMiddleware in a
   small _SelectiveGZipMiddleware that skips those path prefixes and
   delegates the rest to the stock middleware. JSON / HTML endpoints
   (manifest, /install, /api/version, …) still get compressed.

* release: bump to 2.1.0 — unify AGNES_VERSION with pyproject.toml version (#42)

Before: two independent version systems. pyproject.toml carried semver
(2.0.0 → wheel filename → `da --version`) while release.yml injected
CalVer into AGNES_VERSION (e.g. 2026.04.155 → /api/version). Users saw
different strings in the CLI vs. the /install page, and the CLI auto-
update check couldn't tell "new deploy, same package version" apart
from "new package version".

Make pyproject.toml [project].version the single product-version source
of truth. release.yml extracts it and feeds AGNES_VERSION, so every
surface (/api/version, /api/health, /cli/latest, `da --version`) agrees
on one number. The CalVer tag keeps doing what CalVer is for: release
identity on the git tag and Docker image tag (versioned_tag).

Also wires AGNES_TAG through the build: release.yml → Dockerfile ARG →
env, so /api/version.image_tag finally reports the actual image tag
instead of the "unknown" fallback.

Bump to 2.1.0 to reflect the PRs shipped on ps/wheel-name-fix: durable
sync (atomic writes + manifest MD5 + retry), server GZip, CLI auto-
update probe, setup snippet PEP 427 URL.

* fix(pr-review): directional version compare in is_outdated()

UpdateInfo.is_outdated() used `self.latest != self.installed`, which
fires in both directions. If the server is rolled back or the user
connects to an older deployment, the CLI would warn "out of date"
and — worse — the formatted notice would prompt

    uv tool install --force <older-version>.whl

i.e. an unintended downgrade.

Compare with packaging.version.Version (PEP 440 aware, handles pre-
release tags). Fall back to dotted-int tuple compare if packaging is
somehow missing, and return False on unparseable strings — better to
miss an upgrade hint than to silently suggest a downgrade.

Adds 4 test cases: installed older (True), installed newer (False),
10.0.0 vs 2.1.0 lexical-compare trap (correct), unparseable strings
(False).

Addresses Devin review on #43.

* fix(pr-review): read FastAPI app version from package metadata

app/main.py:80 hardcoded `version="2.0.0"` in the FastAPI constructor.
After #42 bumped pyproject.toml to 2.1.0, /api/version, /cli/latest,
and `da --version` all reported 2.1.0 while /openapi.json and the
/docs UI still advertised 2.0.0.

Read `agnes-the-ai-analyst` version via importlib.metadata (same
pattern cli/main.py:_cli_version already uses), with a `"dev"`
fallback when the package is not installed (source checkout). This
way pyproject.toml stays the single source of truth across every
version surface — /openapi.json now tracks the bump automatically.

Adds a dedicated test file to pin this behavior so a future
regression to a hardcoded literal fails at CI.

Addresses second Devin finding on #43.

* fix(pr-review): _fmt_bytes PiB label + negative cache in update_check

Two more follow-ups from Devin review on #43.

1. _fmt_bytes off-by-unit. The old loop exited at TiB but the fallback
   labelled PiB, so 1 PiB rendered as "1024.0 PiB". Restructure: put
   every unit inside the loop (KiB through EiB) so the division count
   always matches the label. Covers up to 1 ZiB cleanly; anything
   beyond renders as "<big>.0 EiB" rather than crashing.

2. Negative cache for failed /cli/latest probes. On a corporate
   firewall / VPN that silently drops packets, the 3s HTTP timeout
   fired on *every* `da` invocation. Writing a `latest=None` cache
   entry with a 5-minute TTL caps that at one probe per 5min. Successful
   probes still use the 24h TTL. Reading logic branches on whether the
   cached `latest` is None.

Adds TestFmtBytes (2 cases: small/medium sizes and the PiB/EiB fallback
regression), plus two TestSync update-check cases covering negative-
cache reuse and TTL expiry.
2026-04-22 21:18:18 +02:00
ZdenekSrotyr
963db420fe
ci(release): push dev-<user-prefix>-latest alias for <user>/* branches (#31)
Adds a second tag to dev-channel image builds: when a branch is in the
form <prefix>/<whatever>, the image is also pushed as
ghcr.io/keboola/agnes-the-ai-analyst:dev-<prefix>-latest.

Enables per-developer dev VMs on GRPN (and elsewhere) to auto-deploy
without knowing the specific branch slug. Each VM pins its .env to
AGNES_TAG=dev-<prefix>-latest, and the auto-upgrade cron (5 min tick)
picks up the newly pushed image on the next run.

Common Git Flow prefixes are deliberately skipped so feature/*, fix/*,
hotfix/* etc. don't create noise tags. Matched list:
feature, fix, hotfix, bugfix, docs, chore, test, ci, ops, refactor,
perf, style, build.

Verified locally against several branch names:
  zs/my-feature     -> dev-zs-latest
  vr/foo            -> dev-vr-latest
  pc/bar-baz        -> dev-pc-latest
  feature/xyz       -> (skipped)
  fix/bug           -> (skipped)
  main              -> (no-op, stable channel)
  test-no-slash     -> (no-op, no slash)
2026-04-22 14:02:59 +02:00
ZdenekSrotyr
e2eb51f657
ci(release): build image for all branches, not just feature/** (#19)
* dryrun: intentional failing test (will be reverted)

* feat(auth): optional SEED_ADMIN_PASSWORD to pre-hash seed admin (dev helper)

Terraform gains enable_seed_password + seed_admin_password (sensitive) vars
on the customer-instance module; when enabled the password is piped via
startup-script into /opt/agnes/.env as SEED_ADMIN_PASSWORD. On first boot
app/main.py argon2-hashes it onto the seed user so the admin can log in
immediately without going through /auth/bootstrap. Never overwrites an
existing password_hash — safe against accidental reset on terraform apply.

* ci(release): build :dev-<slug> on any branch, not just feature/**

Before: only 'feature/**' branches triggered release.yml, so pushing
'zs/my-edit' or 'fix/bug' did not publish an image. dev_instances entry
pinning image_tag = 'dev-zs-my-edit' then crashed VM startup with
'image not found'.

Now: any branch push (except main, which produces :stable) publishes
:dev-<slug>. Slug strips a leading 'feature/' and replaces non-[a-z0-9-]
with '-', keeping existing feature/** behavior identical.

* Revert "dryrun: intentional failing test (will be reverted)"

This reverts commit cf9cc06a7884bb401ff29fc5cb6d8baf84dc3daa.
2026-04-21 21:33:57 +02:00
ZdenekSrotyr
1c7cc8aa29 fix(image): add AGNES_COMMIT_SHA build-arg to Dockerfile + release.yml
Completes the previous commit — bakes the full git SHA into the image ENV
at build time so the UI badge shows a real commit, not a sha256 digest
(which was the floating manifest digest and unhelpful for debugging).
2026-04-21 21:00:30 +02:00
ZdenekSrotyr
5188bd9127 ci: add per-branch image tag :dev-<slug> for branch-aware dev deploys
Extracts branch name from GITHUB_REF, slugifies it, and adds as extra tag
on feature branch builds. Main branch is unaffected (no branch_slug output).

Enables dev_instances tfvar with image_tag pinning specific feature branches.
2026-04-21 18:47:01 +02:00
ZdenekSrotyr
44b99f25ca fix: address Devin review round 5 — empty secret file, CI .env
- secrets.py: validate file content is non-empty before using it;
  regenerate if file exists but is empty/corrupted
- release.yml: touch .env before docker compose in smoke test
  (env_file: .env in docker-compose.yml requires the file to exist)

663 tests pass.
2026-04-10 14:55:31 +02:00
ZdenekSrotyr
40cca627be fix: address Devin review round 4 — bash arithmetic, CalVer max, docs
- smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid
  set -e abort when counter is 0 (bash returns exit 1 for ((0)))
- CalVer: use max(N) from existing tags instead of count, safe when
  tags are deleted (e.g. deprecated version cleanup)
- CLAUDE.md: update schema version from v2 to v3

663 tests pass.
2026-04-10 14:39:16 +02:00
ZdenekSrotyr
dc8a9275e6 fix: address Devin review round 3 — retry exhaustion, discover path, WAL snapshot
- CalVer retry loop now exits with error if all 5 attempts fail
  (prevents pushing Docker image with unclaimed version tag)
- discover_tables endpoint reads data_source.keboola.url (consistent
  with configure_instance and _discover_and_register_tables)
- Pre-migration snapshot flushes WAL via CHECKPOINT before copying
  and copies .wal file if it still exists after flush

663 tests pass.
2026-04-10 14:11:17 +02:00
ZdenekSrotyr
c79d85f87c fix: config path mismatch + CalVer race condition (Devin review round 2)
- _discover_and_register_tables reads from data_source.keboola.url
  (matches what /api/admin/configure writes) instead of top-level
  keboola.url which doesn't exist
- CalVer: claim git tag BEFORE Docker build with retry loop (up to 5
  attempts). Prevents race where two concurrent CI runs get same N.
  Git tag acts as a distributed lock for version uniqueness.

663 tests pass.
2026-04-10 13:30:05 +02:00
ZdenekSrotyr
49f109bf73 fix: address PR review findings — config write, CalVer, error handling
- Config writes to DATA_DIR/state/instance.yaml (writable) instead of
  CONFIG_DIR (read-only :ro in Docker)
- instance_config.py checks DATA_DIR/state/ first, then falls back to
  CONFIG_DIR for backward compat
- CalVer counter is now global across channels (*-YYYY.MM.*) per spec
- Keboola error messages sanitized — log full error, return generic msg
- chmod in secrets.py wrapped in try/except for Windows compat
- Setup wizard JS handles 401 (expired JWT) with user-facing message
- deploy.yml changed to workflow_dispatch only (no duplicate test runs)
- Smoke test uses docker-compose.prod.yml + AGNES_TAG instead of sed
- docker-compose.prod.yml uses ${AGNES_TAG:-stable} env var

663 tests pass. 8 E2E verification tests pass.
2026-04-10 13:16:40 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00