42 commits

Author SHA1 Message Date
ZdenekSrotyr
9f5adbce37
ci: consolidate release pipeline (salvageable subset of #139) (#314)
* ci: add actionlint workflow lint, drop superseded deploy.yml stub

* ci: extract rollback into reusable rollback.yml, wire into release smoke-test

* ci: add weekly prune-dev-tags workflow for legacy CalVer tag/image cleanup

* release: 0.54.17 — CI/release workflow consolidation

* fix(ci): warn when rollback.yml receives a non-stable failed_image_tag

* fix(ci): rollback.yml + prune-dev-tags.sh review findings

rollback.yml:
- Pass workflow_dispatch inputs (failed_image_tag, target_image_tag)
  through env: instead of textual ${{ }} splicing into bash run blocks
  — prevents an actor with workflow_dispatch privilege from injecting
  shell via quote/backtick payloads.
- Guard against TARGET == FAILED when only one stable-* tag exists
  (fresh repo, or aggressive pruning at month boundary). Fail loudly
  rather than re-push the broken image as :stable.
- Add commit SHA to the rollback tracking-issue body — github.sha is
  inherited across workflow_call, so on-call no longer has to navigate
  rollback run → caller-workflow breadcrumb → failing commit.

prune-dev-tags.sh:
- Replace 'printf … | head -20' preview pipeline with array slice
  ('"${TO_PRUNE[@]:0:20}"'). Under set -o pipefail, head closing
  the pipe early SIGPIPEs printf (exit 141) and aborts the script
  before any deletion runs — exactly the multi-month-backlog scenario
  the script targets.
- Refactor GHCR-pass: fetch versions JSON once before the loop, then
  build a tag→version-id map up-front. Closes two problems:
    1. O(N × pages) GHCR API calls collapse to one paginated listing
       — months of accumulated CalVer tags no longer risk tripping
       abuse detection.
    2. The new jq filter excludes any version that ALSO carries a
       floating alias (:stable, :dev, *-latest). GHCR DELETE-version
       drops the entire manifest, so pruning a CalVer tag that shares
       a manifest with :stable (e.g. after a rollback re-tag) would
       have vaporized :stable. Now it's skipped with a log line.
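
A minimal Python sketch of the tag→version-id map described above (the script itself does this with jq). It assumes the versions listing is a JSON array whose entries carry `id` and `metadata.container.tags`. The function names and the sample tag `dev-2026.03.101` are hypothetical.

```python
from typing import Dict, List

# Floating aliases named in the commit message: :stable, :dev, *-latest.
FLOATING_EXACT = {"stable", "dev"}


def is_floating(tag: str) -> bool:
    """True for aliases that move between images over time."""
    return tag in FLOATING_EXACT or tag.endswith("-latest")


def build_prunable_map(versions: List[dict]) -> Dict[str, int]:
    """Map each tag to its version id, skipping shared manifests.

    A version that ALSO carries a floating alias is left out entirely,
    because deleting the version would drop the alias with it.
    """
    tag_to_id: Dict[str, int] = {}
    for version in versions:
        tags = version.get("metadata", {}).get("container", {}).get("tags", [])
        if any(is_floating(t) for t in tags):
            continue  # manifest shared with :stable/:dev/*-latest, never prune
        for tag in tags:
            tag_to_id[tag] = version["id"]
    return tag_to_id


# Hypothetical listing, fetched once before the prune loop.
sample_versions = [
    {"id": 1, "metadata": {"container": {"tags": ["dev-2026.03.101"]}}},
    {"id": 2, "metadata": {"container": {"tags": ["stable-2026.04.505", "stable"]}}},
    {"id": 3, "metadata": {"container": {"tags": []}}},
]
prunable = build_prunable_map(sample_versions)
```

A tag missing from the map is then a skip with a log line, not a DELETE.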

lint-workflows.yml:
- Add an explicit shellcheck step. actionlint only walks
  .github/workflows/ and the shell embedded in their run: blocks, so
  freestanding scripts/ops/*.sh (which are in the workflow's path
  filter) were never actually validated despite triggering CI.

* fix(ci): shellcheck --severity=warning to skip pre-existing info findings

The new shellcheck step caught info-level findings (SC1091, SC2015) in
agnes-auto-upgrade.sh / agnes-tls-rotate.sh — pre-existing, not regressed
by this PR. Constrain shellcheck to warning+ severity (real bugs) so info
and style findings don't block CI; mirrors the actionlint step's
continue-on-error initial-rollout posture.

* fix(ci): second-pass review findings — concurrency, walk-back, failure propagation

rollback.yml:
- Add own concurrency block (group: rollback-<repo>-<failed_tag>,
  cancel-in-progress: false). The caller release.yml uses
  cancel-in-progress: true to avoid duplicate CalVer claims, but a
  second push to main mid-rollback would otherwise kill the workflow
  between the :stable recovery push and the :deprecated-* audit push,
  leaving :stable stuck on the broken image. A reusable workflow's own
  concurrency overrides the inherited one.
- Walk back through stable-* tags newest-first, skipping any whose
  :deprecated-<stripped> GHCR alias already exists (carries the mark of
  a prior failed rollback). The previous 'second-most-recent' heuristic
  could re-point :stable at a known-broken image on cascading failures.
- Reorder re-tag step: push :stable recovery FIRST, then the
  :deprecated-* audit tag. Defense in depth — even if the concurrency
  block somehow misfires, the worst case is missing audit metadata
  rather than production stuck on the broken image.
- Move GHCR login before resolve step so 'docker manifest inspect' can
  probe for :deprecated-* aliases during walk-back.
- Document the top-level permissions block's dual semantics
  (workflow_dispatch grants directly; workflow_call acts as a cap
  intersected with the caller's job-level permissions).
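
The walk-back target resolution above can be sketched in Python as follows. This is an illustration under assumptions, not the workflow's actual shell: `deprecated_exists` stands in for the 'docker manifest inspect' probe, and `<stripped>` is assumed to mean the tag without its `stable-` prefix.

```python
from typing import Callable, Iterable, Optional


def resolve_rollback_target(
    stable_tags_newest_first: Iterable[str],
    failed_tag: str,
    deprecated_exists: Callable[[str], bool],
) -> Optional[str]:
    """Return the newest stable-* tag that is safe to re-point :stable at.

    Skips the failed tag itself and any tag whose deprecated-<stripped>
    alias already exists (the mark of a prior failed rollback).
    Returns None when nothing is eligible, so the caller can fail loudly.
    """
    prefix = "stable-"
    for tag in stable_tags_newest_first:
        if tag == failed_tag:
            continue
        stripped = tag[len(prefix):] if tag.startswith(prefix) else tag
        if deprecated_exists("deprecated-" + stripped):
            continue
        return tag
    return None


# Hypothetical tag history, newest first.
tags = ["stable-2026.05.3", "stable-2026.05.2", "stable-2026.05.1"]
already_deprecated = {"deprecated-2026.05.2"}
target = resolve_rollback_target(tags, "stable-2026.05.3", already_deprecated.__contains__)
only_one = resolve_rollback_target(["stable-2026.05.3"], "stable-2026.05.3", lambda alias: False)
```

The `None` result covers the TARGET == FAILED case from the earlier review round: with a single stable-* tag there is nothing to roll back to.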

release.yml:
- Rewrite the 'issues: write' comment. Old wording ('default for jobs')
  was factually wrong — GITHUB_TOKEN's default for issues is never write
  — and read as 'this line just documents a default', so a future
  cleanup PR could delete it. The line is load-bearing: workflow_call
  permissions are bounded by the caller's GITHUB_TOKEN scope, and
  removing it would silently 403 rollback.yml's gh issue create step.

prune-dev-tags.sh:
- Drop the '|| echo "[]"' fallback on the GHCR versions fetch. The
  fallback turned every API failure (403 missing scope, 429 rate limit,
  transient 5xx) into a silent no-op with exit 0 — operators saw a
  green run while every TAG fell through to the same 'no eligible
  version' skip message used for legitimate manifest-collision skips.
- Reorder: fetch GHCR versions BEFORE any git-tag deletion. Git-tag
  delete is irrecoverable (next run rebuilds TO_PRUNE from 'git tag
  -l', so an orphan GHCR image is never enumerated again). Fetching
  first means an API failure aborts cleanly with no state change.
- Track PRUNE_FAILED flag. 'git push --delete' fallback is no longer
  unconditional — local 'git tag -d' is gated on successful remote
  push, so a refused remote delete (tag-protection rule, missing
  contents:write) leaves the local tag in place for retry. The flag
  propagates to a final 'exit 1' so the cron run turns red on any
  push or DELETE failure.

lint-workflows.yml:
- shellcheck step now uses 'find scripts/ops -type f -name "*.sh"' to
  match the workflow's recursive 'scripts/ops/**.sh' path filter. The
  previous bare 'scripts/ops/*.sh' glob only matched top-level files;
  a future script under a subdirectory would have triggered the
  workflow but never been linted.
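
The gap between the flat glob and the recursive path filter can be reproduced with Python's pathlib, used here as an analogy for the shell glob vs. find. The directory layout and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def shell_scripts(root: Path, recursive: bool) -> list:
    """List *.sh files under root, either top-level only or recursively."""
    matches = root.rglob("*.sh") if recursive else root.glob("*.sh")
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "top.sh").write_text("#!/bin/sh\n")
    (root / "nested" / "deep.sh").write_text("#!/bin/sh\n")

    flat = shell_scripts(root, recursive=False)  # what a bare *.sh glob sees
    deep = shell_scripts(root, recursive=True)   # what a recursive walk sees
```

The flat listing misses `nested/deep.sh`, which is the file that would have triggered the workflow without being linted.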

* docs(releasing): document rollback.yml, prune-dev-tags.yml, lint-workflows.yml

Reflects the new operational workflows landing in this release:
- Auto-rollback paragraph in release.yml description (smoke-test job +
  rollback-on-smoke-fail → rollback.yml)
- rollback.yml subsection — workflow_call + workflow_dispatch entry
  points, walk-back target resolution, immutability + concurrency
  guarantees, manual operator gh workflow run examples
- prune-dev-tags.yml subsection — weekly cron, KEEP_MONTHS retention
  semantics, floating-alias safety, dry_run preview, failure-propagation
  exit-non-zero behavior
- lint-workflows.yml CI quirk — actionlint (continue-on-error) +
  shellcheck (--severity=warning blocking) advisory checks

CLAUDE.md non-negotiable rules unchanged — still high-level and
correct (changelog discipline + release-cut belongs to the PR + run the
full test suite).
2026-05-15 14:06:59 +02:00
ZdenekSrotyr
a1c7849b3e
ci: shard test suite + drop duplicate test run (#311)
The `test` job in ci.yml becomes a 4-way `test-shard` matrix (pytest-split,
balanced by a committed .test_durations), aggregated into a single `test`
status check so branch protection is unchanged.

release.yml's duplicate full-suite `test` job is removed — it re-ran the
same ~10 min suite a second time on every push to main/feature branches.
release.yml is now image-build only; the advisory ruff/mypy steps move to
a lean `lint` job in ci.yml.

Net: ~10 min -> ~3 min wall-clock per push, and the suite runs once
instead of twice.
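
To illustrate what balancing shards by recorded durations means, here is a greedy longest-first partition in Python. This is not claimed to be pytest-split's own algorithm. It assumes the durations file maps test IDs to seconds, and the test names below are made up.

```python
import heapq
from typing import Dict, List


def split_into_shards(durations: Dict[str, float], shard_count: int) -> List[List[str]]:
    """Assign each test to the currently lightest shard, slowest tests first."""
    heap = [(0.0, index) for index in range(shard_count)]  # (total seconds, shard)
    shards: List[List[str]] = [[] for _ in range(shard_count)]
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, seconds in ordered:
        total, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return shards


# Hypothetical durations in seconds.
sample = {"test_a": 8.0, "test_b": 7.0, "test_c": 6.0, "test_d": 5.0, "test_e": 4.0}
two_shards = split_into_shards(sample, 2)
```

Wall-clock time is set by the heaviest shard, so a stale durations file makes the split less even but does not drop any test.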
2026-05-14 20:18:21 +00:00
ZdenekSrotyr
0407d194ba
ci: fix indentation in cli-wheel-clean-install Python heredoc (#273)
The cli-wheel-clean-install lane introduced in v0.53.4 (#272) failed on
its first main run with `IndentationError: unexpected indent`: YAML
`run: |` preserves the relative indent of the inline `python3 -c`
heredoc, so the Python interpreter saw `try:` at column 12 and refused
to parse.
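
The failure mode is easy to reproduce in isolation. The sketch below compiles a snippet that keeps a 12-column indent, then the same snippet flattened with textwrap.dedent. Note that dedent is only a stand-in here; the actual fix writes the script to a file via a left-aligned heredoc. The snippet body is a placeholder, not the real assertion script.

```python
import textwrap

# Same shape as an inline script that kept the YAML block's relative indent.
INDENTED_SNIPPET = (
    "            try:\n"
    "                value = 1\n"
    "            except Exception:\n"
    "                value = 0\n"
)


def compiles(source: str) -> bool:
    """True if Python can parse the source, False on an indentation error."""
    try:
        compile(source, "<smoke>", "exec")
        return True
    except IndentationError:
        return False


indented_ok = compiles(INDENTED_SNIPPET)
flattened_ok = compiles(textwrap.dedent(INDENTED_SNIPPET))
```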

Fix: write the assertion script to /tmp/smoke.py via a `cat <<'PY'`
heredoc (left-aligned content lands flat), mount it into the container,
and invoke the tool's venv python directly
(`$HOME/.local/share/uv/tools/agnes-the-ai-analyst/bin/python`).
Cleaner than the previous inline form and side-steps `uv tool run
--from <name>` doing a PyPI lookup that fails because we don't publish
there.

Verified locally with the same docker run as the CI step — prints
`OK: kbcstorage absent, urllib3 2.7.0`.
2026-05-12 17:32:28 +00:00
ZdenekSrotyr
103669dafd
fix(cli-install): move kbcstorage to [server] extra so wheel installs cleanly (P0 onboarding hotfix → 0.53.4) (#272)
* fix(cli-install): move kbcstorage to [server] extra so wheel installs cleanly

The 0.53.3 wheel served at /cli/wheel/ is unsatisfiable on a clean machine:
analyst runs `uv tool install <wheel-url>` per the published /setup
instructions and the resolver immediately fails with

    Because kbcstorage<=0.9.5 depends on urllib3<2.0.0 and
    agnes-the-ai-analyst==0.53.3 depends on kbcstorage>=0.9.0 and
    urllib3>=2.7.0, we can conclude that agnes-the-ai-analyst==0.53.3
    cannot be used.

The `[tool.uv] override-dependencies = ["urllib3>=2.7.0"]` in pyproject.toml
masked the conflict in workspace contexts (Dockerfile + dev install) but
does NOT propagate to the wheel — wheel METADATA is plain PEP 621
Requires-Dist, and a fresh resolver context (uv tool install <wheel-url>)
never sees the override. Every existing test passed because the dev venv
already has kbcstorage 0.9.5 + urllib3 2.7.0 coexisting under workspace
overrides; the break only surfaces on the next analyst's first install.

Fix: kbcstorage moved out of [project] dependencies into
[project.optional-dependencies].server, since it is server-side only
(connectors/keboola/client.py is the sole import site, called from admin
endpoints, server connectors, and integration tests — never from the CLI
install path). Server install picks it up via Dockerfile's
`uv pip install --system --no-cache ".[server]"`. CI installs `.[dev,server]`
so workspace tests still cover the kbcstorage path. Analyst CLI wheel
METADATA now lists `kbcstorage>=0.9.0; extra == 'server'` (gated) and
`uv tool install <wheel>` resolves cleanly.
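
One way to check the gating is to read Requires-Dist straight out of the wheel METADATA, which uses RFC 822-style headers. The sketch below uses only the standard library and an abbreviated, hand-written METADATA sample. The substring test for `extra ==` is a simplification, not a full environment-marker parser.

```python
from email.parser import Parser
from typing import List, Tuple

# Abbreviated sample; only the two requirement lines come from the commit text.
SAMPLE_METADATA = """\
Metadata-Version: 2.1
Name: agnes-the-ai-analyst
Version: 0.53.4
Requires-Dist: urllib3>=2.7.0
Requires-Dist: kbcstorage>=0.9.0; extra == 'server'
Provides-Extra: server
"""


def split_requirements(metadata_text: str) -> Tuple[List[str], List[str]]:
    """Return (always-installed, extra-gated) Requires-Dist entries."""
    message = Parser().parsestr(metadata_text)
    base: List[str] = []
    gated: List[str] = []
    for requirement in message.get_all("Requires-Dist") or []:
        (gated if "extra ==" in requirement else base).append(requirement)
    return base, gated


base_requirements, gated_requirements = split_requirements(SAMPLE_METADATA)
```

A plain `uv tool install <wheel>` only has to satisfy the first list, which is why moving kbcstorage behind the extra removes the urllib3 conflict from the CLI install path.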

Verified end-to-end:
- Built wheel locally; inspected METADATA — kbcstorage line is now `; extra == 'server'`.
- `docker run --rm python:3.13-slim` + `uv tool install <wheel>`: agnes 0.53.4 installs, `agnes --version` works, `agnes catalog --help` renders, kbcstorage absent from CLI venv, urllib3 = 2.7.0.
- Same container with `.[server]` install path: kbcstorage present, urllib3 = 2.7.0 (override applies in workspace context).
- Full pytest suite green locally (4157 passed, 25 skipped).

* release: 0.53.4 — analyst CLI install hotfix (urllib3/kbcstorage resolver conflict)

Patch bump shipping the [server] extra split + new clean-install CI lane.
No DB migration; no API change; no operator-facing config change.
Operator side (Dockerfile path) auto-picks `.[server]` so the production
image gains kbcstorage transparently. Analyst onboarding (uv tool install
<wheel>) starts working again.
2026-05-12 17:09:44 +00:00
minasarustamyan
9de679c714
System plugins (schema v39) + marketplace UX polish + drop legacy pages (#241)
* System plugin tier with mark/unmark fanout (schema v39)

Adds a mandatory plugin tier so admins can pin a small set of curated
plugins into every user's stack from day one. Marking a plugin via the
new toggle on /admin/marketplaces materializes resource_grants for every
group and user_plugin_optouts subscriptions for every user, so the
existing resolver pulls the plugin into every served set without a new
filter layer. Hooks on user-create (Google OAuth, magic-link, admin
POST, scheduler) and group-create propagate the same materialization to
new principals. UI locks: /admin/access disables the checkbox with a
SYSTEM pill; /marketplace cards swap the "In stack" green pill for an
amber "Required" badge with shield icon; the plugin detail install
button reads "Required by your org"; /my-ai-stack toggle is disabled.
Bypass paths return 409 (DELETE /api/admin/grants for system grants,
PUT /api/my-stack/curated/.../{enabled:false}, DELETE
/api/marketplace/curated/.../install). Unmark only flips the flag —
materialized rows persist so admins curate cleanup at their leisure
through the now-unlocked /admin/access checkboxes.

* Marketplace UX polish + drop legacy /store and /my-ai-stack pages

Two-part cleanup post-v39:

(1) Page deletion. /store and /my-ai-stack were already replaced by
/marketplace?tab=flea and /marketplace?tab=my respectively, but the
standalone routes lingered. Hard delete in dev mode — no redirects,
stale bookmarks 404. The /store/new upload wizard, the flea
detail/edit pages, the admin queue, and all /api/store/* +
/api/my-stack endpoints (CLI consumers) stay. Internal hardcoded
hrefs in the upload wizard's Cancel button and the advanced-setup
page repointed to the marketplace tabs.

(2) Detail-page install button rework. The single button that morphed
between "+ Add to my stack" and "✓ In your stack" did not
communicate uninstall affordance. The installed state now renders an
inline white status label *before* a separate red-bordered
"✕ Remove from stack" button on the same row, both at identical
height to avoid layout shift. System plugins keep their locked amber
"✓ Required by your org" pill (no Remove button — API refuses 409).
The post-action hint panel now fires on remove too with the title
flipped to "✓ Removed from your stack" — Claude Code needs the same
/update-agnes-plugins refresh either way.

Also: /admin/marketplaces Details modal "Mark as system" toggle
redesigned. The button was near-invisible (matched neutral row
metadata). It's now a balanced amber-toned chip with shield icon
and a structured confirm modal replacing the native confirm() dialog
that summarizes fanout consequences before commit.

* Move stack-hint inside hero with glass-on-gradient styling

The post-action hint card ("✓ Added to your stack" with the
/update-agnes-plugins recipe) used to live below the hero in
panel-what (gray card on white page body). Clicking add/remove
inserted/removed it between the hero and content, shifting the
panels below — a noticeable scroll jump.

The hint is now anchored inside the hero's top-right corner alongside
the install/remove buttons, both as flex children of an absolutely
positioned .actions container. The card uses a translucent
white-on-glass treatment that adopts the hero's kind color (blue for
plugin, green for skill, purple for agent) without per-kind branching.
Hero is always tall enough (160px photo) to contain the action+hint
stack without overflow, so toggling the hint visibility doesn't grow
the hero or shift body content.

The hero-head grid reserves a third 300px column for the absolute
actions overlay so meta gets the proper 1fr free space instead of
being squeezed by a padding-right hack. Responsive breakpoint at
1100px reflows the actions stack below hero-head when the viewport
isn't wide enough to keep meta + actions side-by-side comfortably.

* Add optional -DataPath bind mount to run-local-dev.ps1

When the operator wants to inspect DuckDB files (system.duckdb, extracts,
marketplaces, store/, …) directly from Windows Explorer, the named volume
inside the Docker Desktop WSL VM isn't reachable. The new -DataPath param
generates a transient compose override that rebinds /data on app, scheduler,
extract (and Caddy's /srv:ro mirror) to a Windows host folder.

Fully additive — when -DataPath is omitted everything behaves exactly as
before: no override file is generated, $composeFiles array is unchanged,
finally cleanup is a no-op. Existing positional invocations
(.\run-local-dev.ps1 up | down | logs) keep binding to $Action because
$DataPath is a named-only parameter with no Position attribute.

The override is written via [System.IO.File]::WriteAllText so the YAML is
BOM-less across PS 5.1 / 7+ — Compose rejects BOM-prefixed YAML on Windows.
The override file is unique per PID and removed in the script's finally
block so concurrent invocations and crashes don't leak files.

* factor mark_system fanout into UserCuratedSubscriptionsRepository

The endpoint imported UserCuratedSubscriptionsRepository, ignored it
(noqa: F841), then duplicated the user-side fanout SQL inline. Adds
fanout_system_for_plugin() symmetric to the existing
fanout_system_for_user() and routes mark_plugin_system through it —
removes the dead import + 14 lines of inline SQL, returns the same
`affected_users` delta count, no behavior change.

* drop customer-specific path from .ps1 example

Per CLAUDE.md vendor-agnostic OSS rule: replaced
C:\\Business\\Groupon\\Agnes\\agnes-data with the generic
C:\\Users\\<you>\\agnes-data placeholder so the docstring
example reads cleanly on any reviewer's box.

* release: 0.48.0 + parallelize Release-workflow pytest

Cuts the release shipped via #228 #230 #231 #232 #233 #234 #236 #237 #238
#239 #240 plus this PR (#241). Major changes:

- System plugin tier (schema v39) — admins mark a plugin mandatory; fans
  out RBAC grants + subscriptions to every existing user/group plus
  hooks for new principals
- BREAKING: removed standalone /store + /my-ai-stack page routes
  (replaced by /marketplace?tab=flea + /marketplace?tab=my)
- Setup-prompt + bootstrap recovery fixes (#240)
- DuckDB CHECKPOINT-on-shutdown + 60s compose grace (#235)
- Marketplace + flea-market UX polish, agnes-metadata.json enrichment

Bonus: switch release.yml test step to `-n auto` (matches ci.yml).
Single-threaded was 15-20 min and frequently the bottleneck on PR
mergeability — now ~6 min.

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-10 19:15:41 +00:00
ZdenekSrotyr
b5178fe942
fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140)
Three CI fixes triggered by the failed PR #137 deploy:

1. scripts/smoke-test.sh: assertion 8 was hitting /api/admin/tables (renamed to /api/admin/registry long ago). The 404 was treated as a deployment regression and triggered the auto-rollback. The same stale URL is also fixed in CLAUDE.md, README.md, dev_docs/server.md.

2. .github/workflows/release.yml smoke-test job: added Log in to GHCR step. The auto-rollback's docker push :stable was failing with 'unauthenticated' because the smoke-test job had no GHCR login of its own — leaving :stable pointing at the broken image.

3. Rollback step gained GH_TOKEN env, AND the workflow's permissions block gained issues:write. Both were needed for gh issue create to actually create the alert issue (the failure was silently swallowed by the || echo fallback).

Manual cleanup outside this PR: :stable currently points at the broken PR #137 image — needs manual retag back to stable-2026.04.505.
2026-04-30 09:42:27 +02:00
dependabot[bot]
7012966482
chore(deps): bump actions/checkout from 5 to 6 (#125)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>
2026-04-29 09:58:48 +02:00
dependabot[bot]
8d0edbf1c1
chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#124)
Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 7 to 8.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](https://github.com/peter-evans/create-pull-request/compare/v7...v8)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 09:46:09 +02:00
dependabot[bot]
62a5b8540a
chore(deps): bump actions/upload-artifact from 4 to 7 (#123)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 09:38:38 +02:00
ZdenekSrotyr
61f6b8d2d5
feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)
Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-29 09:18:55 +02:00
ZdenekSrotyr
33b318e491
ci(release): build dev image on branch creation from main (#118)
Fixes the per-developer dev-VM workflow: paths-ignore on push skipped same-SHA branch creates, and build-and-push.if was main+dispatch-only. Adds a create: trigger filtered to branch refs, broadens build-and-push.if, and adds a concurrency group keyed on github.ref with cancel-in-progress to dedupe create+push collisions on new branches with code changes.
2026-04-29 08:15:30 +02:00
ZdenekSrotyr
5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() so the
   file descriptor no longer stays open until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.
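
A minimal sketch of the resolve-then-check approach from item 1, using only socket and ipaddress. The function names are hypothetical and this is not the project's actual validator. The resolver is injectable so the check can be exercised without real DNS, in the same spirit as the mocked tests.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def is_blocked_address(ip_text: str) -> bool:
    """True for addresses a server-side fetch should refuse to contact."""
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])  # drop any IPv6 zone id
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
    )


def resolve_all(hostname: str) -> List[str]:
    """Every address the hostname resolves to, IPv4 and IPv6."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def hostname_is_allowed(
    hostname: str,
    resolver: Callable[[str], Iterable[str]] = resolve_all,
) -> bool:
    """Allow only if resolution succeeds and no resolved address is blocked."""
    try:
        addresses = list(resolver(hostname))
    except socket.gaierror:
        return False
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)
```

Checking parsed addresses rather than matching the hostname string is what catches forms like `fe80::1` that a prefix regex such as `fe80:` lets through.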

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.
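
For item 1, a validator with the described properties might look like the sketch below. The pattern is an assumption written for illustration, not the project's _SAFE_QUOTED_IDENTIFIER. It allows letters, digits, underscores, dots and hyphens, and rejects `..` separately because a character class alone cannot.

```python
import re

# Hypothetical pattern: dots and hyphens allowed; slashes, backslashes,
# quotes, whitespace and control characters are not in the class.
HYPOTHETICAL_TABLE_ID = re.compile(r"[A-Za-z0-9_.\-]+")


def is_safe_table_id(table_id: str) -> bool:
    """Accept IDs like in.c-crm.orders, reject traversal and quoting chars."""
    if ".." in table_id:
        return False
    return HYPOTHETICAL_TABLE_ID.fullmatch(table_id) is not None
```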

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).
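
The compare-and-swap sequence can be sketched as below. sqlite3 stands in for DuckDB so the example runs with the standard library alone; the table layout, column names and function name are hypothetical, and the real code lives in the auth providers named above.

```python
import secrets
import sqlite3


def confirm_reset(conn: sqlite3.Connection, user_id: int, token: str, new_hash: str) -> bool:
    """Apply a password reset only for the caller that wins the token swap."""
    marker = "CONSUMED:" + secrets.token_hex(16)

    # Step 1: swap the live token for our unique marker, only if it still matches.
    conn.execute(
        "UPDATE users SET reset_token = ? WHERE id = ? AND reset_token = ?",
        (marker, user_id, token),
    )
    conn.commit()

    # Step 2: read back; only the winner sees its own marker.
    row = conn.execute(
        "SELECT reset_token FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None or row[0] != marker:
        return False

    # Step 3: winner clears the marker and applies the password change.
    conn.execute(
        "UPDATE users SET reset_token = NULL, password_hash = ? WHERE id = ?",
        (new_hash, user_id),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, reset_token TEXT, password_hash TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'tok-123', 'old-hash')")
conn.commit()

first = confirm_reset(conn, 1, "tok-123", "hash-a")
second = confirm_reset(conn, 1, "tok-123", "hash-b")  # replay of the same token
```

The demo replays the token sequentially rather than racing two threads, but the contract is the same one the regression test pins: exactly one caller applies a password.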

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00
ZdenekSrotyr
4e4d2a39e6
chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94)
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 5 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module that is
   the customer-facing API surface in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed, plus
one trivial wording slip (the grep was scoped to extensions that did not
include .template / .example suffixes; the audit was right, the previous
grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.
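
A minimal sketch of the final leak scan, written in Python rather than grep. It walks every file regardless of extension, since the earlier rounds missed files because of extension scoping. The function name `find_leaks`, the case-insensitive match, and the `.git` exclusion are assumptions, not details taken from the commit.

```python
import re
from pathlib import Path

# Token set copied from the commit message; case-insensitivity is an assumption.
LEAK_PATTERN = re.compile(
    r"groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|"
    r"kids-ai-data|padak/keboola_agent_cli",
    re.IGNORECASE,
)


def find_leaks(root, excluded_names=("CHANGELOG.md",)):
    """Return (relative_path, line_number, line) for every line matching a token."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or path.name in excluded_names or ".git" in rel.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file: nothing to scan
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEAK_PATTERN.search(line):
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits
```

An empty result corresponds to the "ZERO hits" outcome; extensionless files such as Caddyfile and dotfiles such as .env.template are covered because nothing filters on the suffix.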

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.
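
The redaction rule can be sketched as follows. This is an illustration, not the script used in the commit: the function name `redact_public_ips` is hypothetical, and reading "IAP ranges" as Google's IAP TCP-forwarding block 35.235.240.0/20 is an assumption.

```python
import ipaddress
import re

# Loose IPv4 candidate match; real validation is delegated to ipaddress.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Assumption: "IAP ranges" means Google's IAP TCP-forwarding block.
IAP_RANGE = ipaddress.ip_network("35.235.240.0/20")


def redact_public_ips(text, placeholder="<redacted-ip>"):
    """Replace public IPv4 addresses, keeping private, loopback and IAP ones."""

    def _replace(match):
        candidate = match.group(0)
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            return candidate  # not a valid address (e.g. 999.1.1.1)
        if addr.is_private or addr.is_loopback or addr in IAP_RANGE:
            return candidate
        return placeholder

    return IPV4_CANDIDATE.sub(_replace, text)
```

Unlike a pattern tied to one prefix such as 34.77.x.x, this catches any public address, which is the gap the round-1 replacement left open. Four-part version strings that happen to parse as public addresses would also be redacted, so the output still deserves a manual look.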
2026-04-27 20:24:34 +02:00
Petr Simecek
4799119c81
feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support (#52)
* feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support

Three coordinated changes that together unblock Keboola's internal Agnes
deployment from the foot-gun where the dev VM tracks `:dev` (= last push
from anyone in the upstream repo).

1. .github/workflows/keboola-deploy.yml — new workflow

   Triggered ONLY on `keboola-deploy-*` git tag pushes (not on every branch
   push like release.yml). Builds an image and publishes two GHCR tags:

     ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-<git-tag-suffix>
     ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-latest

   The Keboola dev VM pins to `keboola-deploy-latest`; an operator deploys
   by `git tag keboola-deploy-foo && git push origin keboola-deploy-foo`.
   Audit trail lives in git tags (immutable, who-tagged-what-when), no
   PR-cycle needed for each deploy.

   Doesn't touch the Vojta/Minas/David workflow: release.yml still builds
   `:dev-<slug>` for every branch push as before.

2. Caddyfile — parametrize TLS directive via $CADDY_TLS env var

   PR #51 hardcoded cert-file mode (`tls /certs/fullchain.pem ...`) for
   Groupon's corporate CA flow. That broke the Let's Encrypt path the
   module previously supported. Now:

     CADDY_TLS unset (default) → cert-file mode (Groupon corp PKI)
     CADDY_TLS="tls user@x.com"  → Let's Encrypt auto-issue
     CADDY_TLS="tls internal"     → Caddy-managed self-signed (lab/dev)

   Single Caddyfile, three regimes, no per-deployment fork. Validated with
   `caddy validate` in all three modes.

3. customer-instance module — dev_instances TLS + auto-set CADDY_TLS

   - variables.tf: dev_instances object schema gains optional tls_mode +
     domain (mirroring prod_instance). Defaults to "none" + "" so existing
     callers without those fields keep current behavior.
   - startup-script.sh.tpl: when tls_mode="caddy" and DOMAIN is set, write
     CADDY_TLS=tls <ACME_EMAIL> (or "tls internal" when ACME_EMAIL empty)
     into /opt/agnes/.env. Caddy then picks it up and the Caddyfile
     substitution flips the cert source.

   For an LE deploy: set tls_mode="caddy", domain="agnes-dev.example.com",
   ensure DNS A-record points at the VM, and acme_email is set on the
   module (or seed_admin_email is, since acme_email defaults to it).

After this lands, tag as infra-v1.6.0 so downstream infra repos can bump
their module ref without needing the upstream change tracking.

* feat(deploy): fetch optional Google OAuth credentials from Secret Manager

Mirrors the existing keboola-storage-token / agnes-<customer>-jwt-secret
pattern: VM SA reads google-oauth-client-{id,secret} secrets at boot
(if they exist + IAM is wired by caller via runtime_secrets) and writes
them into /opt/agnes/.env. Empty / missing / 403 → silent fallback
to "" so password and email auth keep working untouched.

Pairs with downstream change in agnes-infra-keboola which adds the two
secret names to runtime_secrets, granting the Keboola VM SA secretAccessor
on them. Operator pre-creates the SM containers via gcloud secrets create
google-oauth-client-{id,secret} (one-time, out of band) — values stay
in SM forever; rotation = `gcloud secrets versions add`.

This unblocks the Keboola agnes-dev deploy from PR #3 (infra) — without
GOOGLE_CLIENT_{ID,SECRET} in .env, app/auth/providers/google.is_available()
returns False and the Google sign-in button never even appears.
2026-04-25 23:19:00 +02:00
Petr Simecek
1bbbe58ea0
release(2.1.0): durable sync, CLI auto-update, versioned wheel URL, version unification (#43)
* fix(cli): versioned wheel URL in setup instructions; drop broken /cli/agnes.whl alias (#36)

* fix(cli): inline PEP 427 wheel filename in setup instructions

`uv tool install <server>/cli/agnes.whl` fails with

    error: The wheel filename "agnes.whl" is invalid: Must have a version

because uv validates the filename in the URL path *before* fetching — so
the server-side Content-Disposition header (which has the real versioned
filename) is never consulted, and an HTTP redirect does not help either:
uv resolves the filename from the initial URL.

Fix the root cause by inlining the real PEP 427 filename into the setup
snippet the dashboard copies to the clipboard. The wheel filename is
resolved server-side via `_find_wheel()` and substituted into the lines
returned from `setup_instructions.resolve_lines()`, so both the read-only
HTML preview and the JS clipboard renderer get byte-identical output.

Also added `/cli/wheel/{filename}` to serve wheels at their PEP 427 path,
and kept `/cli/agnes.whl` as a 302 redirect for manual/legacy callers —
though that redirect alone is NOT sufficient for `uv tool install` (uv
validates before following redirects) and is there only as defense-in-depth.

Verified locally:
- `uv tool install <server>/cli/wheel/agnes_the_ai_analyst-2.0.0-py3-none-any.whl` succeeds
- `/install` HTML now renders the versioned URL; `/cli/agnes.whl` no longer appears in the rendered snippet
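
For illustration, a small sketch of why `agnes.whl` is rejected while the versioned name is accepted. It applies the PEP 427 layout (`name-version[-build]-python-abi-platform.whl`) by hand; it is not uv's validator, and the helper names `parse_wheel_filename` and `wheel_url_path` are hypothetical.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components or raise ValueError."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build_tag = None
    elif len(parts) == 6:
        name, version, build_tag, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"{filename!r} lacks the PEP 427 name-version-tags layout")
    return {
        "name": name,
        "version": version,
        "build": build_tag,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def wheel_url_path(filename):
    """Build the /cli/wheel/{filename} path, refusing invalid filenames."""
    parse_wheel_filename(filename)
    return f"/cli/wheel/{filename}"
```

`wheel_url_path("agnes_the_ai_analyst-2.0.0-py3-none-any.whl")` yields the path shown in the verification list, while `wheel_url_path("agnes.whl")` raises because the name carries no version segment.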

* fix(cli): remove /cli/agnes.whl alias entirely — it only confused users

The bareword alias was never actually usable:

- `uv tool install <server>/cli/agnes.whl` fails at filename validation
  before any HTTP fetch, so neither the Content-Disposition header nor a
  302 redirect rescued it.
- The 302-to-versioned-path fallback left a visibly "working" URL in
  browser / curl -L contexts, which is exactly how the original bug got
  reported in the first place ("the URL loads, why doesn't install work?").

Remove the endpoint and scrub all remaining references. The only CLI wheel
URL is now `/cli/wheel/{filename}` with the real PEP 427 filename, which
the setup-instructions template already generates server-side.

Existing tests that referenced /cli/agnes.whl become negative tests
("must not appear") so we don't regress.

* feat(cli): --version flag; sync --dry-run + progress indicator (#38)

* feat(cli): add --version / -V flag

Prints `da <version>` from package metadata (importlib.metadata). Falls
back to "unknown" when the package is not installed (e.g. running from a
source checkout without `uv pip install -e .`), instead of crashing.

Eager typer callback, so `da --version` exits before subcommand
resolution and does not require any auth/config.
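
The metadata lookup with its fallback can be sketched with the standard library alone. The typer wiring is omitted, the function name `cli_version` is hypothetical, and the distribution name is taken from a later commit message in this log.

```python
from importlib.metadata import PackageNotFoundError, version


def cli_version(dist_name="agnes-the-ai-analyst"):
    """Return the installed distribution version, or "unknown" if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Source checkout without an editable install: report, do not crash.
        return "unknown"


if __name__ == "__main__":
    print(f"da {cli_version()}")
```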

* feat(cli): da sync --dry-run + X/N progress indicator

--dry-run reports what would be downloaded/uploaded without hitting the
API or writing local state. Supports the full flag set (--table, --json,
--upload-only); JSON shape is {"dry_run": true, "would_download": [...],
"summary": {...}}.

Progress bar now shows "[X/N] Downloading <table>..." with a Rich
BarColumn + TaskProgressColumn + TimeElapsedColumn instead of a bare
spinner — makes long syncs visible.

* feat(cli): durable sync + server gzip + auto-update check (#41)

* fix(sync): atomic writes + manifest hash verification + retry on transient errors

Three durability hooks around stream_download and the sync command:

1. Atomic writes. stream_download now streams into `<target>.tmp` and
   calls os.replace() on success, so the real target file never exists
   in a half-written state. On failure the tmp is unlinked — no cleanup
   leftovers, no guard needed at read time.

2. Retry with backoff. Transient errors (ConnectError, ReadError,
   WriteError, RemoteProtocolError, TimeoutException, 5xx) are retried
   up to 3× with 0.3s / 1s / 3s backoff. 4xx (auth, 404) surfaces
   immediately — retrying those is pointless.

3. Manifest-hash verification. After download, sync.py computes MD5 of
   the target (same 8KiB chunking as app/api/sync.py:_file_hash) and
   compares against `server_tables[tid]["hash"]`. Mismatch ⇒ unlink,
   record error, skip state commit. The PAR1 structural check survives
   as a fallback for legacy manifests without a hash.
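
The three hooks above can be sketched without httpx. This is an approximation under stated assumptions: the helper names are hypothetical, the transient-error set is narrowed to built-in exceptions, and the chunk source is any iterable of bytes.

```python
import hashlib
import os
import time
from pathlib import Path

CHUNK_SIZE = 8 * 1024  # 8 KiB chunks, as stated in the commit message


def atomic_write(target, chunks):
    """Stream chunks into <target>.tmp, then os.replace() onto the target."""
    target = Path(target)
    tmp = target.with_name(target.name + ".tmp")
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
        os.replace(tmp, target)
    except BaseException:
        tmp.unlink(missing_ok=True)  # leave no half-written leftovers
        raise


def file_md5(path):
    """MD5 of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_or_discard(path, expected_hash):
    """Keep the file if its hash matches the manifest, otherwise unlink it."""
    if file_md5(path) == expected_hash:
        return True
    Path(path).unlink(missing_ok=True)
    return False


def with_retry(fn, delays=(0.3, 1.0, 3.0),
               transient=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fn; on a transient error wait and retry, at most len(delays) times."""
    for delay in delays:
        try:
            return fn()
        except transient:
            sleep(delay)
    return fn()  # last attempt: any error propagates to the caller
```

Non-transient errors are not listed in `transient`, so they surface on the first attempt, which mirrors the "4xx surfaces immediately" rule.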

Also makes _rebuild_duckdb_views tolerant: single broken parquet is
skipped with a stderr warning instead of killing the whole rebuild.

Supersedes #40: this commit is a strict superset (hash check + PAR1
fallback + atomic write + retry). #40 can be closed without merging.

* perf(server): enable GZipMiddleware for JSON / HTML responses

GZipMiddleware at minimum_size=1024 shaves bandwidth on manifest-style
JSON endpoints (/api/sync/manifest, /api/version, …) and the /install
HTML preview. Parquet file downloads are already columnar-compressed so
the middleware sees limited benefit there — but it doesn't hurt, httpx
on the client side decompresses transparently.

Placed after session middleware so gzip wraps the session-Set-Cookie
response too, and before CORSMiddleware so compression is applied to
both cross-origin and same-origin responses.

* feat(cli): auto-check for newer CLI version on startup

Server side
- GET /cli/latest returns {version, wheel_filename, download_url_path}
  for whatever wheel is currently in AGNES_CLI_DIST_DIR. Public,
  cacheable, no secrets — consumed by the CLI auto-update probe.

Client side
- New cli/update_check.py: reads /cli/latest with a 3s timeout, caches
  the result in $DA_CONFIG_DIR/update_check.json for 24h. Cache is
  invalidated when the installed version changes (e.g. after a fresh
  `uv tool install`) so stale "you're behind" warnings don't linger.
- Root typer callback fires the probe before subcommand dispatch; any
  failure is swallowed so a bad network never blocks a working command.
- Outdated → one-line stderr warning:
    [update] da 2.0.0 is out of date — latest on this server is 2.1.0.
    Upgrade: uv tool install --force <server>/cli/wheel/<…>.whl
- Disable with DA_NO_UPDATE_CHECK=1.

* fix(pr-review): None-guard the upgrade line + skip gzip on parquet paths

Two follow-ups from Devin review on #41.

1. format_outdated_notice(UpdateInfo(download_url=None)) emitted literal
   "uv tool install --force None" — copy-pasting that fails. Drop the
   upgrade snippet when the URL is absent and keep only the version line.

2. GZipMiddleware compressed everything over 1024 bytes, including the
   parquet FileResponses served by /api/data/{tid}/download,
   /cli/wheel/{name}, and /cli/download. Parquet is already columnar-
   compressed — gzip there is pure CPU + latency with no size win, and
   /api/data bodies can reach hundreds of MB. Wrap GZipMiddleware in a
   small _SelectiveGZipMiddleware that skips those path prefixes and
   delegates the rest to the stock middleware. JSON / HTML endpoints
   (manifest, /install, /api/version, …) still get compressed.
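
A dependency-free sketch of the selective wrapper idea, written as plain ASGI. The class name `SelectiveCompression` and the exact prefix tuple are assumptions; in the real app the compressing branch would presumably be Starlette's `GZipMiddleware(app, minimum_size=1024)`.

```python
# Assumed prefixes, derived from the three endpoints named in the commit message.
SKIP_PREFIXES = ("/api/data/", "/cli/wheel/", "/cli/download")


class SelectiveCompression:
    """Route requests to the raw app or to a compressing wrapper by path prefix."""

    def __init__(self, app, compressing_app, skip_prefixes=SKIP_PREFIXES):
        self.app = app                          # serves responses untouched
        self.compressing_app = compressing_app  # e.g. GZipMiddleware(app, minimum_size=1024)
        self.skip_prefixes = tuple(skip_prefixes)

    async def __call__(self, scope, receive, send):
        is_http = scope["type"] == "http"
        if is_http and scope["path"].startswith(self.skip_prefixes):
            # Already-compressed payloads: gzip would only burn CPU.
            await self.app(scope, receive, send)
        else:
            await self.compressing_app(scope, receive, send)
```

Keeping the decision in a thin wrapper means the stock middleware stays unmodified and only the routing rule needs a test.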

* release: bump to 2.1.0 — unify AGNES_VERSION with pyproject.toml version (#42)

Before: two independent version systems. pyproject.toml carried semver
(2.0.0 → wheel filename → `da --version`) while release.yml injected
CalVer into AGNES_VERSION (e.g. 2026.04.155 → /api/version). Users saw
different strings in the CLI vs. the /install page, and the CLI auto-
update check couldn't tell "new deploy, same package version" apart
from "new package version".

Make pyproject.toml [project].version the single product-version source
of truth. release.yml extracts it and feeds AGNES_VERSION, so every
surface (/api/version, /api/health, /cli/latest, `da --version`) agrees
on one number. The CalVer tag keeps doing what CalVer is for: release
identity on the git tag and Docker image tag (versioned_tag).

Also wires AGNES_TAG through the build: release.yml → Dockerfile ARG →
env, so /api/version.image_tag finally reports the actual image tag
instead of the "unknown" fallback.

Bump to 2.1.0 to reflect the PRs shipped on ps/wheel-name-fix: durable
sync (atomic writes + manifest MD5 + retry), server GZip, CLI auto-
update probe, setup snippet PEP 427 URL.

* fix(pr-review): directional version compare in is_outdated()

UpdateInfo.is_outdated() used `self.latest != self.installed`, which
fires in both directions. If the server is rolled back or the user
connects to an older deployment, the CLI would warn "out of date"
and — worse — the formatted notice would prompt

    uv tool install --force <older-version>.whl

i.e. an unintended downgrade.

Compare with packaging.version.Version (PEP 440 aware, handles pre-
release tags). Fall back to dotted-int tuple compare if packaging is
somehow missing, and return False on unparseable strings — better to
miss an upgrade hint than to silently suggest a downgrade.

Adds 4 test cases: installed older (True), installed newer (False),
10.0.0 vs 2.1.0 lexical-compare trap (correct), unparseable strings
(False).
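
A sketch of the directional comparison, assuming a free function instead of the `UpdateInfo` method. It prefers `packaging.version.Version`, falls back to a dotted-int tuple when packaging is unavailable, and treats unparseable input as "not outdated".

```python
def _parse(raw):
    """Parse a version string; raise ValueError when it cannot be parsed."""
    try:
        from packaging.version import InvalidVersion, Version
    except ImportError:
        # Fallback: plain dotted integers, e.g. "2.1.0" -> (2, 1, 0).
        return tuple(int(part) for part in raw.split("."))
    try:
        return Version(raw)
    except InvalidVersion as exc:
        raise ValueError(str(exc)) from exc


def is_outdated(installed, latest):
    """True only when latest is strictly newer than installed."""
    try:
        return _parse(latest) > _parse(installed)
    except ValueError:
        return False  # better to miss a hint than to suggest a downgrade
```

With this shape, `is_outdated("2.1.0", "10.0.0")` is true even though `"10.0.0" < "2.1.0"` as strings, and a rolled-back server (`latest` older than `installed`) produces no warning.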

Addresses Devin review on #43.

* fix(pr-review): read FastAPI app version from package metadata

app/main.py:80 hardcoded `version="2.0.0"` in the FastAPI constructor.
After #42 bumped pyproject.toml to 2.1.0, /api/version, /cli/latest,
and `da --version` all reported 2.1.0 while /openapi.json and the
/docs UI still advertised 2.0.0.

Read `agnes-the-ai-analyst` version via importlib.metadata (same
pattern cli/main.py:_cli_version already uses), with a `"dev"`
fallback when the package is not installed (source checkout). This
way pyproject.toml stays the single source of truth across every
version surface — /openapi.json now tracks the bump automatically.

Adds a dedicated test file to pin this behavior so a future
regression to a hardcoded literal fails at CI.

Addresses second Devin finding on #43.

* fix(pr-review): _fmt_bytes PiB label + negative cache in update_check

Two more follow-ups from Devin review on #43.

1. _fmt_bytes off-by-unit. The old loop exited at TiB but the fallback
   labelled PiB, so 1 PiB rendered as "1024.0 PiB". Restructure: put
   every unit inside the loop (KiB through EiB) so the division count
   always matches the label. Covers up to 1 ZiB cleanly; anything
   beyond renders as "<big>.0 EiB" rather than crashing.
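
   One way to structure such a loop, as a sketch rather than the actual `_fmt_bytes`: the one-decimal format and the plain "B" label for small values are assumptions.

```python
def fmt_bytes(n):
    """Format a byte count with binary units; the label always matches the divisions."""
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB")
    value = float(n)
    for unit in units[:-1]:
        if abs(value) < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    # Largest unit: values of 1024 EiB and above keep the EiB label.
    return f"{value:.1f} {units[-1]}"
```

   Because each division is paired with exactly one step through `units`, 1 PiB comes out as "1.0 PiB" instead of "1024.0 PiB".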

2. Negative cache for failed /cli/latest probes. On a corporate
   firewall / VPN that silently drops packets, the 3s HTTP timeout
   fired on *every* `da` invocation. Writing a `latest=None` cache
   entry with a 5-minute TTL caps that at one probe per 5min. Successful
   probes still use the 24h TTL. Reading logic branches on whether the
   cached `latest` is None.
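
   The two-TTL read logic might look like the sketch below. The cache field names (`installed`, `latest`, `checked_at`) and the function name are hypothetical; only the 24h and 5-minute TTLs and the invalidation on version change come from the commit messages.

```python
import time

SUCCESS_TTL = 24 * 60 * 60  # successful probe: 24 hours
FAILURE_TTL = 5 * 60        # failed probe (latest is None): 5 minutes


def cache_is_fresh(entry, installed_version, now=None):
    """Decide whether a cached probe result may be reused."""
    now = time.time() if now is None else now
    if entry.get("installed") != installed_version:
        return False  # CLI was upgraded or downgraded since the probe
    ttl = FAILURE_TTL if entry.get("latest") is None else SUCCESS_TTL
    return now - entry["checked_at"] < ttl
```

   A fresh negative entry means "skip the probe and stay silent", which is what caps the 3s timeout at one occurrence per five minutes.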

Adds TestFmtBytes (2 cases: small/medium sizes and the PiB/EiB fallback
regression), plus two TestSync update-check cases covering negative-
cache reuse and TTL expiry.
2026-04-22 21:18:18 +02:00
ZdenekSrotyr
963db420fe
ci(release): push dev-<user-prefix>-latest alias for <user>/* branches (#31)
Adds a second tag to dev-channel image builds: when a branch is in the
form <prefix>/<whatever>, the image is also pushed as
ghcr.io/keboola/agnes-the-ai-analyst:dev-<prefix>-latest.

Enables per-developer dev VMs on GRPN (and elsewhere) to auto-deploy
without knowing the specific branch slug. Each VM pins its .env to
AGNES_TAG=dev-<prefix>-latest, and the auto-upgrade cron (5 min tick)
picks up the newly pushed image on the next run.

Common Git Flow prefixes are deliberately skipped so feature/*, fix/*,
hotfix/* etc. don't create noise tags. Matched list:
feature, fix, hotfix, bugfix, docs, chore, test, ci, ops, refactor,
perf, style, build.

Verified locally against several branch names:
  zs/my-feature     -> dev-zs-latest
  vr/foo            -> dev-vr-latest
  pc/bar-baz        -> dev-pc-latest
  feature/xyz       -> (skipped)
  fix/bug           -> (skipped)
  main              -> (no-op, stable channel)
  test-no-slash     -> (no-op, no slash)
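
The alias rule, restated as a Python sketch for illustration (the workflow itself presumably implements it in shell inside release.yml; the function name `dev_alias_tag` is hypothetical, and any sanitizing of unusual prefix characters is not covered here).

```python
# Git Flow style prefixes that must not produce an alias (list from the commit message).
SKIPPED_PREFIXES = frozenset({
    "feature", "fix", "hotfix", "bugfix", "docs", "chore", "test",
    "ci", "ops", "refactor", "perf", "style", "build",
})


def dev_alias_tag(branch):
    """Return dev-<prefix>-latest for <prefix>/<rest> branches, else None."""
    if branch == "main" or "/" not in branch:
        return None  # stable channel, or no prefix to extract
    prefix = branch.split("/", 1)[0]
    if prefix in SKIPPED_PREFIXES:
        return None
    return f"dev-{prefix}-latest"
```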
2026-04-22 14:02:59 +02:00
ZdenekSrotyr
4f381dc103
fix(ci): propagate-infra-tag fail-soft on branch push / missing secret (#24)
Job-level 'if: secrets.X != ""' did not prevent workflow from being
scheduled on branch pushes (GitHub reports failure with 0 jobs in that
case). Refactored: first step is a guard that checks both the tag ref
pattern and the secret presence; downstream steps skip when the guard
says no.

Result: workflow now reports success with a clear warning annotation on
branch pushes or when the secret is absent; only real infra-v* tag
pushes with the secret set perform the bump.
2026-04-21 21:59:10 +02:00
ZdenekSrotyr
e2eb51f657
ci(release): build image for all branches, not just feature/** (#19)
* dryrun: intentional failing test (will be reverted)

* feat(auth): optional SEED_ADMIN_PASSWORD to pre-hash seed admin (dev helper)

Terraform gains enable_seed_password + seed_admin_password (sensitive) vars
on the customer-instance module; when enabled the password is piped via
startup-script into /opt/agnes/.env as SEED_ADMIN_PASSWORD. On first boot
app/main.py argon2-hashes it onto the seed user so the admin can log in
immediately without going through /auth/bootstrap. Never overwrites an
existing password_hash — safe against accidental reset on terraform apply.

* ci(release): build :dev-<slug> on any branch, not just feature/**

Before: only 'feature/**' branches triggered release.yml, so pushing
'zs/my-edit' or 'fix/bug' did not publish an image. dev_instances entry
pinning image_tag = 'dev-zs-my-edit' then crashed VM startup with
'image not found'.

Now: any branch push (except main, which produces :stable) publishes
:dev-<slug>. Slug strips a leading 'feature/' and replaces non-[a-z0-9-]
with '-', keeping existing feature/** behavior identical.
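
The slug rule as described can be sketched like this. It follows the message literally, so uppercase characters are also replaced; whether the workflow lowercases first is not stated. The function name `dev_image_tag` is hypothetical.

```python
import re


def dev_image_tag(branch):
    """Return the dev-<slug> tag for a branch, or None for main (stable channel)."""
    if branch == "main":
        return None
    slug = branch[len("feature/"):] if branch.startswith("feature/") else branch
    slug = re.sub(r"[^a-z0-9-]", "-", slug)  # anything outside [a-z0-9-] becomes '-'
    return f"dev-{slug}"
```

Under this rule `zs/my-edit` maps to `dev-zs-my-edit`, the tag the dev_instances example pins, and `feature/xyz` still maps to `dev-xyz` as before.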

* Revert "dryrun: intentional failing test (will be reverted)"

This reverts commit cf9cc06a7884bb401ff29fc5cb6d8baf84dc3daa.
2026-04-21 21:33:57 +02:00
ZdenekSrotyr
2cbffce85f
ci: propagate infra-v* tags to template repo + auto-merge rules (#17)
* dryrun: verify per-branch GHCR tag

* ci: propagate infra-v* tag bumps to template repo

On push of any infra-v* tag, opens a PR in keboola/agnes-infra-template
that bumps the module ref in terraform/main.tf. Auto-merge rules in the
template (Renovate + CI validate + GitHub native auto-merge) land it
without manual work on patch/minor bumps.

Requires repo secret TEMPLATE_REPO_TOKEN (fine-grained PAT with
Contents:write + Pull requests:write on keboola/agnes-infra-template).

Fail-soft: if secret is missing the job is skipped and Renovate on the
template repo picks up the new tag on its next cycle as a fallback.

* docs(onboarding): 'Keeping the template up-to-date' maintainer section

Documents the two mechanisms (upstream release hook + Renovate), the
required repo settings (allow_auto_merge, validate.yml gate), the
TEMPLATE_REPO_TOKEN secret setup, and the one-time setup checklist. Notes the difference
between template repo (auto-merge on) and customer infra repos
(human approval).
2026-04-21 21:32:58 +02:00
ZdenekSrotyr
1c7cc8aa29 fix(image): add AGNES_COMMIT_SHA build-arg to Dockerfile + release.yml
Completes the previous commit — bakes the full git SHA into the image ENV
at build time so the UI badge shows a real commit, not a sha256 digest
(which was the floating manifest digest and unhelpful for debugging).
2026-04-21 21:00:30 +02:00
ZdenekSrotyr
5188bd9127 ci: add per-branch image tag :dev-<slug> for branch-aware dev deploys
Extracts branch name from GITHUB_REF, slugifies it, and adds as extra tag
on feature branch builds. Main branch is unaffected (no branch_slug output).

Enables dev_instances tfvar with image_tag pinning specific feature branches.
2026-04-21 18:47:01 +02:00
ZdenekSrotyr
5bbd82bacd fix: address Devin review — docker-e2e .env, jira webhook test isolation
- Create empty .env before docker compose up in CI (env_file: .env is required)
- Mock get_jira_service in webhook HMAC test to isolate signature check
  from Jira API availability — strict assert 200 instead of permissive 500
2026-04-13 14:36:31 +02:00
ZdenekSrotyr
5bfff6616c ci: add parallel test execution and nightly Docker E2E job 2026-04-12 14:15:46 +02:00
ZdenekSrotyr
44b99f25ca fix: address Devin review round 5 — empty secret file, CI .env
- secrets.py: validate file content is non-empty before using it;
  regenerate if file exists but is empty/corrupted
- release.yml: touch .env before docker compose in smoke test
  (env_file: .env in docker-compose.yml requires the file to exist)
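
The empty-file guard can be sketched as below. This is not the code in secrets.py: the function name, the token length, and the file mode are assumptions, and the chmod guard mirrors the Windows-compat note from an earlier commit in this log.

```python
import secrets
from pathlib import Path


def load_or_create_secret(path):
    """Return the persisted secret, regenerating it when missing or empty."""
    path = Path(path)
    if path.exists():
        value = path.read_text(encoding="utf-8").strip()
        if value:
            return value
        # File exists but is empty or whitespace only: fall through and regenerate.
    value = secrets.token_urlsafe(48)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(value, encoding="utf-8")
    try:
        path.chmod(0o600)
    except OSError:
        pass  # platforms where chmod is restricted
    return value
```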

663 tests pass.
2026-04-10 14:55:31 +02:00
ZdenekSrotyr
40cca627be fix: address Devin review round 4 — bash arithmetic, CalVer max, docs
- smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid
  set -e abort when counter is 0 (bash returns exit 1 for ((0)))
- CalVer: use max(N) from existing tags instead of count, safe when
  tags are deleted (e.g. deprecated version cleanup)
- CLAUDE.md: update schema version from v2 to v3
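
Why max(N) is safer than a count can be shown with a short sketch. The tag layout `<channel>-YYYY.MM.N` is inferred from the `*-YYYY.MM.*` pattern mentioned in an earlier commit, and the function name is hypothetical.

```python
import re


def next_calver_counter(tags, year_month):
    """Return max(N) + 1 over tags shaped <channel>-<year_month>.<N>."""
    pattern = re.compile(rf"^.+-{re.escape(year_month)}\.(\d+)$")
    counters = [int(m.group(1)) for tag in tags if (m := pattern.match(tag))]
    return max(counters, default=0) + 1


# Tag 2 was deleted: counting tags would give 3 and collide with dev-2026.04.3.
existing = ["stable-2026.04.1", "dev-2026.04.3"]
print(next_calver_counter(existing, "2026.04"))
```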

663 tests pass.
2026-04-10 14:39:16 +02:00
ZdenekSrotyr
dc8a9275e6 fix: address Devin review round 3 — retry exhaustion, discover path, WAL snapshot
- CalVer retry loop now exits with error if all 5 attempts fail
  (prevents pushing Docker image with unclaimed version tag)
- discover_tables endpoint reads data_source.keboola.url (consistent
  with configure_instance and _discover_and_register_tables)
- Pre-migration snapshot flushes WAL via CHECKPOINT before copying
  and copies .wal file if it still exists after flush

663 tests pass.
2026-04-10 14:11:17 +02:00
ZdenekSrotyr
c79d85f87c fix: config path mismatch + CalVer race condition (Devin review round 2)
- _discover_and_register_tables reads from data_source.keboola.url
  (matches what /api/admin/configure writes) instead of top-level
  keboola.url which doesn't exist
- CalVer: claim git tag BEFORE Docker build with retry loop (up to 5
  attempts). Prevents race where two concurrent CI runs get same N.
  Git tag acts as a distributed lock for version uniqueness.

663 tests pass.
2026-04-10 13:30:05 +02:00
ZdenekSrotyr
49f109bf73 fix: address PR review findings — config write, CalVer, error handling
- Config writes to DATA_DIR/state/instance.yaml (writable) instead of
  CONFIG_DIR (read-only :ro in Docker)
- instance_config.py checks DATA_DIR/state/ first, then falls back to
  CONFIG_DIR for backward compat
- CalVer counter is now global across channels (*-YYYY.MM.*) per spec
- Keboola error messages sanitized — log full error, return generic msg
- chmod in secrets.py wrapped in try/except for Windows compat
- Setup wizard JS handles 401 (expired JWT) with user-facing message
- deploy.yml changed to workflow_dispatch only (no duplicate test runs)
- Smoke test uses docker-compose.prod.yml + AGNES_TAG instead of sed
- docker-compose.prod.yml uses ${AGNES_TAG:-stable} env var

663 tests pass. 8 E2E verification tests pass.
2026-04-10 13:16:40 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00
ZdenekSrotyr
816f217d8e feat: add commit SHA tag to Docker image push for rollback capability 2026-04-09 16:38:38 +02:00
ZdenekSrotyr
22b4d830e5 chore: upgrade docker actions to Node.js 24 (login-action@v4, build-push-action@v7) 2026-04-09 14:22:11 +02:00
ZdenekSrotyr
6ebfc15010 fix: setup-uv@v7 (v8 major tag doesn't exist yet) 2026-04-09 14:19:32 +02:00
ZdenekSrotyr
1ebf50bd78 fix: upgrade setup-uv@v5 → v8 (Node.js 24 native), add uv.lock
- setup-uv@v8 runs on Node.js 24 natively — no more deprecation warnings
- Removed FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 workaround (no longer needed)
- Added uv.lock for reproducible dependency resolution
2026-04-09 14:16:55 +02:00
ZdenekSrotyr
554ba0d9f2 fix: remove Kamal deploy job (no server configured), force Node.js 24 in CI
- Removed deploy-production job — Kamal config has placeholder IPs, no secrets
- Renamed workflow to "Build & Push" — test + Docker image to GHCR
- Added FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true to suppress Node.js 20 warnings
2026-04-09 14:10:37 +02:00
ZdenekSrotyr
0279cc06fa refactor: consolidate deps into pyproject.toml, remove requirements.txt
- All dependencies now in pyproject.toml [project.dependencies]
- Dev/test deps in [project.optional-dependencies] dev and [tool.uv]
- Dockerfile uses uv pip install . from pyproject.toml
- CI uses uv pip install ".[dev]"
- Deleted requirements.txt and requirements-dev.txt
- Updated README, CLAUDE.md install instructions
- Enhanced .dockerignore (exclude tests, docs, infra from image)
2026-04-09 13:17:59 +02:00
ZdenekSrotyr
fa3aef652f chore: update GitHub Actions to Node.js 24 compatible versions (checkout@v5, setup-python@v6, setup-uv@v5) 2026-04-09 12:48:14 +02:00
ZdenekSrotyr
f9fae6e895 fix: CI installs requirements-dev.txt (faker needed for tests), set TESTING=1 2026-04-09 09:10:29 +02:00
ZdenekSrotyr
2635f77974 ci: add CI test suite + deploy pipeline
- ci.yml: runs 607 tests + Docker build on push/PR
- deploy.yml: tests → build → GHCR push → Kamal deploy on main
2026-04-08 18:24:05 +02:00
ZdenekSrotyr
cfa08c4b4c chore: remove obsolete CI workflows (deploy-guard, deploy.yml.example)
deploy-guard.yml referenced deleted tests and sudoers files.
deploy.yml.example used legacy SSH-based deployment.
Updated ci.yml and deploy.yml are in .gitignore (need workflow scope to push).
2026-04-08 18:16:48 +02:00
ZdenekSrotyr
a74f69d6b1 chore: exclude CI workflow from push (needs workflow scope) 2026-03-27 17:41:27 +01:00
ZdenekSrotyr
e0ce91ddb9 feat: add dataset permissions, script execution, Kamal config, CI/CD
- SyncSettingsRepository + DatasetPermissionRepository with RBAC
- Script deploy/run/undeploy API with import sandboxing
- User sync settings API with permission checks
- 4 CLI skills (connectors, security, notifications, corporate-memory)
- Kamal production + staging configs
- GitHub Actions CI + deploy workflows
- 91 total tests passing
2026-03-27 15:40:11 +01:00
Petr
c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00