* fix(security+ops): #82#85#87 — auth hardening, API validation, deploy posture
Security and operational hardening across three issue groups:
- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs
- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename
- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)
Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety
Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
warning log — prevents unhandled exception if DB write fails after
successful token consumption
- Add test for 169.254.169.254 SSRF rejection
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak
Address Devin Review findings on PR #104:
1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
ipaddress module checks. The old regex patterns like `fe80:` only
matched up to the first colon, missing real IPv6 addresses like
`fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
hostname via getaddrinfo and checks each resulting IP against
ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.
2. CLI commands broken: `da setup test-connection`, `da setup verify`,
`da diagnose`, `da status` all called /api/health expecting the old
format (status=="healthy", services dict). Now they call
/api/health/detailed for service-level checks (with graceful fallback
to the minimal endpoint when auth is not configured).
3. Temp file handle leak: _stream_to_temp returns an open
NamedTemporaryFile; callers now close it before shutil.move() to
prevent FD leaks until GC.
Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): download regex blocks hyphenated IDs; document health split
Address Devin Review round-3 findings on PR #104:
1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
endpoint used the strict SQL-identifier regex which does not allow
dots or hyphens, but Keboola table IDs like in.c-crm.orders
contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
and hyphens while still blocking path-traversal chars (/, .., \)
and quote/control characters. Added test for hyphenated/dotted IDs.
2. Documented health endpoint split in DEPLOYMENT.md: Added Health
checks & external monitoring section explaining both endpoints
(minimal unauth /api/health vs authenticated /api/health/detailed)
and how to wire external monitoring tools to the detailed endpoint
with a PAT.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening
* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)
Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.
Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.
New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).
---------
Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)
Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.
Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh
Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)
Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.
This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.
Refs #88.
* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)
PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.
Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
for infra/modules/customer-instance/, which is wrong (the TF module embeds
the script inline via heredoc, never sourced from scripts/grpn/). Now
flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
example → English description with `acme, example` placeholders.
Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
specific shared dev VM. Per-developer dev VMs are the supported pattern.
Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).
CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.
Refs review of #94, closes#88.
* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)
Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:
1. infra/modules/customer-instance/variables.tf had Czech descriptions on
8 more variables. Previous review only flagged line 19; this round
audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
78, 84 to English. Same review concern: a Terraform module that is
the customer-facing API surface in Czech is unfit for OSS distribution.
2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
four outputs. Same fix.
3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).
4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
Translated.
5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
but the actual code uses both MyAgent (in docstrings) and Example
(in test fixtures). Reworded to mention both targets.
Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.
157/157 OpenMetadata + DuckDB tests still pass.
* fix(oss): close#94 round-3 leaks (env.template, instance.yaml.example, padak typo)
Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):
1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
leak in a shipping config example. Replaced with '(optional)'.
2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
in operator-facing env-template comment. The script lives at
scripts/ops/ now (commit 16a85cc); this comment had been pointing
operators at a non-existent path.
3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
upstream' from a sloppy substitution in round-2. Trivial wording fix.
Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.
* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md
Devin Review caught two findings on the latest round-3 commit:
1. docs/QUICKSTART.md:67 still pointed users at the deleted
scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
would hit a missing-file error at the final step. Replaced with the
inline gcloud-ssh equivalent that the Removed bullet documents.
2. docs/padak-security.md filename retains the personal identifier
'padak'. The PR fixed the body content (replaced
padak/keboola_agent_cli#206 references with generic wording) but
missed the filename. Renamed to docs/security-audit-2026-04.md
(date-anchored, vendor-neutral). Updated the historical CHANGELOG
link to point at the new path with an inline note about the rename.
* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email
Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
personal-email default (zdenek.srotyr@keboola.com). Replaced with
':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
prod/dev IPs) that the round-1 IP-replace pattern missed (it only
targeted 34.77.x.x). Generic regex redaction across all five
planning docs replaces every public IP with <redacted-ip>,
preserving private/loopback/IAP ranges.
* feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support
Three coordinated changes that together unblock Keboola's internal Agnes
deployment from the foot-gun where the dev VM tracks `:dev` (= last push
from anyone in the upstream repo).
1. .github/workflows/keboola-deploy.yml — new workflow
Triggered ONLY on `keboola-deploy-*` git tag pushes (not on every branch
push like release.yml). Builds an image and publishes two GHCR tags:
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-<git-tag-suffix>
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-latest
The Keboola dev VM pins to `keboola-deploy-latest`; an operator deploys
by `git tag keboola-deploy-foo && git push origin keboola-deploy-foo`.
Audit trail lives in git tags (immutable, who-tagged-what-when), no
PR-cycle needed for each deploy.
Doesn't touch Vojta/Minas/David workflow — release.yml still builds
`:dev-<slug>` for every branch push as before.
2. Caddyfile — parametrize TLS directive via $CADDY_TLS env var
PR #51 hardcoded cert-file mode (`tls /certs/fullchain.pem ...`) for
Groupon's corporate CA flow. That broke the Let's Encrypt path the
module previously supported. Now:
CADDY_TLS unset (default) → cert-file mode (Groupon corp PKI)
CADDY_TLS="tls user@x.com" → Let's Encrypt auto-issue
CADDY_TLS="tls internal" → Caddy-managed self-signed (lab/dev)
Single Caddyfile, three regimes, no per-deployment fork. Validated with
`caddy validate` in all three modes.
3. customer-instance module — dev_instances TLS + auto-set CADDY_TLS
- variables.tf: dev_instances object schema gains optional tls_mode +
domain (mirroring prod_instance). Defaults to "none" + "" so existing
callers without those fields keep current behavior.
- startup-script.sh.tpl: when tls_mode="caddy" and DOMAIN is set, write
CADDY_TLS=tls <ACME_EMAIL> (or "tls internal" when ACME_EMAIL empty)
into /opt/agnes/.env. Caddy then picks it up and the Caddyfile
substitution flips the cert source.
For an LE deploy: set tls_mode="caddy", domain="agnes-dev.example.com",
ensure DNS A-record points at the VM, and acme_email is set on the
module (or seed_admin_email is, since acme_email defaults to it).
After this lands, tag as infra-v1.6.0 so downstream infra repos can bump
their module ref without needing the upstream change tracking.
* feat(deploy): fetch optional Google OAuth credentials from Secret Manager
Mirrors the existing keboola-storage-token / agnes-<customer>-jwt-secret
pattern: VM SA reads google-oauth-client-{id,secret} secrets at boot
(if they exist + IAM is wired by caller via runtime_secrets) and writes
them into /opt/agnes/.env. Empty / missing / 403 → silent fallback
to "" so password and email auth keep working untouched.
Pairs with downstream change in agnes-infra-keboola which adds the two
secret names to runtime_secrets, granting the Keboola VM SA secretAccessor
on them. Operator pre-creates the SM containers via gcloud secrets create
google-oauth-client-{id,secret} (one-time, out of band) — values stay
in SM forever; rotation = `gcloud secrets versions add`.
This unblocks the Keboola agnes-dev deploy from PR #3 (infra) — without
GOOGLE_CLIENT_{ID,SECRET} in .env, app/auth/providers/google.is_available()
returns False and the Google sign-in button never even appears.
* fix(cli): versioned wheel URL in setup instructions; drop broken /cli/agnes.whl alias (#36)
* fix(cli): inline PEP 427 wheel filename in setup instructions
`uv tool install <server>/cli/agnes.whl` fails with
error: The wheel filename "agnes.whl" is invalid: Must have a version
because uv validates the filename in the URL path *before* fetching — so
the server-side Content-Disposition header (which has the real versioned
filename) is never consulted, and an HTTP redirect does not help either:
uv resolves the filename from the initial URL.
Fix the root cause by inlining the real PEP 427 filename into the setup
snippet the dashboard copies to the clipboard. The wheel filename is
resolved server-side via `_find_wheel()` and substituted into the lines
returned from `setup_instructions.resolve_lines()`, so both the read-only
HTML preview and the JS clipboard renderer get byte-identical output.
Also added `/cli/wheel/{filename}` to serve wheels at their PEP 427 path,
and kept `/cli/agnes.whl` as a 302 redirect for manual/legacy callers —
though that redirect alone is NOT sufficient for `uv tool install` (uv
validates before following redirects) and is there only as defense-in-depth.
Verified locally:
- `uv tool install <server>/cli/wheel/agnes_the_ai_analyst-2.0.0-py3-none-any.whl` succeeds
- `/install` HTML now renders the versioned URL; `/cli/agnes.whl` no longer appears in the rendered snippet
* fix(cli): remove /cli/agnes.whl alias entirely — it only confused users
The bareword alias was never actually usable:
- `uv tool install <server>/cli/agnes.whl` fails at filename validation
before any HTTP fetch, so neither the Content-Disposition header nor a
302 redirect rescued it.
- The 302-to-versioned-path fallback left a visibly "working" URL in
browser / curl -L contexts, which is exactly how the original bug got
reported in the first place ("the URL loads, why doesn't install work?").
Remove the endpoint and scrub all remaining references. The only CLI wheel
URL is now `/cli/wheel/{filename}` with the real PEP 427 filename, which
the setup-instructions template already generates server-side.
Existing tests that referenced /cli/agnes.whl become negative tests
("must not appear") so we don't regress.
* feat(cli): --version flag; sync --dry-run + progress indicator (#38)
* feat(cli): add --version / -V flag
Prints `da <version>` from package metadata (importlib.metadata). Falls
back to "unknown" when the package is not installed (e.g. running from a
source checkout without `uv pip install -e .`), instead of crashing.
Eager typer callback, so `da --version` exits before subcommand
resolution and does not require any auth/config.
* feat(cli): da sync --dry-run + X/N progress indicator
--dry-run reports what would be downloaded/uploaded without hitting the
API or writing local state. Supports the full flag set (--table, --json,
--upload-only); JSON shape is {"dry_run": true, "would_download": [...],
"summary": {...}}.
Progress bar now shows "[X/N] Downloading <table>..." with a Rich
BarColumn + TaskProgressColumn + TimeElapsedColumn instead of a bare
spinner — makes long syncs visible.
* feat(cli): durable sync + server gzip + auto-update check (#41)
* fix(sync): atomic writes + manifest hash verification + retry on transient errors
Three durability hooks around stream_download and the sync command:
1. Atomic writes. stream_download now streams into `<target>.tmp` and
calls os.replace() on success, so the real target file never exists
in a half-written state. On failure the tmp is unlinked — no cleanup
leftovers, no guard needed at read time.
2. Retry with backoff. Transient errors (ConnectError, ReadError,
WriteError, RemoteProtocolError, TimeoutException, 5xx) are retried
up to 3× with 0.3s / 1s / 3s backoff. 4xx (auth, 404) surfaces
immediately — retrying those is pointless.
3. Manifest-hash verification. After download, sync.py computes MD5 of
the target (same 8KiB chunking as app/api/sync.py:_file_hash) and
compares against `server_tables[tid]["hash"]`. Mismatch ⇒ unlink,
record error, skip state commit. The PAR1 structural check survives
as a fallback for legacy manifests without a hash.
Also makes _rebuild_duckdb_views tolerant: single broken parquet is
skipped with a stderr warning instead of killing the whole rebuild.
Supersedes #40 — this commit is a strict super-set (hash check + PAR1
fallback + atomic write + retry). #40 can be closed without merging.
* perf(server): enable GZipMiddleware for JSON / HTML responses
GZipMiddleware at minimum_size=1024 shaves bandwidth on manifest-style
JSON endpoints (/api/sync/manifest, /api/version, …) and the /install
HTML preview. Parquet file downloads are already columnar-compressed so
the middleware sees limited benefit there — but it doesn't hurt, httpx
on the client side decompresses transparently.
Placed after session middleware so gzip wraps the session-Set-Cookie
response too, and before CORSMiddleware so compression is applied to
both cross-origin and same-origin responses.
* feat(cli): auto-check for newer CLI version on startup
Server side
- GET /cli/latest returns {version, wheel_filename, download_url_path}
for whatever wheel is currently in AGNES_CLI_DIST_DIR. Public,
cacheable, no secrets — consumed by the CLI auto-update probe.
Client side
- New cli/update_check.py: reads /cli/latest with a 3s timeout, caches
the result in $DA_CONFIG_DIR/update_check.json for 24h. Cache is
invalidated when the installed version changes (e.g. after a fresh
`uv tool install`) so stale "you're behind" warnings don't linger.
- Root typer callback fires the probe before subcommand dispatch; any
failure is swallowed so a bad network never blocks a working command.
- Outdated → one-line stderr warning:
[update] da 2.0.0 is out of date — latest on this server is 2.1.0.
Upgrade: uv tool install --force <server>/cli/wheel/<…>.whl
- Disable with DA_NO_UPDATE_CHECK=1.
* fix(pr-review): None-guard the upgrade line + skip gzip on parquet paths
Two follow-ups from Devin review on #41.
1. format_outdated_notice(UpdateInfo(download_url=None)) emitted literal
"uv tool install --force None" — copy-pasting that fails. Drop the
upgrade snippet when the URL is absent and keep only the version line.
2. GZipMiddleware compressed everything over 1024 bytes, including the
parquet FileResponses served by /api/data/{tid}/download,
/cli/wheel/{name}, and /cli/download. Parquet is already columnar-
compressed — gzip there is pure CPU + latency with no size win, and
/api/data bodies can reach hundreds of MB. Wrap GZipMiddleware in a
small _SelectiveGZipMiddleware that skips those path prefixes and
delegates the rest to the stock middleware. JSON / HTML endpoints
(manifest, /install, /api/version, …) still get compressed.
* release: bump to 2.1.0 — unify AGNES_VERSION with pyproject.toml version (#42)
Before: two independent version systems. pyproject.toml carried semver
(2.0.0 → wheel filename → `da --version`) while release.yml injected
CalVer into AGNES_VERSION (e.g. 2026.04.155 → /api/version). Users saw
different strings in the CLI vs. the /install page, and the CLI auto-
update check couldn't tell "new deploy, same package version" apart
from "new package version".
Make pyproject.toml [project].version the single product-version source
of truth. release.yml extracts it and feeds AGNES_VERSION, so every
surface (/api/version, /api/health, /cli/latest, `da --version`) agrees
on one number. The CalVer tag keeps doing what CalVer is for: release
identity on the git tag and Docker image tag (versioned_tag).
Also wires AGNES_TAG through the build: release.yml → Dockerfile ARG →
env, so /api/version.image_tag finally reports the actual image tag
instead of the "unknown" fallback.
Bump to 2.1.0 to reflect the PRs shipped on ps/wheel-name-fix: durable
sync (atomic writes + manifest MD5 + retry), server GZip, CLI auto-
update probe, setup snippet PEP 427 URL.
* fix(pr-review): directional version compare in is_outdated()
UpdateInfo.is_outdated() used `self.latest != self.installed`, which
fires in both directions. If the server is rolled back or the user
connects to an older deployment, the CLI would warn "out of date"
and — worse — the formatted notice would prompt
uv tool install --force <older-version>.whl
i.e. an unintended downgrade.
Compare with packaging.version.Version (PEP 440 aware, handles pre-
release tags). Fall back to dotted-int tuple compare if packaging is
somehow missing, and return False on unparseable strings — better to
miss an upgrade hint than to silently suggest a downgrade.
Adds 4 test cases: installed older (True), installed newer (False),
10.0.0 vs 2.1.0 lexical-compare trap (correct), unparseable strings
(False).
Addresses Devin review on #43.
* fix(pr-review): read FastAPI app version from package metadata
app/main.py:80 hardcoded `version="2.0.0"` in the FastAPI constructor.
After #42 bumped pyproject.toml to 2.1.0, /api/version, /cli/latest,
and `da --version` all reported 2.1.0 while /openapi.json and the
/docs UI still advertised 2.0.0.
Read `agnes-the-ai-analyst` version via importlib.metadata (same
pattern cli/main.py:_cli_version already uses), with a `"dev"`
fallback when the package is not installed (source checkout). This
way pyproject.toml stays the single source of truth across every
version surface — /openapi.json now tracks the bump automatically.
Adds a dedicated test file to pin this behavior so a future
regression to a hardcoded literal fails at CI.
Addresses second Devin finding on #43.
* fix(pr-review): _fmt_bytes PiB label + negative cache in update_check
Two more follow-ups from Devin review on #43.
1. _fmt_bytes off-by-unit. The old loop exited at TiB but the fallback
labelled PiB, so 1 PiB rendered as "1024.0 PiB". Restructure: put
every unit inside the loop (KiB through EiB) so the division count
always matches the label. Covers up to 1 ZiB cleanly; anything
beyond renders as "<big>.0 EiB" rather than crashing.
2. Negative cache for failed /cli/latest probes. On a corporate
firewall / VPN that silently drops packets, the 3s HTTP timeout
fired on *every* `da` invocation. Writing a `latest=None` cache
entry with a 5-minute TTL caps that at one probe per 5min. Successful
probes still use the 24h TTL. Reading logic branches on whether the
cached `latest` is None.
Adds TestFmtBytes (2 cases: small/medium sizes and the PiB/EiB fallback
regression), plus two TestSync update-check cases covering negative-
cache reuse and TTL expiry.
Adds a second tag to dev-channel image builds: when a branch is in the
form <prefix>/<whatever>, the image is also pushed as
ghcr.io/keboola/agnes-the-ai-analyst:dev-<prefix>-latest.
Enables per-developer dev VMs on GRPN (and elsewhere) to auto-deploy
without knowing the specific branch slug. Each VM pins its .env to
AGNES_TAG=dev-<prefix>-latest, and the auto-upgrade cron (5 min tick)
picks up the newly pushed image on the next run.
Common Git Flow prefixes are deliberately skipped so feature/*, fix/*,
hotfix/* etc. don't create noise tags. Matched list:
feature, fix, hotfix, bugfix, docs, chore, test, ci, ops, refactor,
perf, style, build.
Verified locally against several branch names:
zs/my-feature -> dev-zs-latest
vr/foo -> dev-vr-latest
pc/bar-baz -> dev-pc-latest
feature/xyz -> (skipped)
fix/bug -> (skipped)
main -> (no-op, stable channel)
test-no-slash -> (no-op, no slash)
Job-level 'if: secrets.X != ""' did not prevent workflow from being
scheduled on branch pushes (GitHub reports failure with 0 jobs in that
case). Refactored: first step is a guard that checks both the tag ref
pattern and the secret presence; downstream steps skip when the guard
says no.
Result: workflow now reports success with a clear warning annotation on
branch pushes or when the secret is absent; only real infra-v* tag
pushes with the secret set perform the bump.
* dryrun: intentional failing test (will be reverted)
* feat(auth): optional SEED_ADMIN_PASSWORD to pre-hash seed admin (dev helper)
Terraform gains enable_seed_password + seed_admin_password (sensitive) vars
on the customer-instance module; when enabled the password is piped via
startup-script into /opt/agnes/.env as SEED_ADMIN_PASSWORD. On first boot
app/main.py argon2-hashes it onto the seed user so the admin can log in
immediately without going through /auth/bootstrap. Never overwrites an
existing password_hash — safe against accidental reset on terraform apply.
* ci(release): build :dev-<slug> on any branch, not just feature/**
Before: only 'feature/**' branches triggered release.yml, so pushing
'zs/my-edit' or 'fix/bug' did not publish an image. dev_instances entry
pinning image_tag = 'dev-zs-my-edit' then crashed VM startup with
'image not found'.
Now: any branch push (except main, which produces :stable) publishes
:dev-<slug>. Slug strips a leading 'feature/' and replaces non-[a-z0-9-]
with '-', keeping existing feature/** behavior identical.
* Revert "dryrun: intentional failing test (will be reverted)"
This reverts commit cf9cc06a7884bb401ff29fc5cb6d8baf84dc3daa.
* dryrun: verify per-branch GHCR tag
* ci: propagate infra-v* tag bumps to template repo
On push of any infra-v* tag, opens a PR in keboola/agnes-infra-template
that bumps the module ref in terraform/main.tf. Auto-merge rules in the
template (Renovate + CI validate + GitHub native auto-merge) land it
without manual work on patch/minor bumps.
Requires repo secret TEMPLATE_REPO_TOKEN (fine-grained PAT with
Contents:write + Pull requests:write on keboola/agnes-infra-template).
Fail-soft: if secret is missing the job is skipped and Renovate on the
template repo picks up the new tag on its next cycle as a fallback.
* docs(onboarding): 'Keeping the template up-to-date' maintainer section
Documents the two mechanisms (upstream release hook + Renovate), the
required repo settings (allow_auto_merge, validate.yml gate), the TOKEN
secret setup, and the one-time setup checklist. Notes the difference
between template repo (auto-merge on) and customer infra repos
(human approval).
Completes the previous commit — bakes the full git SHA into the image ENV
at build time so the UI badge shows a real commit, not a sha256 digest
(which was the floating manifest digest and unhelpful for debugging).
Extracts branch name from GITHUB_REF, slugifies it, and adds as extra tag
on feature branch builds. Main branch is unaffected (no branch_slug output).
Enables dev_instances tfvar with image_tag pinning specific feature branches.
- Create empty .env before docker compose up in CI (env_file: .env is required)
- Mock get_jira_service in webhook HMAC test to isolate signature check
from Jira API availability — strict assert 200 instead of permissive 500
- secrets.py: validate file content is non-empty before using it;
regenerate if file exists but is empty/corrupted
- release.yml: touch .env before docker compose in smoke test
(env_file: .env in docker-compose.yml requires the file to exist)
663 tests pass.
- smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid
set -e abort when counter is 0 (bash returns exit 1 for ((0)))
- CalVer: use max(N) from existing tags instead of count, safe when
tags are deleted (e.g. deprecated version cleanup)
- CLAUDE.md: update schema version from v2 to v3
663 tests pass.
- CalVer retry loop now exits with error if all 5 attempts fail
(prevents pushing Docker image with unclaimed version tag)
- discover_tables endpoint reads data_source.keboola.url (consistent
with configure_instance and _discover_and_register_tables)
- Pre-migration snapshot flushes WAL via CHECKPOINT before copying
and copies .wal file if it still exists after flush
663 tests pass.
- _discover_and_register_tables reads from data_source.keboola.url
(matches what /api/admin/configure writes) instead of top-level
keboola.url which doesn't exist
- CalVer: claim git tag BEFORE Docker build with retry loop (up to 5
attempts). Prevents race where two concurrent CI runs get same N.
Git tag acts as a distributed lock for version uniqueness.
663 tests pass.
- Config writes to DATA_DIR/state/instance.yaml (writable) instead of
CONFIG_DIR (read-only :ro in Docker)
- instance_config.py checks DATA_DIR/state/ first, then falls back to
CONFIG_DIR for backward compat
- CalVer counter is now global across channels (*-YYYY.MM.*) per spec
- Keboola error messages sanitized — log full error, return generic msg
- chmod in secrets.py wrapped in try/except for Windows compat
- Setup wizard JS handles 401 (expired JWT) with user-facing message
- deploy.yml changed to workflow_dispatch only (no duplicate test runs)
- Smoke test uses docker-compose.prod.yml + AGNES_TAG instead of sed
- docker-compose.prod.yml uses ${AGNES_TAG:-stable} env var
663 tests pass. 8 E2E verification tests pass.
- Removed deploy-production job — Kamal config has placeholder IPs, no secrets
- Renamed workflow to "Build & Push" — test + Docker image to GHCR
- Added FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true to suppress Node.js 20 warnings
- All dependencies now in pyproject.toml [project.dependencies]
- Dev/test deps in [project.optional-dependencies] dev and [tool.uv]
- Dockerfile uses uv pip install . from pyproject.toml
- CI uses uv pip install ".[dev]"
- Deleted requirements.txt and requirements-dev.txt
- Updated README, CLAUDE.md install instructions
- Enhanced .dockerignore (exclude tests, docs, infra from image)
deploy-guard.yml referenced deleted tests and sudoers files.
deploy.yml.example used legacy SSH-based deployment.
Updated ci.yml and deploy.yml are in .gitignore (need workflow scope to push).
- SyncSettingsRepository + DatasetPermissionRepository with RBAC
- Script deploy/run/undeploy API with import sandboxing
- User sync settings API with permission checks
- 4 CLI skills (connectors, security, notifications, corporate-memory)
- Kamal production + staging configs
- GitHub Actions CI + deploy workflows
- 91 total tests passing
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.