* fix(keboola): per-table fallback to legacy Storage-API client
The DuckDB Keboola extension's per-table COPY fails with
`Schema '..."in.c-..."' does not exist or not authorized` on
projects whose Snowflake backend doesn't expose bucket schemas
to the storage-token-derived QueryService role
(keboola/duckdb-extension#17). ATTACH itself succeeds, so the
existing extension-level fallback in `_try_attach_extension`
never triggers — the table is just marked failed.
- Promote `kbcstorage>=0.9.0` from optional to core dep so the
legacy client import in `_extract_via_legacy` doesn't crash
default installs with `ModuleNotFoundError`.
- Wrap `_extract_via_extension` in a per-table try/except so a
scan failure retries via `_extract_via_legacy` instead of
recording `tables_failed` and moving on.
Slower than the extension path, but produces correct parquets
on affected projects while the upstream extension fix lands.
* test(keboola): cover per-table extension→legacy fallback
Two existing tests mocked _extract_via_extension to throw and asserted
the original message survived in result["errors"]. With per-table
fallback, the new flow retries via _extract_via_legacy — which on the
mock URLs would throw a different (404 / DNS-fail) error, replacing the
asserted message.
- Mock _extract_via_legacy alongside _extract_via_extension in
test_network_timeout_during_extraction +
test_partial_failure_continues +
test_all_tables_fail_returns_full_failure_stats so the assertion
observes the final propagated error from the fallback chain.
- Add test_extension_per_table_failure_falls_back_to_legacy that
exercises the new behavior directly: extension scan fails with the
QueryService schema-not-authorized message
(keboola/duckdb-extension#17), legacy succeeds, parquet ends up
queryable.
Patch release bundling the only Unreleased change: bump httpx client
timeout for agnes query --remote from 30s to 300s (configurable via
AGNES_QUERY_TIMEOUT). Renames CHANGELOG [Unreleased] section to
[0.35.1] and bumps pyproject version to match.
Five compounding defects on default `docker compose up` deploys made the
session pipeline silently broken: sessions uploaded by analysts via
`agnes push` landed on /data/user_sessions/<user>/*.jsonl but nothing
ever processed them. Fix is one PR: promote anthropic + openai to core
deps, wire all three LLM-pipeline jobs into scheduler-v2 with offset
cadences (10m/15m/17m), drop the side-car services from compose, seed a
default ai: block on first-time setup with an env-var fallback in code,
surface the pending review queue to admins, and expose a health check
that warns when uploaded jsonls aren't being processed.
**BREAKING** for operators on COMPOSE_PROFILES=full or with custom
Compose overrides referencing the corporate-memory or session-collector
service stanzas — drop them. The scheduler is now the sole driver.
LLM provider SDKs are imported by services/corporate_memory and
services/verification_detector — both production code paths. Listing
them only in [project.optional-dependencies].dev caused the scheduler
container to boot-loop with ModuleNotFoundError on default
`docker compose up` deploys, because the Dockerfile installs core
deps only (`uv pip install --system --no-cache .`).
Adds tests/test_packaging.py to lock the contract: anthropic + openai
must live in [project].dependencies, not in dev extras.
CHANGELOG: rename [Unreleased] → [0.32.0] — 2026-05-04, prepend a new
empty [Unreleased] for next-PR landing zone.
pyproject.toml: version 0.31.0 → 0.32.0.
Per repo discipline (memory: feedback_release_cut_with_pr.md), the
release-cut commit lands as the FINAL commit of the PR that contained
the user-visible behavior change — it does not get a separate PR.
After merge: tag v0.32.0 on the merge commit + create a GitHub Release
(memory: feedback_github_release_per_tag.md — the tag alone isn't
enough; the Release prose is the operator-visible artifact).
Headline: closes#160. da query --remote now resolves query_mode='remote'
BQ rows whose entity is VIEW or MATERIALIZED_VIEW (the bug Pavel hit).
Plus 4 reinforcing fixes — server-side cost guardrail (bq_max_scan_bytes,
default 5 GiB), registry-gating of direct bq.* paths, bigquery_query()
function-call backdoor closed, structured CLI render of typed BQ errors —
and one operator-side admin convenience (BQ test-connection endpoint +
billing_project placeholder UI).
14 issues caught and addressed across 6 iterations of Devin Review.
E2E verified on agnes-zsrotyr.groupondev.com (commit 7f743d03):
- VIEW path resolves (count=23 from active_inventory_view)
- VIEW aggregate parity vs filtered BASE TABLE
- cost guardrail rejects with structured 400 detail
- bq_path_not_registered 403 (incl. quoted "bq" variant)
- bigquery_query() blocklist 400
- test-connection endpoint 200 with elapsed_ms
* security(auth): per-IP rate limit on auth endpoints + generalize last-admin guard
Closes#45 and #151.
#45 — every auth endpoint was unthrottled (login, magic-link, token,
bootstrap), leaving us open to password brute-force and SMTP
email-bombing. Wires slowapi (new dep) into the middleware chain with
per-route limits: 10/min on login + token, 5/min on send-link, 3/min on
bootstrap. Returns 429 with Retry-After: 60 once exceeded. Per-IP key
respects the leftmost X-Forwarded-For hop (Caddy in front of the app
strips client-supplied XFF). Operator escape hatch:
AGNES_AUTH_RATELIMIT_ENABLED=0. Test suite disables the limiter via
autouse conftest fixture so existing auth tests that hammer endpoints
in tight loops are unaffected.
#151 — DELETE /api/admin/users/{id}/memberships/{group_id} and the
mirror DELETE /api/admin/groups/{group_id}/members/{user_id} only
guarded against self-removal as last admin. Generalizes to refuse
removing anyone from the seeded Admin group when they are the only
remaining active admin (mirrors the existing
count_admins(active_only=True) <= 1 check on delete_user / update_user).
Recovery from zero admins requires direct DB access, so this closes
a path where a scheduler/bootstrap actor that bypasses normal admin
checks could otherwise empty the group.
* security(auth): throttle remaining email-bombing + token-confirm endpoints
Address code-review gap on PR #165 — the first commit covered /send-link
but missed two endpoints with the IDENTICAL email-bombing surface:
- POST /auth/password/reset — sends reset mail, anti-enum response
- POST /auth/password/setup/request — sends setup mail, anti-enum response
Both now share the 5/min limit with /send-link.
Also add 10/min to the token-confirm surfaces — high-entropy tokens but
partial leaks via logs / referer have surfaced before, and unbounded
guess rate would let an attacker exhaust the keyspace adjacent to a
leaked prefix:
- POST /auth/email/verify
- GET /auth/email/verify — closes the click-through bypass
- POST /auth/password/reset/confirm
- POST /auth/password/setup/confirm
Doc fix: rate_limit.py module docstring + CHANGELOG entry no longer
claim "disable without a redeploy" (misleading). The Limiter constructor
freezes `enabled` from env at import time, matching every other Agnes
env knob — operators set the flag and bounce the container.
Tests: 4 new cases in test_auth_rate_limit.py covering
/reset, /setup/request, /reset/confirm, GET /verify. Full suite:
2583 passed, 32 skipped, 0 failed.
* security(auth): throttle JSON /auth/password/setup — closes form-throttle bypass
Second code-review pass on PR #165 caught a fifth gap: POST /auth/password/setup
(JSON variant, kept for backward compat) consumes the same setup_token as
the web form /setup/confirm but was unthrottled — an attacker brute-forcing
the token just switches from the form path to the JSON path and resumes
at unbounded RPS. Apply the same 10/min limit and signature shape used
on /setup/confirm.
Also extend CHANGELOG note about the JSON-variant bypass for future
operators reading the security entry.
Test: 1 new case (test_password_setup_json_rate_limited_after_10_requests),
9 rate-limit tests + 28 password-flow tests + 41 auth-provider tests pass,
no regressions.
* chore(release): cut 0.30.1 — auth security hardening (rate limit + last-admin guard)
* fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE
OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which
causes strict TLS stacks (rustls / webpki, used by uv, cargo, and
future versions of pip) to reject the cert with
'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9.
Browsers, curl, and OpenSSL-based clients tolerated the violation,
hiding the bug until a uv user hit it.
Affects every VM running on the self-signed fallback while the corp
PKI hasn't published the real chain yet. Fix lands on the next
agnes-tls-rotate.timer tick (or 'systemctl start
agnes-tls-rotate.service' for an immediate refresh). Existing CSR /
real-cert paths unaffected; only the bring-up fallback regenerates.
* chore(release): cut 0.29.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template
Closes#153.
The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.
Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.
## Changes
### config/claude_md_template.txt (+83)
Added a `## Remote Queries (BigQuery)` section covering:
- Discovery first — `da catalog --json | jq '...'` to see all tables
with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
- `da fetch` (preferred) — materialize a filtered subset locally,
query the snapshot, drop when done.
- `da query --remote` — one-shot server-side execution (cheap probes).
- `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
--select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.
### docs/setup/claude_md_template.txt (deleted)
Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.
### CHANGELOG.md
`### Fixed` and `### Removed` entries under [Unreleased].
## Tested
- Manually walked the diff against `da skills show agnes-data-querying`
output on a live VM (foundryai-development) — patterns + flags
match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
is identical to existing template substitution path so render is
not at risk.
## Out of scope
- The companion gap that data_description.md often only enumerates
query_mode: "local" tables (no signal that other modes exist) —
separate concern, fix likely belongs in the metadata generator
on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
`query_mode: "remote"` in the registry — workflow improvement, not
a code bug.
* chore(release): cut 0.28.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration
BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.
Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
→ drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
s historic FK constraints. INSERT picks intersection sloupců, takže
test fixtures s minimal pre-v19 schemou migrate cleanly.
Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
(TABLE je unconditionally enabled)
API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
gate přepnut na can_access(user_id, "table", dataset, conn)
Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
(admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()
CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
(back-compat se starými token.json soubory zachována — pole se ignoruje)
Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
(can_access_table over resource_grants), test_journey_rbac.py
(drop access-request flow), test_resource_types.py (drop env-gate
tests, drop is_public from helpers), test_v2_*.py (drop role-based
user dicts in favor of id-based + Admin group membership),
test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
test_can_access_table_no_implicit_public (test_rbac.py),
test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated
Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)
Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.
Plan: ~/.claude/plans/floofy-coalescing-parnas.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(release): cut 0.27.0
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* refactor(ops): bake all host artifacts into image, drop every curl-from-main
Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.
## Motivation
The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:
docker-compose.yml
docker-compose.prod.yml
docker-compose.host-mount.yml
docker-compose.tls.yml
Caddyfile
scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0)
Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:
1. **Split-brain version pinning.** image_tag pins the docker image to an
immutable digest. The compose files + script bypassed that pinning by
tracking `main` (or the rarely-set compose_ref). A customer pinned to
stable-2026.04.516 could wake up tomorrow with their host artifacts
floating on whatever shipped to main overnight — even though they're
explicitly pinned for stability.
2. **No rollback knob.** Reverting a bad host artifact meant reverting
the upstream PR globally — affects every customer that reboots after
the bad commit. No "rollback for me only" path; tag-pinning gave no
protection.
3. **Public-internet dependency on every boot.** The image is already
pulled from a private registry on the same boot. Reusing that channel
is strictly cheaper than adding a second one. Customers with restricted
egress (no raw.githubusercontent.com reachability) silently broke on
every boot.
## Changes
### Dockerfile (+19 -8)
After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:
agnes-auto-upgrade.sh (mode 0755 — host cron driver)
docker-compose.{yml,prod,host-mount,tls}.yml
Caddyfile (mode 0644)
Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board,
stable path that won't shift if /app structure refactors.
### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)
Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:
docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
chmod +x /usr/local/bin/agnes-auto-upgrade.sh
The auto-upgrade section (#6) is now a no-op — script is already in place.
### infra/modules/customer-instance/variables.tf (+1 -1)
`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.
### CHANGELOG.md
`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).
## Out of scope (filed as follow-ups)
1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
curls the same artifacts from main. Symmetric fix needed there.
Will file as a separate PR against the infra repo.
2. **Self-update inside agnes-auto-upgrade.sh** after a successful
`docker compose pull` of a new digest. Otherwise the running cron
keeps using the OLD baked-in script for one tick after image upgrade.
~10 LOC. Deferred to keep this PR scoped.
3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
bash currently sourced via the infra repo. Should follow the same
bake-into-image pattern.
## Tested
- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
this Dockerfile change. **Recommend reviewer trigger CI build, then
do a single VM-recreate against a dev VM (e.g. foundryai-development)
to confirm the extract path works end-to-end before merge.**
## Compatibility
- Existing VMs running 0.25.0 are unaffected — they have host artifacts
in place from `curl from main` already; this PR doesn't touch them.
They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
in the image) would FAIL the docker cp. Current diff fails-loud (no
fallback). Recommend operators upgrade to a fresh-enough image_tag
alongside the template upgrade — same coupling as any compose-flag bump.
* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances
The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.
Devin ANALYSIS_0004 on PR #149.
* fix(changelog): mark BREAKING + drop private-repo reference
Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.
Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.
Devin BUG_0001 + WARN_0001 on PR #149.
* chore(release): cut 0.26.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted
Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.
## Why an app-side guard
agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.
The boot-time fixes in infra are preventive. This is the runtime backstop.
## Behavior
Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:
1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
- Mount the config disk if /data/state is not a mountpoint.
- Detect mismatch: if /data/state is mounted from the wrong source,
umount and retry.
2. After the loop, assert findmnt source matches the config disk.
- On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
marks the service failed; no docker compose action runs; existing
containers (if any) keep running on stale state, but no new write
lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
on every run. Idempotent. Guards against propagation regressions
sneaking back in via future docker / kernel changes.
VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.
## Tested
Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.
## Rollout
- This file is fetched by infra startup.sh from
raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
boot. Once merged to main, all VMs pick up the new script on their
next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
`scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
/usr/local/bin/agnes-auto-upgrade.sh` (already done on
foundryai-development).
* chore: vendor-agnostic comment + changelog text
Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.
* fix(ops): suppress mount stderr in retry loop
Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.
Devin BUG_0001 on PR #146.
* fix(changelog): move auto-upgrade entry to [Unreleased]
Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.
Devin BUG_0001 on PR #146.
* fix(infra): single-source agnes-auto-upgrade.sh via curl from main
Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).
Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling
The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.
Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.
Devin BUG_0001 + ANALYSIS_0001 on PR #146.
* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing
Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.
Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).
Devin BUG_0001 on PR #146.
* chore(release): cut 0.25.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* docs(spec): #134 unify BigQuery access behind BqAccess facade
Brainstorm output for issue #134. Captures:
- root cause (incl. correction of the issue's hypothesis about commit 33a9964)
- BqAccess facade API + project resolution rules
- error contract — typed BqAccessError mapped to HTTP 502 for upstream
BQ failures, 500 for deployment/config bugs
- migration plan for v2_scan, v2_sample, RemoteQueryEngine
- test rewrite eliminating _bq_client_factory injection point
- E2E verification protocol on agnes-development as success criterion
* docs(spec): #134 revise after first review
Incorporates code-reviewer findings:
Must-fix:
- Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope.
- Reframe v2_scan headline: missing try/except around BQ calls is the
actual cause of bare 500s, not project resolution (which 33a9964 fixed).
- List two more deferred call sites (extractor.py, register_bq_table)
with explicit rationale.
Important:
- Drop billing != data clause from cross_project_forbidden heuristic;
rely only on 'serviceusage' substring. billing != data is normal
for cross-project setup, was over-classifying.
- Split bq_bad_request into _user (400) and _server (502) variants;
add sql_origin parameter to translate_bq_error so call sites declare
whether SQL contains user input.
- Add @functools.cache to BqAccess.from_config; document tests bypass
via dependency_overrides.
- Replace monkey-patched-classmethod test pattern with
BqAccess(client_factory=...) injection at construction time. Cleaner
than today's _bq_client_factory and 1:1 migration shape.
- Keep BqProjects.data (reviewer assumed registry has source_project;
it doesn't). Multi-project explicitly listed as non-goal with note.
Nice-to-have:
- Add 'Implementation strategy' section: 2 staged commits (bug fix
alone is revertable; refactor follows).
- Extend E2E protocol to cover all three endpoints, not just /sample.
- Note removal of stale docstring at src/remote_query.py:204.
* docs(spec): #134 revision 3 — incorporates second-round review
Must-fix from second review:
- v2_schema split into two migration cases: _fetch_bq_schema translates
errors via translate_bq_error; _fetch_bq_table_options preserves its
swallow-all 'except Exception → return {}' so /schema doesn't 502 on
partition-info failures.
- RemoteQueryEngine.__init__ now resolves BqAccess lazily (in
_get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests
in test_remote_query.py would suddenly fail with not_configured.
- translate_bq_error pass-through for BqAccessError is now load-bearing
(clause 1, before any Google-API branch). bq.client() raises BqAccessError
for bq_lib_missing/auth_failed; without explicit pass-through those
fall to 'unknown' and re-raise as bare 500.
- Commit 1 now emits the SAME structured response shape as commit 2 to
avoid contract churn between commits.
- BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments
— flagged in CHANGELOG ### Changed.
Editorial:
- sql_origin renamed to bad_request_status with values 'client_error' /
'upstream_error' (clearer about what the parameter actually decides).
bq_bad_request_user/_server kinds collapsed to bq_bad_request (400)
and bq_upstream_error (502).
- CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller;
unaffected because new bq_access kwarg has default None.
- Added unit/integration tests for the new contracts:
test_translate_passes_through_BqAccessError,
test_v2_scan_returns_500_on_bq_lib_missing,
test_v2_schema_returns_200_with_empty_partition_on_bq_failure,
test_resolve_succeeds_after_config_set.
- E2E protocol now covers /schema as the fourth endpoint.
- Documented functools.cache-doesn't-cache-exceptions semantics and
fixture nullcontext-doesn't-close caveat for nested sessions.
* docs(spec): #134 revision 4 — incorporates third-round review
Third reviewer verdict: 'implementation-ready with two trivial edits';
explicitly noted prior rounds did the heavy lifting.
Edits:
1. get_bq_access() module-level function instead of @classmethod
@functools.cache from_config. Removes the classmethod-cache stacking
footgun (different Python versions wrap differently) and gives FastAPI's
dependency introspection a clean function signature. Drops the
'Do not subclass BqAccess' caveat that no longer applies.
2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample),
_bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema
(v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all
in commit 1 — preserved as-is, then migrated (still preserved) in commit 2.
All three endpoints emit the same structured body shape so client parsers
see one consistent contract throughout the staged rollout. No more
half-rolled-out window where /sample is bare 500 while /scan is
structured 502.
* docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification)
Bite-sized TDD tasks. 3 phases, 16 tasks total:
Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints:
Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan,
_fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options
preserved untouched. CHANGELOG Fixed entries.
Phase 2 (Commit 2) — BqAccess facade extraction + migration:
Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up
(BqProjects, BqAccessError, translate_bq_error, default factories,
BqAccess class, get_bq_access module-level cached). Task 2.6 adds
conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema
to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests
(lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG
Changed BREAKING + Internal.
Phase 3 — Verification:
3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual
E2E on agnes-development per spec protocol → close#134.
Self-review table maps spec sections to implementing tasks; no gaps.
* fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema
Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict
block only) with try/except for google.api_core exceptions, translating to
HTTPException with a structured body shape: {error, message, details}.
Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500
with no body when the SA on agnes-development hit cross-project Forbidden
on serviceusage.services.use.
Also fixes /sample's missing billing_project fallback (the bug 33a9964
fixed for /scan never landed here).
Status code split:
- /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is
user-derived from req.select/where/order_by.
- /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is
server-constructed from validated identifiers.
- All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in
error message (with hint pointing at data_source.bigquery.billing_project),
else bq_forbidden.
Body shape matches what the upcoming BqAccess refactor (next commit) will
produce, so client-side parsers see one consistent contract throughout
the staged rollout.
_fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty
contract is intentional and survives into the refactor; /schema continues to
return 200 with empty partition info when partition queries fail.
Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema
endpoints exist only to make the test pattern (monkeypatching whole
_fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal
once BqAccess centralizes translation.
* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine
Extracts the duplicated BigQuery-access pattern (project resolution +
client construction + DuckDB-extension session + Google-API error
translation) into connectors/bigquery/access.py. Migrates four
call sites to use it:
- app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan
- app/api/v2_sample.py — _fetch_bq_sample
- app/api/v2_schema.py — _fetch_bq_schema (strict translation),
_fetch_bq_table_options (preserves swallow-all best-effort contract)
- src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg
The new module exposes:
- BqProjects (frozen dataclass: billing + data project IDs)
- BqAccessError (typed exception with HTTP_STATUS class mapping)
- BqAccess (facade with injectable client_factory/duckdb_session_factory
for tests; defaults call the real google-cloud-bigquery + DuckDB extension)
- get_bq_access (module-level @functools.cache; FastAPI Depends target)
- translate_bq_error (Google API exception → BqAccessError mapper, with
BqAccessError pass-through, 'serviceusage'-substring heuristic for
cross_project_forbidden, and bad_request_status param distinguishing
user-derived (400) from server-constructed (502) SQL)
- _default_client_factory, _default_duckdb_session_factory
RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests
migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only
RemoteQueryEngine tests need no changes — bq_access defaults to None and
get_bq_access() is only invoked on first BQ call (lazy resolution).
BqAccessError raised internally is translated to RemoteQueryError(
error_type="bq_error") in _get_bq_client to preserve the engine's
existing public contract — CLI and /api/query/hybrid callers see no change.
Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample,
test_v2_schema) migrate from monkey-patching whole _fetch_* functions
to using the new bq_access fixture in tests/conftest.py — which
exercises the REAL translation path through BqAccess + translate_bq_error,
closing the test gap flagged in Task 1.1's review.
Side-effect behavior change: v2_sample's FROM clause now uses the data
project (instance.yaml data_source.bigquery.project), not the conflated
billing_project from Phase 1. Documented in CHANGELOG ### Internal.
BREAKING for deployments combining BIGQUERY_PROJECT env var with
data_source.bigquery.project in instance.yaml — env var now overrides
data project too. See CHANGELOG ### Changed.
Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py,
scripts/duckdb_manager.register_bq_table) explicitly out of scope;
tracked as follow-up.
Removed stale docstring at the previous src/remote_query.py:204
that referenced scripts.duckdb_manager._create_bq_client as the default
BQ client factory (RemoteQueryEngine never actually used that function).
Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py +
tests/test_remote_query.py migrated to bq_access fixture (counts unchanged
or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures
(DB migration tests with unrelated internal_roles DependencyException —
not introduced by this PR).
* fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed)
CI on PR #138 caught: bigquery.Client(...) resolves Application Default
Credentials at construction time; without ADC (CI without SA key, dev
laptop without 'gcloud auth application-default login') it raises
google.auth.exceptions.DefaultCredentialsError synchronously.
Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError
propagated as raw exception — and from production endpoints would surface
as bare 500 (the exact failure mode #134 sets out to fix).
Now translates to BqAccessError(kind='auth_failed', details.hint='Run
gcloud auth application-default login...'). Endpoint catch chain returns
HTTP 502 with structured body. Adds unit test
test_raises_auth_failed_on_default_credentials_error.
Third-round spec review flagged this case in passing; the fix didn't land.
CI's auth-less environment surfaced it.
* fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured
Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends
breaks all v2 endpoints for non-BigQuery instances'.
Pre-fix: get_bq_access() raised BqAccessError(not_configured) when
neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set.
Because FastAPI resolves Depends() BEFORE the endpoint body runs, this
exception fires during dep-injection — the endpoint's try/except
BqAccessError clause never gets a chance to catch it. Result: every
v2 request on Keboola-only or CSV-only instances returned bare HTTP
500, even for local-source tables that never touch BigQuery.
Fix: get_bq_access() now returns a sentinel BqAccess with empty
BqProjects and factories that raise BqAccessError(not_configured)
on actual use. Construction succeeds, FastAPI's dep-injection cleanly
yields the sentinel, the endpoint runs. The local-source code path
in build_sample / build_schema / etc. never calls bq.client() or
bq.duckdb_session() (it reads parquet directly), so non-BQ tables
return 200 as before. Only when an endpoint actually tries to query
BQ (source_type == 'bigquery') does the sentinel raise — and the
endpoint's existing except BqAccessError catches it normally,
returning structured 502 with hint.
Test get_bq_access::test_raises_not_configured_when_neither_set
renamed and rewritten to test_returns_sentinel_when_neither_set:
asserts BqAccess is returned, then asserts client() and
duckdb_session() each raise BqAccessError(not_configured) on call.
Test test_does_not_cache_exceptions removed (no longer applicable)
and replaced with test_sentinel_is_cached_per_process documenting
the operator-restart-on-config-change contract.
* docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule)
Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained
customer-specific deployment hostnames, deployment names, and a GCP
project ID that violated CLAUDE.md's vendor-agnostic OSS rule
('Nothing customer-specific belongs in code, configuration defaults,
comments, docs, commit messages, PR titles, or PR bodies').
Replacements:
agnes-development.groupondev.com -> <your-agnes-host>
agnes-development -> <your-dev-instance>
prj-grp-dataview-prod-1ff9 -> <your-data-project>
s1_session_landings -> <bq_table_id>
E2E verification semantics unchanged — operators still run the same
four curls + config flip + retry, just substituting their own host /
deployment name / project / table.
* fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache
Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at
process level, so it captures BigQuery project IDs at first call and
ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints
re-read get_value() on every request, so admin /api/admin/server-config
saves (which call instance_config.reset_cache()) hot-reloaded the BQ
project. Without this fix, my refactor silently regresses that contract
— operators editing instance.yaml via the admin UI would see no effect
on v2 endpoints until container restart.
instance_config.reset_cache() now also calls
connectors.bigquery.access.get_bq_access.cache_clear() (lazy import,
swallowed if connectors module isn't loaded — keeps instance_config
usable in isolated unit tests).
Adds test_instance_config_reset_cache_invalidates_get_bq_access as
regression guard. Updates CHANGELOG Internal entry to mention the
hot-reload contract + the not-configured sentinel behavior (round-3
fix from Devin BUG_0001 was previously only in commit message).
* fix(bq_access): surface not_configured before identifier validation + plan path genericize
Devin BUG_0001 + BUG_0002 round 5 on PR #138.
BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md
vendor-agnostic rule. Replaced with '<worktree-root>' placeholder.
BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel
BqAccess (BQ not configured), the empty bq.projects.data was reaching
validate_quoted_identifier first and raising ValueError -> endpoint
mapped to HTTP 400 'unsafe_identifier' instead of structured 500
'not_configured' with hint.
Each fetch helper now checks 'if not bq.projects.data: bq.client()' as
the first step, which triggers the sentinel's BqAccessError(not_configured).
Endpoint catches the typed error and returns HTTP 500 with hint pointing
at data_source.bigquery.project. Best-effort _fetch_bq_table_options
returns {} silently in this case (preserves the swallow-all contract).
* fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match
Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery
extension is a C++ plugin making its own HTTP calls — when BQ returns
403, it throws duckdb.IOException with the BQ error embedded as text,
not gax.Forbidden. translate_bq_error's isinstance checks would miss
these, falling to case 7 → bare 500 in production for v2_scan, v2_sample,
and v2_schema (the bigquery_query() paths).
Fix: last-resort string-match heuristic before the re-raise. 'Forbidden'
/ '403' / 'Bad Request' / '400' in the lowercased message classifies via
the same kind hierarchy. The 'serviceusage' substring still distinguishes
cross_project_forbidden from bq_forbidden. Specific enough that random
exceptions without HTTP-error keywords still re-raise.
Adds 4 unit tests covering the new heuristic + the 'don't swallow random
exceptions' invariant.
* chore(release): cut 0.22.0
PR #138 contains issue #134 user-visible behavior changes:
- BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml
data_source.bigquery.project for v2 endpoints (previously
RemoteQueryEngine billing only).
- Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate,
/schema when BigQuery raises Forbidden/BadRequest (was bare 500).
- Internal: BqAccess facade refactor unifying four duplicate BQ-access
call sites; instance_config.reset_cache() now invalidates BqAccess
cache too so admin server-config saves hot-reload BQ project IDs.
Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
Bootstraps the Agnes Claude Code marketplace + RBAC-allowed plugins from
the dashboard CTA, and inlines the server's TLS cert when the chain isn't
publicly trusted (self-signed / private CA). Cross-platform setup prompt
covers Windows Git Bash, macOS, Linux. Includes Bun-compiled `claude` fix
(macOS goes via git-clone fallback, same as Windows), PAT stripping after
clone, explicit error handling, and four rounds of Devin Review fixes
(phantom step references, $PLATFORM re-detection, heredoc/awk line-count
sync). Cuts 0.21.0.
See CHANGELOG.md [0.21.0] section for details.
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret
Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:
1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
from the scheduler container, racing the app's long-lived system.duckdb
handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
"Could not set lock on file ... PID 0".
2. The data-refresh + new marketplaces jobs both 401-ed on the API because
SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
The scheduler had no credential to authenticate with.
Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
through the app process so it inherits the existing DB connection.
- Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and
the scheduler is reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
startup script generates a 256-bit secret on first boot, persists it
across reboots, and writes it to /opt/agnes/.env. Both containers source
the same .env. The app validates incoming Bearer tokens against the env
var (constant-time, length-floored) and resolves matches to a synthetic
scheduler@system.local user that's a member of the Admin system group.
Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
has a valid actor; lazy seed in get_scheduler_user covers token rotation
before the next app restart.
Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.
Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.
* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127
Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.
* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127
Two findings from Devin re-review on commit 5fbad15:
1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
event loop. sync_marketplaces() does blocking I/O (subprocess git
clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
and would freeze every concurrent request for the duration of a bulk
sync. Switched to plain def so FastAPI auto-routes to the thread pool.
2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
Bulk marketplace sync iterates the registry under a single lock with
up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
scheduler then sees a timeout, doesn't update last_run, and re-fires
on the next 30s tick, queueing redundant work. Per-job timeout
override added to the JOBS tuple; marketplaces gets 900s (15 min),
data-refresh keeps 120s, health-check 30s.
* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127
require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.
Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.
Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392
also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR
and I agree; tracking separately.
* release(0.17.0): cut + clean up CHANGELOG duplicates
Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).
Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).
Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0] — the canonical short summaries (with GitHub-release
links) already exist below 0.16.0, the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
Adds /admin/server-config UI for editing instance.yaml from the web. Hardening: SSRF gate on data_source URLs, narrow-overlay write strategy, atomic writes, audit log with secret masking on shape changes, threading lock on read-modify-write, corrupt-overlay refusal on write side + louder log on read side, modal Promise resolution on backdrop dismiss, sentinel scrub on save (defense-in-depth client+server). Bundles Windows PowerShell wrapper from #80. Cuts release v0.13.0.
* fix(security+ops): #82#85#87 — auth hardening, API validation, deploy posture
Security and operational hardening across three issue groups:
- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs
- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename
- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)
Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety
Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
warning log — prevents unhandled exception if DB write fails after
successful token consumption
- Add test for 169.254.169.254 SSRF rejection
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak
Address Devin Review findings on PR #104:
1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
ipaddress module checks. The old regex patterns like `fe80:` only
matched up to the first colon, missing real IPv6 addresses like
`fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
hostname via getaddrinfo and checks each resulting IP against
ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.
2. CLI commands broken: `da setup test-connection`, `da setup verify`,
`da diagnose`, `da status` all called /api/health expecting the old
format (status=="healthy", services dict). Now they call
/api/health/detailed for service-level checks (with graceful fallback
to the minimal endpoint when auth is not configured).
3. Temp file handle leak: _stream_to_temp returns an open
NamedTemporaryFile; callers now close it before shutil.move() to
prevent FD leaks until GC.
Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): download regex blocks hyphenated IDs; document health split
Address Devin Review round-3 findings on PR #104:
1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
endpoint used the strict SQL-identifier regex which does not allow
dots or hyphens, but Keboola table IDs like in.c-crm.orders
contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
and hyphens while still blocking path-traversal chars (/, .., \)
and quote/control characters. Added test for hyphenated/dotted IDs.
2. Documented health endpoint split in DEPLOYMENT.md: Added Health
checks & external monitoring section explaining both endpoints
(minimal unauth /api/health vs authenticated /api/health/detailed)
and how to wire external monitoring tools to the detailed endpoint
with a PAT.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening
* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)
Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.
Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.
New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).
---------
Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Pre-1.0 SemVer convention: BREAKING changes land in MINOR. This release is
heavy on those — RBAC v13 schema rewrite, schema v14 FK constraints,
marketplace admin god-mode drop, internal_roles/group_mappings/user_role_grants
tables removed, exit code 2 from Keboola extractor (partial fail), Script API
admin-only, scripts/grpn/ → scripts/ops/.
CHANGELOG.md: rename Unreleased → [0.12.0] — 2026-04-28; append a fresh
empty Unreleased above so the next PR has somewhere to land.
pyproject.toml: 0.11.5 → 0.12.0.
Tag the merge commit as v0.12.0 + push the tag after merge to main.
Follow-up to the RBAC v13 + marketplace work in the parent commit. Addresses
deferred Devin findings, gemini-flagged blockers, and adds three guard rails.
== Schema v14 — FK constraints on user_group_members + resource_grants ==
Adds DuckDB foreign-key constraints so cascade deletes can no longer leave
orphaned member / grant rows pointing at a deleted group_id (which were
relying on application-level cascades up to v13). Migration is RENAME →
CREATE-with-FK → INSERT → DROP, wrapped in BEGIN TRANSACTION so a partial
failure rolls back without leaving the DB at a half-applied schema.
== AGNES_ENABLE_TABLE_GRANTS feature flag (default off) ==
ResourceType.TABLE was shipped in the parent commit as listing-only — admins
can record grants but runtime enforcement still flows through legacy
dataset_permissions. To avoid the misleading-UX surface area, the chip is
hidden from /admin/access and POST /api/admin/grants returns 422 with the
env-var name in detail until the operator opts in. Existing TABLE rows in
resource_grants stay listable + deletable so cleanup is never blocked.
Helpers: is_resource_type_enabled(rt), enabled_resource_types().
== Break-glass admin CLI ==
`da admin break-glass <user>` adds the user to the Admin user_group with
source='system_seed' regardless of RBAC state. Bypasses authentication —
relies on filesystem access to ${DATA_DIR}/state/system.duckdb implying
host-level trust. Recovery path when the operator has locked themselves
out of /admin/access.
== Devin round-2 fixes (deferred on b4ec4c4) ==
- src/repositories/user_groups.py — narrow update() guard from blocking any
mutation on system groups to blocking name change only. Description edits
now pass through. Endpoint pre-check stays as defense-in-depth. Prior
behavior surfaced as a misleading 409 'Cannot rename a system group' on
description-only PATCH.
- app/api/access.py:delete_group — wrap cascade DELETEs + repo.delete in
BEGIN TRANSACTION / COMMIT / ROLLBACK. Prevents orphan rows if any
DELETE fails after the user_groups row is gone.
- app/marketplace_server/{packager,router}.py — split compute_etag_for_user()
from build_zip(); router resolves etag first and 304-shorts before any
file read or ZIP_DEFLATED. In-process cachetools.TTLCache (default 120s,
env-tunable via AGNES_MARKETPLACE_ETAG_TTL, set 0 to disable).
invalidate_etag_cache() called by sync to force re-hash on content drift.
== Tests ==
- TestTableGrantsFeatureFlag (4 cases) — endpoint exclude/include, grant
rejection/acceptance under the flag.
- test_v12_to_v13_finalize_rollback_on_failure — destructive: monkeypatches
_seed_system_groups to raise mid-transaction, asserts schema_version stays
at 12, legacy tables intact, new tables empty (rollback fired). Then
restores the real function and asserts the retry succeeds.
- test_update_system_group_description_allowed,
test_update_system_group_same_name_no_op — repo-level coverage of the
narrowed guard.
This squashes 13 commits from ma/staging plus a small docstring translation
into a single coherent unit. Three workstreams.
== RBAC v13 redesign ==
- Drops core.viewer/analyst/km_admin/admin hierarchy and the
internal_roles / group_mappings / user_role_grants / plugin_access tables.
- Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill
wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry.
- Two authorization primitives in app.auth.access:
require_admin — Admin-group god-mode
require_resource_access(rt, "{path}") — entity-scoped grants
Single DB lookup per request; no session cache; no implies BFS.
- /admin/access UI (single page) replaces /admin/role-mapping +
/admin/plugin-access. CLI `da admin group/grant *` replaces
`da admin role/mapping/grant-role/revoke-role/effective-roles`.
- ResourceType.TABLE listing-only — admins can record table grants,
runtime enforcement still flows through legacy dataset_permissions
(migration plan in docs/TODO-rbac-data-enforcement.md).
== Claude Code marketplace ==
- Aggregated /marketplace.zip + /marketplace.git/* (PAT-gated,
RBAC-filtered, content-addressed cache via dulwich).
- Admin god-mode dropped on the marketplace surface — admins curate
their own view via grants like everyone else.
- Bare-repo cache materializes per RBAC-filtered ETag; stale entries
not pruned in this iteration (disclaimed in git_backend.py docstring).
== #81#83#44 security/ops hardening ==
- #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias).
- #81 Group B — Keboola extractor 3-state exit codes:
0 success / 1 total fail / 2 PARTIAL fail
Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary
alerting must teach it the new partial signal.
- #81 Group C — schema v10 view_ownership; rejects silent overwrite
of a prior connector's view name on collision.
- #81 Group D — extractor-side identifier validation.
- #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset
+ path-traversal fix.
- #44 — entire /api/scripts/* surface is admin-only (planted-script +
sandbox-bypass risk closed).
== Web UI polish + deploy fix ==
- /admin/access: live grant-count badges (no stale snapshot revert),
shared-header CSS link added to /catalog and /admin/{tables,permissions},
per-resource-type colored stripes.
- docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't
silently shadow sub-mounts and write state to the wrong disk.
== OSS vendor-neutralization (waves 1+2) ==
- scripts/grpn/ → scripts/ops/. Customer-specific identifiers
(project IDs, internal hostnames, dev/prod VM IPs, brand names)
replaced with placeholders across code, docs, Terraform, Caddyfile,
OAuth probe, and planning docs. Downstream infra repos that copied
scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must
update the path.
== Translation ==
- src/repositories/user_groups.py::ensure_system docstring translated
from Czech to English for codebase consistency.
Co-authored-by: Mina Rustamyan <mina@keboola.com>
Cuts 0.11.5 with all the [Unreleased] bullets that landed on top of PR #73
between commit a899877 (the original "v0.11.4" tag in the chain) and the
final merge commit on main. No new public-API surface; the user-visible
payoff is that v8→v9-migrated installations work end-to-end (login flows,
GET /api/users, admin nav, the new role-management REST API and its
last-admin protection) and `make local-dev` startup is finally quiet.
Bullets covered (full text in CHANGELOG.md [0.11.5]):
- _hydrate_legacy_role re-resolves from grants on every request — fixes
privilege-retention after grant revoke via the role-management API.
- Dev-bypass + OAuth callback now pass user_id to resolve_internal_roles
so direct grants land in the session cache (not the DB-fallback path).
- GET /api/users hydrates user dicts before Pydantic validation
(HTTP 500 on every migrated install) + same fix for update/delete
paths so last-admin protection triggers on migrated admins.
- Scheduler stopped spamming POST /auth/token 401 — the auto-fetch
fallback was always broken; SCHEDULER_API_TOKEN is now the only path.
- POST /auth/token / Google OAuth / password / email-magic-link all
hydrate user["role"] before issuing the JWT (Pydantic 500 + wrong
token payload). New TestAuthLoginFlowsPostMigration regression class.
- docs/RBAC.md no longer documents the non-existent implies= keyword
on register_internal_role.
- _seed_core_roles now actually runs on every connect (the docstring
was lying — only ran during fresh install + v8→v9). New
TestSeedCoreRolesSafetyNet regression class.
This commit also adds:
- AuthlibDeprecationWarning suppression at app/main.py top — upstream-
internal forward-compat note from authlib._joserfc_helpers, not
actionable on our side. Filter is targeted by class (with a
message-based fallback) so other DeprecationWarnings remain visible.
- pyproject.toml version: 0.11.4 → 0.11.5.
- CHANGELOG.md: [Unreleased] → [0.11.5] — 2026-04-27, new empty
[Unreleased] skeleton appended for the next PR to land on.
Tag v0.11.5 follows; keboola-deploy-v0.11.5 tag triggers the
keboola-deploy.yml workflow for agnes-dev.keboola.com.
* feat(auth): v9 schema — unified role management foundation (WIP)
Tasks 1-5, 10 of the role-management-complete plan. Foundation only,
follow-up commits add REST API, CLI, UI, and tests.
Schema v9:
- user_role_grants table: direct user → internal_role mapping
(complementary to group_mappings). Drives PAT/headless auth and
persists across sessions. Source field tracks 'direct' vs auto-seed.
- internal_roles.implies (JSON): transitive role hierarchy. core.admin
implies core.km_admin → core.analyst → core.viewer. Resolver does BFS
expand at lookup time.
- internal_roles.is_core (BOOL): distinguishes seeded core.* hierarchy
from module-registered roles. UI renders them differently.
- v8→v9 migration: ADD COLUMN, CREATE TABLE, _seed_core_roles +
_backfill_users_role_to_grants, then NULL legacy users.role values.
DuckDB FK constraint blocks DROP COLUMN — sloupec zůstává jako
deprecated artifact (UserRepository ignoruje), fyzický drop deferred.
Resolver:
- Regex extended to allow dotted namespace (core.admin,
context_engineering.admin), max 64 chars total.
- expand_implies(role_keys, conn): BFS over implies JSON column.
- resolve_internal_roles signature gains optional user_id parameter;
unions group-mapping resolution with user_role_grants direct grants
before implies expansion.
require_internal_role:
- Two-path resolution: session cache (OAuth) → DB grants (PAT/headless
fallback). PAT clients now legitimately satisfy gates without the
OAuth round-trip, fixing the v8 limitation where every PAT-callable
admin endpoint needed require_role(Role.ADMIN) instead of
require_internal_role(...).
Backward-compat:
- require_role(Role.X) and require_admin become thin wrappers over
require_internal_role(f"core.{role}"). Implies hierarchy preserves the
legacy "at least this level" semantics automatically — no per-level
comparison code needed.
- src/rbac.py helpers (is_admin, has_role, get_user_role,
set_user_role, can_access_table, get_accessible_tables) all read from
the resolver via _get_internal_role_keys.
- UserRepository.create() and update() now mirror role changes into
user_role_grants via _grant_core_role helper. Preserves API while
making the new table the source of truth.
- UserRepository.delete() pre-deletes user_role_grants rows
(FK cascade — DuckDB doesn't auto-cascade).
- count_admins() reads user_role_grants ⨝ internal_roles instead of the
now-NULL users.role column.
First consumer:
- app/api/admin.py module-level docstring documents the v9 pattern for
future module authors. Existing require_role(Role.ADMIN) callsites
flow through the wrapper; no behavior change for OAuth callers, and
PAT callers gain access via direct grants.
Tests: full suite green (1396 passed, 6 skipped). Existing tests
exercise the new pathway transparently because UserRepository.create
auto-grants. New test_pat_caller_with_direct_grant_passes pins the
PAT-aware contract.
Schema: v9 (was v8). pyproject.toml + CHANGELOG bump deferred to the
final PR-prep commit.
* feat(auth): role management complete — REST API + CLI + UI + docs (v0.11.4)
Sjednocuje legacy users.role enum s v8 internal-roles foundation pod jeden
model s implies hierarchií, dodává admin UI + REST API + CLI pro správu
group mappings i přímých user grants, a dělá require_internal_role
PAT-aware tak, aby admin endpointy fungovaly uniformly napříč OAuth
i headless callery.
REST API (app/api/role_management.py, +496 LOC):
- 8 endpointů pod /api/admin: internal-roles list, group-mappings CRUD,
users/{id}/role-grants CRUD, users/{id}/effective-roles debug.
- Všechny gated require_internal_role("core.admin"). Audit-log na každé
mutaci (role_mapping.created/deleted, role_grant.created/deleted).
- Last-admin protection: refuse to delete the final core.admin grant
(mirrors users.py:count_admins protection).
- Nový UserRoleGrantsRepository v src/repositories/user_role_grants.py.
CLI (cli/commands/admin.py extension, +258 LOC):
- da admin role list / show <key>
- da admin mapping list / create <group-id> <role-key> / delete <id>
- da admin grant-role <email> <role-key>
- da admin revoke-role <email> <role-key>
- da admin effective-roles <email>
- Všechno přes typer + PAT auth, --json flag, response-shape tolerantní.
UI (admin_role_mapping.html + admin_user_detail.html + nav + user list):
- Nová stránka /admin/role-mapping: internal_roles read-only table +
group_mappings table with create/delete forms.
- Nová stránka /admin/users/{id}: core role single-select + capabilities
multi-checkbox + effective-roles debug (direct + group + expanded).
- Existing user list dostává "Detail" link na novou stránku.
- Nav link na /admin/role-mapping.
Tests: +85 nových testů přes 4 nové soubory:
- test_schema_v9_migration.py (8) — fresh install + v8→v9 backfill +
legacy column NULL semantics + unknown-role fallback + invariants.
- test_api_role_management.py (33) — všech 8 endpointů, happy + error
paths, audit-log assertions, last-admin protection.
- test_cli_admin_role.py (25 + 1 conditional) — typer subcommands,
text + json output, PAT integration smoke.
- test_admin_role_mapping_ui.py (9) + test_admin_user_capabilities_ui.py (10)
— page rendering, auth gating, form contracts, JS hooks.
Full suite: 1482 passed, 6 skipped (was 1396 → +86, žádné regrese).
Docs:
- docs/internal-roles.md kompletní rewrite — odstranil "no UI yet",
přidal hierarchy diagram, dual-path resolution, dotted-namespace
convention, admin workflow přes UI/CLI/REST, refresh semantics
for group mappings vs direct grants, migration notes.
- CLAUDE.md schema v8 → v9.
- CHANGELOG.md [0.11.4] s BREAKING marker pro users.role NULL
semantics + complete Added/Changed/Removed/Internal sekce.
- pyproject.toml: 0.11.3 → 0.11.4.
Sequencing: po mergi tohoto PR Pabu rebasuje pabu/local-dev (PR #72)
na main, jeho schema migrations se posouvají z v9/v10/v11 na v10/v11/v12.
Implementation breakdown:
- Sequential (já): foundation tasks — schema v9, resolver, PAT-aware
require_internal_role, backward-compat wrappers, rbac refactor,
UserRepository auto-grant.
- Parallel sub-agents (3 worktrees, ~10 min): REST API, CLI, UI.
- Sequential (já): integrace, docs/CHANGELOG/version, schema tests,
fullsuite verification.
* fix(auth): address Devin review on PR #73 — three regressions
Three concrete bugs caught in Devin's PR review, all fixed in this commit.
1. **users.role hydration on read** (the big one):
v8→v9 migration NULLs users.role for every existing user, but a long
tail of read sites still inspect user["role"] directly:
- app/web/templates/_app_header.html:15 — admin nav gate
- app/web/templates/_app_header.html:36-37 — role badge in dropdown
- app/web/router.py:319-321 — UserInfo.is_admin/is_analyst/is_privileged
- app/web/router.py:489 — corporate memory is_km_admin
- app/api/catalog.py:54 — admin "see all tables" bypass
- app/api/sync.py:215 — admin "see all sync states" bypass
Without a fix, every existing admin loses the entire admin nav (and
API admin bypasses) immediately after upgrade — a serious regression.
Fix: new helper _hydrate_legacy_role() in app/auth/dependencies.py
maps the highest-level core.* grant back into user["role"] as the
legacy enum string. Called from get_current_user() on both auth paths
(LOCAL_DEV_MODE + JWT/PAT). Idempotent — skips when role is already
populated. Net effect: every pre-v9 callsite keeps working transparently
for both OAuth and PAT callers, with one extra DB round-trip per
authenticated request (same cost as the existing PAT-aware
require_internal_role fallback).
3 regression tests in tests/test_schema_v9_migration.py:
- test_hydration_recovers_role_from_user_role_grants
- test_hydration_returns_highest_grant (multi-grant → highest wins)
- test_hydration_falls_back_to_viewer_when_no_grants (safe fallback)
2. **CLI effective-roles TypeError**:
API returns direct/group as List[Dict] (RoleGrantResponse-shaped),
but the CLI did ', '.join(direct) which raises TypeError on dicts.
Tests masked it because mocks used bare string lists. Replaced
raw .join() with a _names() helper that extracts role_key from
each item, falling back to str() for legacy mock shapes.
3. **UI template field-name mismatch**:
admin_user_detail.html JS reads data.groups but the API serializes
the field as group (singular, per EffectiveRolesResponse pydantic).
Currently benign because the API always returns group:[], but the
field would silently disappear once the group-derived view is wired
up. Added data.group as the primary lookup, kept the legacy aliases
for shape-drift tolerance.
Full suite: 1485 passed (was 1482, +3 hydration tests), 6 skipped, no
regressions.
* fix(auth): Devin review #2 + UX self-service + RBAC docs rename
Three threads landed in one commit because they share the same
auth/role surface and CHANGELOG entry.
Devin review #73 second round (2 actionable findings):
- _hydrate_legacy_role no longer short-circuits on truthy users.role.
The role-management endpoints (POST/DELETE /api/admin/users/{id}/
role-grants + the changeCoreRole UI flow) only mutate
user_role_grants — they don't update the legacy column. The early
return trusted that stale value, so a user downgraded via the new
REST/UI kept role="admin" in their dict on subsequent requests,
which fooled _is_admin_user_dict (src/rbac.py) and the catalog/sync
admin-bypass short-circuits into retaining elevated table access
even though require_internal_role correctly denied the API gates.
Always re-resolves now, making user_role_grants the single source
of truth on every authenticated request. Cost: one DB round-trip
per request — same as the existing PAT-aware fallback. Pinned by
test_hydration_ignores_stale_legacy_role_after_grant_revoke.
- Dev-bypass (app/auth/dependencies.py) and OAuth callback
(app/auth/providers/google.py) now pass user_id to
resolve_internal_roles so direct grants land in
session["internal_roles"] alongside group-mapped roles. Pre-fix,
every admin-gated request fell through to the per-request DB
fallback inside require_internal_role and the dev-bypass log line
read "resolved 0 internal role(s)" for an obviously-admin user.
test_session_internal_roles_populated updated to assert union.
User-visible UX (also addresses local-test feedback):
- HTTP 500 on /admin/users post-v8→v9 migration — UserResponse.role
is required str, but legacy users.role was NULL-ed by the
migration. _to_response in app/api/users.py now routes every dict
through _hydrate_legacy_role; same fix lifts the silent no-op of
last-admin protection in update_user/delete_user (the role-equality
short-circuits would skip the count_admins guard for migrated
admins). Three regression tests under TestAPIUsersPostMigration.
- /profile is now a real self-service detail page for *every*
signed-in user (not just admins). Three new server-side sections:
Effective roles (resolver output as chip cloud), Direct grants
(rows in user_role_grants with source label), Roles via groups
(which Cloud Identity / dev group grants which role for the
current user). Non-admins finally see *why* a feature is or isn't
accessible. Admins additionally see a deep-link to
/admin/users/{id} for editing their own grants.
- /admin/role-mapping group-id picker. New "Known groups" panel
above the create form: clickable chips for the calling admin's
own session.google_groups (tagged "your group") merged with
external_group_ids already used in existing mappings (tagged
"already mapped"). Click a chip → fills the form. Empty-state
copy points operators at LOCAL_DEV_GROUPS / Google sign-in
instead of leaving them to guess Cloud Identity opaque IDs from
memory.
Operational fixes:
- Scheduler log-noise: every cron tick produced a
POST /auth/token 401 because the auto-fetch fallback called the
endpoint with just an email (no password) and silently fell
through. Removed the broken path entirely. Operators set
SCHEDULER_API_TOKEN (long-lived PAT) in production; in
LOCAL_DEV_MODE the dev-bypass auto-authenticates the un-tokenized
request, so jobs continue to work.
Docs:
- docs/internal-roles.md → docs/RBAC.md (git mv preserves history).
Standard industry term, more discoverable for engineers grepping
for RBAC in a new repo. Restructured: Quickstart-by-role
(operator / end-user / module author), step-by-step
Module-author workflow with code examples (register key, gate
endpoint, declare implies, write contract test), naming pitfalls,
refresh semantics. CLAUDE.md gets a new
"Extensibility → RBAC" section pointing contributors at the doc
before they add gated endpoints. Cross-refs in app/api/admin.py
+ tests/test_role_resolver.py updated.
Tests: 293 in the auth/role/scheduler/UI test set passed, 0 regressions.
* fix(auth): Devin review #3 — login flows + RBAC docs
Two new findings on commit 7d1c048, both real and addressed.
Finding 1 (BUG, HTTP 500): every auth login flow loaded users via
UserRepository.get_by_email and passed user["role"] straight to
create_access_token, Pydantic response models, and _set_login_cookie
without going through _hydrate_legacy_role. Post-v9 the legacy column
is NULL for migrated users, and TokenResponse.role is a required str —
so POST /auth/token raised ValidationError → HTTP 500 for any v8-admin
trying to log in via password. Same root cause produced non-crashing
but semantically wrong JWTs (role: null) from Google OAuth, password
web flows, and email magic-link verification.
Fix: hydrate inline in every login flow before reading user["role"]:
- app/auth/router.py — POST /auth/token (the crash site)
- app/auth/providers/google.py — OAuth callback (was just stale JWT)
- app/auth/providers/password.py — 5 flows: JSON login, web login,
JSON setup, web reset confirm, web setup confirm
- app/auth/providers/email.py — centralized in _consume_token,
covers both /verify endpoints
New regression class TestAuthLoginFlowsPostMigration pins both the
no-crash and the correct-role contracts for all four legacy levels
(viewer/analyst/km_admin/admin) on POST /auth/token.
Finding 2 (DOCS): docs/RBAC.md showed register_internal_role() being
called with implies=[...], but the function signature is (key, *,
display_name, description, owner_module). A module author copying the
example would TypeError at import time. The implies field on
internal_roles IS honored at runtime by expand_implies, but the
registry-side write path (register_internal_role + InternalRoleSpec +
sync_registered_roles_to_db) doesn't exist yet — implies is currently
seeded only for the core.* hierarchy via _seed_core_roles in src/db.py.
Rewrote the Implies hierarchy and Module-author workflow sections to
document what's actually supported in 0.11.4 and what a future change
would need to add. The "for cross-module hierarchies, register each
level + grant both" pattern works today.
Tests: 322 in the auth/role/scheduler/UI/password test set passed,
0 regressions.
* fix(db): _seed_core_roles actually runs on every connect (Devin review #4)
Devin flagged that the docstring on `_seed_core_roles` promised per-connect
execution as a safety net for accidental DELETEs and in-code seed changes,
but the only call sites lived inside `if current < SCHEMA_VERSION:` — so
once a DB was on v9 the function never ran again, and the docstring lied.
Picked option (b) from the review (actually call it on every startup) over
option (a) (fix the docstring) because the safety net is genuinely useful:
- recovery from accidental admin DELETE on internal_roles,
- in-code _CORE_ROLES_SEED tweaks (display_name/description/implies)
ship without a manual SQL deploy,
- fresh installs and migrations stop needing their own seed call sites.
Tail call gated by `get_schema_version(conn) <= SCHEMA_VERSION` so the
future-version-is-noop rollback contract still holds — a v9 binary won't
touch a DB that's been upgraded past v9.
Test coverage: new TestSeedCoreRolesSafetyNet class (3 tests) pins the
three contracts — deleted row re-seeds, mutated display_name re-syncs
from in-code seed, applied_at on schema_version doesn't churn on
already-current DBs. Existing TestMigrationSafety::test_future_version_is_noop
still passes (verified against the gating logic).
* feat(auth): internal roles + external→internal group mapping (foundation)
Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.
Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
— app-defined capabilities like 'context_admin'. Modules self-register at
import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
— admin-managed bindings, UNIQUE(external_group_id, internal_role_id).
app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
Module-author entry point. lower_snake_case key, immutable, validated.
Same key + same fields = no-op (re-import safe); same key + different
fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
Sorted, deduplicated role-key list. Plugged into google_callback +
dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
session.internal_roles; 403 with explicit message when missing.
Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.
21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)
* fix(auth): polish review feedback — first-request dev populate + PAT doc
Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:
- Dev-bypass populates session["internal_roles"] on the first request
after sign-in, not just when external groups change. The previous
guard only resolved when groups_changed=True, which left a hole for
the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
current=None, neither write branch fires, internal_roles stays
unset, and require_internal_role then 403s with no roles to check
against. The OAuth callback writes session["internal_roles"]
unconditionally on sign-in (even []); dev-bypass now matches that
semantics. Adds a single-pass populate gated on the key being
absent from the session, so subsequent same-state requests still
no-op (cheap session lookup, no resolver call).
- Document that internal roles are session-scoped and PAT/headless
clients will get 403 from any require_internal_role(...) endpoint.
Same constraint already applies to session.google_groups (PAT JWTs
deliberately don't snapshot group memberships — they could change
after issuance with no way to re-sign), but the doc didn't surface
this — an operator pointing a CLI at a role-gated endpoint would
see 403 with no clue why. New "PAT and headless requests" section
spells out the constraint, the rationale, and the three escape
valves (use users.role for the gate; route through OAuth; wait for
the planned `da admin grant-role` CLI helper).
54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).
* release(0.11.3): cut release for the internal-roles foundation
Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.
The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.
* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening
Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.
User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
from "signed in but missing role" and surfaces a PAT-specific 403 detail
in the first case ("This endpoint needs an interactive (OAuth) session
— Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
"force re-resolve now" lever for users that can't be made to log out.
Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
stray import path in production can't drop the registered capabilities.
Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
contract after a mid-session mapping revoke (pin the documented
limitation), PAT 403 detail wording, OAuth pipeline data flow from
external groups to internal_roles, and the dev-bypass empty-list
fallback when the resolver raises.
CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.
---------
Co-authored-by: Claude <noreply@anthropic.com>
* feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS
LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups
empty, so group-aware UI/code paths can't be exercised on localhost without
a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array
matching the production {id, name} shape) populates the session on every
dev-bypass request — same structure the OAuth callback writes, so mock and
prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise
on PAT/CLI requests; malformed input falls back to [] with a WARNING so
the dev mock never breaks the dev flow.
* refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate
Three small follow-ups on the same dev-mock vector before merge:
- Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs
in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot
instead of silently logging on the first authenticated request, where
it's easy to miss.
- Cache the parsed result single-slot, keyed by the raw env-string. Avoids
re-parsing JSON on every authenticated request without test-isolation
surprises — when the env value changes, the key changes and the cache
transparently rebuilds.
- Stop mutating the parsed-input dicts (item.setdefault → spread-merge)
so the cached list stays a fresh value on every rebuild.
- Replace the try/except guard around request.session with hasattr —
SessionMiddleware is always registered, the silent except was paranoid.
Tests grow by a direct session-cookie inspection (decoupled from the
profile template) and three startup-banner log assertions.
* fix(auth): drop fragile session-decoder test + actually skip empty-target write
Two follow-ups on the LOCAL_DEV_GROUPS feature before merge:
- Drop test_session_holds_mocked_groups_directly. It manually decoded the
signed session cookie via TimestampSigner + base64, hardcoding both the
Starlette session-cookie format and the 14-day max_age. Starlette has
changed its session encoding before (URLSafeTimedSerializer pre-0.20)
and would do so again silently — the test would fail with a cryptic
BadSignature, not a clear "mock is broken" signal. The remaining
test_dev_user_sees_mocked_groups_on_profile already covers the same
observable signal (mocked groups in /profile body) without coupling to
Starlette internals.
- Actually skip the session write when target_groups is empty. The previous
comment claimed compare-then-write avoided spurious Set-Cookie noise on
PAT/CLI requests, but on those requests session.get("google_groups") is
None and target is [], so None != [] always evaluates True and the write
fired anyway, marking the session dirty and re-issuing Set-Cookie on
every request. Adding `target_groups and ...` to the guard makes the
comment honest: empty mock now genuinely no-ops, stable browser sessions
still skip via value-equality, and the only remaining write is the one
that actually changes state.
33 auth tests still pass locally.
* fix(auth): match production's always-write semantics for stale dev groups
Devin code-review finding on PR #70: my earlier `target_groups and ...`
short-circuit silently diverged from the production OAuth callback. In
app/auth/providers/google.py:189-194 the callback always writes
session.google_groups on each login — including [] on failure or empty
token — so the session always reflects authoritative current state. The
mock should match.
Failure mode the previous guard left open: a developer sets
LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed
cookie, then the developer unsets the env var and reloads. target → [],
session.get → [{...}], `if target_groups and ...` is False, no write,
stale groups stay in the browser session indefinitely. Mock now lies
about state until logout.
Fix splits the guard:
- target_groups truthy + value-changed → write the new mock (existing path)
- target_groups falsy + non-empty stored → write [] to clear stale state
- otherwise no-op (target [] + stored None/[]: no transition to record)
PAT/CLI requests with no prior session still take the no-op path
(target=[], session.get → None which is falsy), so the original goal of
suppressing spurious Set-Cookie noise on token traffic is preserved.
Tests already cover the populated and unset paths; the new clear-stale
branch is correct by construction (production has the same shape) and
the rare manual reset workflow.
* release(0.11.2): default mocked groups in make local-dev + docs/local-development.md
Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience
follow-up: every `make local-dev` now boots with two sensible default
mocked groups (Local Dev Engineers + Local Dev Admins on example.com),
so /profile and group-aware code paths render something realistic
without the operator having to discover and set LOCAL_DEV_GROUPS.
Layered so the default lives in the workflow, not the contract:
- scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":="
syntax — only sets the var when the operator hasn't already.
Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable:
LOCAL_DEV_GROUPS= make local-dev.
- docker-compose.local-dev.yml swaps the commented JSON example for
a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the
shell, the compose file just propagates it. Operators running
`docker compose up` directly without the wrapper script get an
empty mock (correct: they didn't opt into the make-driven defaults).
- Makefile help line mentions the mocked groups so the behavior is
visible without grepping.
New docs/local-development.md consolidates dev-onboarding instructions
that were previously scattered across docker-compose.local-dev.yml
inline comments, docs/auth-groups.md "Local-dev mock" section, the
Makefile help text, and CLAUDE.md "First-Time Setup". Single page now
covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking
controls + verification, what is *not* mocked (Cloud Identity, real
OAuth, admin Workspace permissions), and the safety rails that keep
the dev shortcuts off production.
Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts
[Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased]
skeleton.
* fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion
Reported by an operator running `make local-dev` against the freshly
released 0.11.2 — the LOCAL_DEV_MODE banner showed:
LOCAL_DEV_GROUPS is not valid JSON, ignoring:
Expecting ',' delimiter: line 1 column 70 (char 69)
LOCAL_DEV_GROUPS is set but produced no valid groups —
check the WARNING above for the parse error.
Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter
expansion. Bash matches `}` to close the expansion at the *first* `}`
encountered in the body, regardless of context — even one inside a
nested JSON object literal. The two-element JSON array was therefore
truncated to the first group's closing brace, leaving an unparseable
fragment:
[{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers"
There is no escaping syntax for `}` inside parameter expansion (the
backslash escapes I had only escaped the quotes — `}` reaches bash
literally). Fix: hold the default in a single-quoted variable and
reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`.
The variable's value is opaque to the expansion — no `}` matching
inside it — so the JSON survives intact. Verified with `python -m json`:
parsed OK: 2 groups: ['local-dev-engineers@example.com',
'local-dev-admins@example.com']
Operators on a running 0.11.2 stack: `make local-dev-down && make
local-dev` to pick up the corrected default.
* fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link
Two follow-ups from a Devin code-review pass on PR #70:
- run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to
${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form
substitutes the default when the variable is unset OR set-but-empty,
silently overwriting the documented disable knob. Three places
promise this works — docs/local-development.md, the CHANGELOG entry,
and the script's own comment — so the bug was an operator-facing
lie, not just an implementation detail. The bare - form only
substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now
reaches the Python parser as "" and short-circuits to []. Verified
with both empty and unset shells.
- CHANGELOG.md: add the [0.11.2] link reference at the bottom.
Keep-a-Changelog convention is to mirror every version heading
with a release-tag link in the footer; the 0.11.2 heading was
missing its counterpart, breaking the Markdown link rendering on
GitHub.
---------
Co-authored-by: Claude <noreply@anthropic.com>
Patch release containing the two follow-up changes from 0.11.0:
- Caddy CADDY_TLS env passthrough in docker-compose.yml (#55) — should
have shipped with #52 but the first PR got accidentally closed before
merge. Without this fix Caddy ignores .env CADDY_TLS and crash-loops
on any LE / internal-CA deployment.
- CLAUDE.md changelog discipline (#59) — every PR touching user-visible
behavior must update CHANGELOG.md under [Unreleased] in the same PR.
The discipline rule itself caused this release to exist: writing the
[Unreleased] entry made the missed fix obvious, which is exactly the
feedback loop the rule is supposed to create.
The version = "2.x" strings in earlier pyproject.toml snapshots were
arbitrary placeholders from the initial scaffold (cookiecutter default),
not a reflection of API maturity. Resetting to 0.11.0 to signal pre-1.0
status: public surface (CLI flags, REST endpoints, instance.yaml schema,
extract.duckdb contract) may still shift between minor versions.
CalVer image tags (stable-YYYY.MM.N, dev-YYYY.MM.N) continue from CI;
semver tags (v0.X.Y) are cut at release boundaries and reference the
same commit as a stable-* tag from the same day.
CHANGELOG.md replaces the old CalVer draft format with Keep a Changelog
+ semver. The 0.11.0 entry curates everything currently in main:
- Auth: Workspace groups, password reset, PAT, magic-link, seed admin pwd
- Deploy: keboola-deploy workflow, Caddy/LE/cert-file TLS, dev_instances
TLS, optional Google OAuth from SM, LOCAL_DEV_MODE, /setup wizard
- CLI: wheel distribution, auto-update, --version, --dry-run, gzip
- Data: remote query (BQ+DuckDB), business metrics, OpenAPI snapshot test
- Security: padak-security.md audit batch + urllib3 + argon2-cffi
- Two BREAKING items called out (Caddy profile rename, Caddyfile default
cert mode flipped to cert-file)
* fix(cli): versioned wheel URL in setup instructions; drop broken /cli/agnes.whl alias (#36)
* fix(cli): inline PEP 427 wheel filename in setup instructions
`uv tool install <server>/cli/agnes.whl` fails with
error: The wheel filename "agnes.whl" is invalid: Must have a version
because uv validates the filename in the URL path *before* fetching — so
the server-side Content-Disposition header (which has the real versioned
filename) is never consulted, and an HTTP redirect does not help either:
uv resolves the filename from the initial URL.
Fix the root cause by inlining the real PEP 427 filename into the setup
snippet the dashboard copies to the clipboard. The wheel filename is
resolved server-side via `_find_wheel()` and substituted into the lines
returned from `setup_instructions.resolve_lines()`, so both the read-only
HTML preview and the JS clipboard renderer get byte-identical output.
Also added `/cli/wheel/{filename}` to serve wheels at their PEP 427 path,
and kept `/cli/agnes.whl` as a 302 redirect for manual/legacy callers —
though that redirect alone is NOT sufficient for `uv tool install` (uv
validates before following redirects) and is there only as defense-in-depth.
Verified locally:
- `uv tool install <server>/cli/wheel/agnes_the_ai_analyst-2.0.0-py3-none-any.whl` succeeds
- `/install` HTML now renders the versioned URL; `/cli/agnes.whl` no longer appears in the rendered snippet
* fix(cli): remove /cli/agnes.whl alias entirely — it only confused users
The bareword alias was never actually usable:
- `uv tool install <server>/cli/agnes.whl` fails at filename validation
before any HTTP fetch, so neither the Content-Disposition header nor a
302 redirect rescued it.
- The 302-to-versioned-path fallback left a visibly "working" URL in
browser / curl -L contexts, which is exactly how the original bug got
reported in the first place ("the URL loads, why doesn't install work?").
Remove the endpoint and scrub all remaining references. The only CLI wheel
URL is now `/cli/wheel/{filename}` with the real PEP 427 filename, which
the setup-instructions template already generates server-side.
Existing tests that referenced /cli/agnes.whl become negative tests
("must not appear") so we don't regress.
* feat(cli): --version flag; sync --dry-run + progress indicator (#38)
* feat(cli): add --version / -V flag
Prints `da <version>` from package metadata (importlib.metadata). Falls
back to "unknown" when the package is not installed (e.g. running from a
source checkout without `uv pip install -e .`), instead of crashing.
Eager typer callback, so `da --version` exits before subcommand
resolution and does not require any auth/config.
* feat(cli): da sync --dry-run + X/N progress indicator
--dry-run reports what would be downloaded/uploaded without hitting the
API or writing local state. Supports the full flag set (--table, --json,
--upload-only); JSON shape is {"dry_run": true, "would_download": [...],
"summary": {...}}.
Progress bar now shows "[X/N] Downloading <table>..." with a Rich
BarColumn + TaskProgressColumn + TimeElapsedColumn instead of a bare
spinner — makes long syncs visible.
* feat(cli): durable sync + server gzip + auto-update check (#41)
* fix(sync): atomic writes + manifest hash verification + retry on transient errors
Three durability hooks around stream_download and the sync command:
1. Atomic writes. stream_download now streams into `<target>.tmp` and
calls os.replace() on success, so the real target file never exists
in a half-written state. On failure the tmp is unlinked — no cleanup
leftovers, no guard needed at read time.
2. Retry with backoff. Transient errors (ConnectError, ReadError,
WriteError, RemoteProtocolError, TimeoutException, 5xx) are retried
up to 3× with 0.3s / 1s / 3s backoff. 4xx (auth, 404) surfaces
immediately — retrying those is pointless.
3. Manifest-hash verification. After download, sync.py computes MD5 of
the target (same 8KiB chunking as app/api/sync.py:_file_hash) and
compares against `server_tables[tid]["hash"]`. Mismatch ⇒ unlink,
record error, skip state commit. The PAR1 structural check survives
as a fallback for legacy manifests without a hash.
Also makes _rebuild_duckdb_views tolerant: single broken parquet is
skipped with a stderr warning instead of killing the whole rebuild.
Supersedes #40 — this commit is a strict super-set (hash check + PAR1
fallback + atomic write + retry). #40 can be closed without merging.
* perf(server): enable GZipMiddleware for JSON / HTML responses
GZipMiddleware at minimum_size=1024 shaves bandwidth on manifest-style
JSON endpoints (/api/sync/manifest, /api/version, …) and the /install
HTML preview. Parquet file downloads are already columnar-compressed so
the middleware sees limited benefit there — but it doesn't hurt, httpx
on the client side decompresses transparently.
Placed after session middleware so gzip wraps the session-Set-Cookie
response too, and before CORSMiddleware so compression is applied to
both cross-origin and same-origin responses.
* feat(cli): auto-check for newer CLI version on startup
Server side
- GET /cli/latest returns {version, wheel_filename, download_url_path}
for whatever wheel is currently in AGNES_CLI_DIST_DIR. Public,
cacheable, no secrets — consumed by the CLI auto-update probe.
Client side
- New cli/update_check.py: reads /cli/latest with a 3s timeout, caches
the result in $DA_CONFIG_DIR/update_check.json for 24h. Cache is
invalidated when the installed version changes (e.g. after a fresh
`uv tool install`) so stale "you're behind" warnings don't linger.
- Root typer callback fires the probe before subcommand dispatch; any
failure is swallowed so a bad network never blocks a working command.
- Outdated → one-line stderr warning:
[update] da 2.0.0 is out of date — latest on this server is 2.1.0.
Upgrade: uv tool install --force <server>/cli/wheel/<…>.whl
- Disable with DA_NO_UPDATE_CHECK=1.
* fix(pr-review): None-guard the upgrade line + skip gzip on parquet paths
Two follow-ups from Devin review on #41.
1. format_outdated_notice(UpdateInfo(download_url=None)) emitted literal
"uv tool install --force None" — copy-pasting that fails. Drop the
upgrade snippet when the URL is absent and keep only the version line.
2. GZipMiddleware compressed everything over 1024 bytes, including the
parquet FileResponses served by /api/data/{tid}/download,
/cli/wheel/{name}, and /cli/download. Parquet is already columnar-
compressed — gzip there is pure CPU + latency with no size win, and
/api/data bodies can reach hundreds of MB. Wrap GZipMiddleware in a
small _SelectiveGZipMiddleware that skips those path prefixes and
delegates the rest to the stock middleware. JSON / HTML endpoints
(manifest, /install, /api/version, …) still get compressed.
* release: bump to 2.1.0 — unify AGNES_VERSION with pyproject.toml version (#42)
Before: two independent version systems. pyproject.toml carried semver
(2.0.0 → wheel filename → `da --version`) while release.yml injected
CalVer into AGNES_VERSION (e.g. 2026.04.155 → /api/version). Users saw
different strings in the CLI vs. the /install page, and the CLI auto-
update check couldn't tell "new deploy, same package version" apart
from "new package version".
Make pyproject.toml [project].version the single product-version source
of truth. release.yml extracts it and feeds AGNES_VERSION, so every
surface (/api/version, /api/health, /cli/latest, `da --version`) agrees
on one number. The CalVer tag keeps doing what CalVer is for: release
identity on the git tag and Docker image tag (versioned_tag).
Also wires AGNES_TAG through the build: release.yml → Dockerfile ARG →
env, so /api/version.image_tag finally reports the actual image tag
instead of the "unknown" fallback.
Bump to 2.1.0 to reflect the PRs shipped on ps/wheel-name-fix: durable
sync (atomic writes + manifest MD5 + retry), server GZip, CLI auto-
update probe, setup snippet PEP 427 URL.
* fix(pr-review): directional version compare in is_outdated()
UpdateInfo.is_outdated() used `self.latest != self.installed`, which
fires in both directions. If the server is rolled back or the user
connects to an older deployment, the CLI would warn "out of date"
and — worse — the formatted notice would prompt
uv tool install --force <older-version>.whl
i.e. an unintended downgrade.
Compare with packaging.version.Version (PEP 440 aware, handles pre-
release tags). Fall back to dotted-int tuple compare if packaging is
somehow missing, and return False on unparseable strings — better to
miss an upgrade hint than to silently suggest a downgrade.
Adds 4 test cases: installed older (True), installed newer (False),
10.0.0 vs 2.1.0 lexical-compare trap (correct), unparseable strings
(False).
Addresses Devin review on #43.
* fix(pr-review): read FastAPI app version from package metadata
app/main.py:80 hardcoded `version="2.0.0"` in the FastAPI constructor.
After #42 bumped pyproject.toml to 2.1.0, /api/version, /cli/latest,
and `da --version` all reported 2.1.0 while /openapi.json and the
/docs UI still advertised 2.0.0.
Read `agnes-the-ai-analyst` version via importlib.metadata (same
pattern cli/main.py:_cli_version already uses), with a `"dev"`
fallback when the package is not installed (source checkout). This
way pyproject.toml stays the single source of truth across every
version surface — /openapi.json now tracks the bump automatically.
Adds a dedicated test file to pin this behavior so a future
regression to a hardcoded literal fails at CI.
Addresses second Devin finding on #43.
* fix(pr-review): _fmt_bytes PiB label + negative cache in update_check
Two more follow-ups from Devin review on #43.
1. _fmt_bytes off-by-unit. The old loop exited at TiB but the fallback
labelled PiB, so 1 PiB rendered as "1024.0 PiB". Restructure: put
every unit inside the loop (KiB through EiB) so the division count
always matches the label. Covers up to 1 ZiB cleanly; anything
beyond renders as "<big>.0 EiB" rather than crashing.
2. Negative cache for failed /cli/latest probes. On a corporate
firewall / VPN that silently drops packets, the 3s HTTP timeout
fired on *every* `da` invocation. Writing a `latest=None` cache
entry with a 5-minute TTL caps that at one probe per 5min. Successful
probes still use the 24h TTL. Reading logic branches on whether the
cached `latest` is None.
Adds TestFmtBytes (2 cases: small/medium sizes and the PiB/EiB fallback
regression), plus two TestSync update-check cases covering negative-
cache reuse and TTL expiry.
Removed kbcstorage from all dependency groups (optional + dev) so urllib3
is no longer pinned to <2.0. Legacy Keboola client is available via
manual install: pip install kbcstorage
kbcstorage pins urllib3<2.0.0 which blocks Dependabot security patches.
Moved to [project.optional-dependencies] keboola-legacy since the primary
extraction path uses the DuckDB Keboola extension, not kbcstorage.
Legacy fallback uses lazy import — app works without it installed.
- All dependencies now in pyproject.toml [project.dependencies]
- Dev/test deps in [project.optional-dependencies] dev and [tool.uv]
- Dockerfile uses uv pip install . from pyproject.toml
- CI uses uv pip install ".[dev]"
- Deleted requirements.txt and requirements-dev.txt
- Updated README, CLAUDE.md install instructions
- Enhanced .dockerignore (exclude tests, docs, infra from image)