* feat(home+news): state-aware /home + /news + admin-edited news section
Squash of the vr/home-page feature work for clean rebase onto main.
Original 18-commit history preserved in branch backup/vr-home-page-pre-rebase.
What's in this PR:
**State-aware /home page**
- New `/home` route with hero + auto-mode + connectors (Asana / GWS /
Atlassian) + lookarounds. Onboarded vs not-onboarded state-machine
branches a single template (`home_not_onboarded.html`); the install
steps, "Setup a new Claude Code" CTA (90-day PAT mint), and per-
connector setup prompts hide once `users.onboarded=TRUE`. A
completion badge replaces them.
- "Mark me as offboarded" button reverses the flag without an SQL UPDATE.
- `users.onboarded BOOLEAN` column added; default FALSE; flipped by the
CLI's `agnes init` post-success POST and the `/admin/users` API.
- Connector setup prompts pre-check whether the tool is already
installed/connected before re-running setup.
- GWS scope set widened to include Google Chat (`chat.spaces`,
`chat.messages`).
**Single template + design tokens**
- `dashboard.html` now extends `base.html` via the new
`{% block layout %}` opt-out (full-width pages skip the 800px
`.container`). Net: every page shares one shell.
- `style-custom.css` `:root` extended with `--space-{7,9,10,12}`,
`--radius-2xl`, `--shadow-{card,elevated}`, `--text-{muted,disabled}`,
`--focus-ring`, `--transition-*`, `--width-{narrow,app,wide}` so
inline page styles can migrate incrementally.
**Auth redirects honor AGNES_HOME_ROUTE**
- `safe_next_path` resolves the configured home route when no `default=`
is passed; OAuth callbacks, magic-link clicks, password form, and
LOCAL_DEV_MODE shortcuts now land on `/home` (or whatever the operator
picked) instead of always /dashboard.
**News section + /news permalink + /admin/news editor**
- Schema-bumped `news_template` table (single versioned entity, draft +
publish gate). `published BOOLEAN` distinguishes draft from public;
monotonically-increasing `version` per save; rows >30d pruned on
save except the currently-displayed published version.
- `/home` bottom-of-page renders the latest published intro with a
"Read more →" link to `/news` (which renders the full body).
- `/admin/news` editor with sandboxed live preview, versions table,
per-row Unpublish, Format-help cheatsheet.
- `agnes admin news show / draft / edit / publish / unpublish /
versions / export` (CLI). Talks to the live server via the
`/api/admin/news/*` endpoints (PAT-authed) — no direct DB access
so it coexists with a running uvicorn.
- **Optimistic-lock guard**: `agnes admin news publish --version N` and
PUT/PATCH endpoints accept `expected_version` and 409 with structured
`{error: "version_conflict", expected, actual, actual_by}` when a
concurrent admin replaced the draft. Edit refuses to overwrite a
draft authored by someone else without `--force` or
`--expect-version`.
- nh3 (Rust-backed ammonia) HTML sanitizer; iframe pre-pass strips
any iframe whose src is not on the YouTube/Vimeo/Loom allowlist;
javascript:/data: schemes blocked everywhere.
- Author CSS vocabulary: `.news-hero` (blue gradient hero block),
`.callout`/`.callout-{info,warn,success,danger}`,
`.video-embed`, `.news-section`, `.news-grid-{2,3}`, `.news-cta` —
all consolidated in `style-custom.css` under "News content
vocabulary (shared)" so /home perex, /news body, and /admin/news
preview share one source of styling.
- Code-inside-`<pre>` contrast fix (was unreadable amber-on-silver).
- `.news-content` table styling (border, header band, row-hover).
**`scripts/dev/run-local.sh`** — local uvicorn launcher. Pulls Google
OAuth client id/secret from GCP Secret Manager
(`AGNES_OAUTH_GCP_PROJECT`-driven, no vendor defaults), points
`AGNES_CLI_DIST_DIR` at `./dist` so the wheel endpoint resolves, and
`--dev` flips `LOCAL_DEV_MODE=1` + `AGNES_HOME_ROUTE=/home` for one-
command iteration. `LOCAL_DEV_MODE=1` also enables the FastAPI debug
toolbar.
**CLAUDE.md "Run tests before every push" section** codifies
`pytest tests/ -n auto -q` as non-negotiable before each push.
**Tests**: 51 + 14 + 8 = 73 new tests across news-template repo,
sanitizer, API, web, CLI; plus updated home/auth/template tests for
the new shared-shell architecture.
Origin docs (gitignored, customer-fork content):
docs/brainstorms/home-page-requirements.md,
docs/plans/2026-05-07-001-feat-home-page-plan.md.
* feat(cli): agnes onboarded {on,off,status} — self-scoped flag toggle
User-facing equivalent of the in-page "Mark me as (off)boarded" button
on /home. POSTs /api/me/onboarded with {onboarded, source}; --source
overrides the audit-log marker so flips made from the CLI vs the web
button vs agnes init automation stay distinguishable.
`status` reads via /api/me/profile (when present); falls back to a
quick body-marker scan of /home so the read path doesn't write an
audit_log row. PAT-authed via cli.client.api_post — same convention
as agnes admin news / agnes admin add-user etc.
Tests: 5 covering on/off/status round-trip, idempotency, and
audit-log source recording. Full suite holds at 12 pre-existing
failures (same set as before).
* ui(nav+home): primary nav reorg + green What's new band + /marketplace link fix
Primary nav (post-rebase audit + per-user feedback):
- Items: Home → Marketplace → Data Packages → Memory. Admin dropdown
for admins only. The "Dashboard" label was renamed Home — point still
resolves through `home_route` so customer instances on /dashboard
still land there.
- Activity Center moved into the Admin dropdown. Per-team adoption
analytics is admin-consumed in practice; the route still allows
any authed user for direct deep-links so existing /home tile +
bookmarks keep working.
- Memory link added (→ /corporate-memory) — was previously buried in
the /home "Look around" tiles.
- Setup local agent + My Stack dropped from main nav. Setup is the
/home install flow's home now; My Stack lives as a tab inside
/marketplace.
/home tweaks:
- Plugin marketplace tile now points at /marketplace (was /store —
legacy from before the marketplace rebrand landed in #230).
- "What's new" section header gets a green band (success-flavored
D1FAE5 background, A7F3D0 border, darker green title) so the
bottom-of-page news block visibly distinguishes from the blue
install-hero at the top. Header strip only — body stays white.
Test fix: test_home_route_resolution renamed `dashboard_link_uses_home_route`
→ `home_link_uses_home_route` and asserts `href="/home">Home` instead
of `href="/home">Dashboard` after the label change.
* fix(home): decouple Step 3 + Connect-tools collapse from server onboarded flag
The server-side `users.onboarded` flip happens through two paths:
1. Explicit user click on "Mark me as onboarded" or `agnes onboarded on`.
2. Implicit `agnes init` POST → /api/me/onboarded on success.
Path 2 produced a UX surprise: an analyst running `agnes init` mid-flow
reloaded /home and saw Step 3 (auto-mode) + Connect-your-tools auto-
collapse to summary bars. They were actively working through those
sections — the install POST never signalled "I'm done with the rest
of setup", just "Agnes itself is installed".
Decouple the section-collapse decision from the server flag:
- Step 1 + Step 2 install blocks: still hidden on `onboarded=TRUE`
(their completion is a hard server signal — Agnes IS installed).
- Step 3 + Connect-your-tools: render flat by default in BOTH states.
Wrapped in `<details class="setup-collapsible" open>` so the
browser's native disclosure handles per-section toggle without JS,
but the `<summary>` is CSS-hidden until the page-level
`data-setup-minimized="1"` attribute is set on `.home-mock`.
- New "Minimize setup view" toggle inside the blue install-hero,
rendered only when onboarded. Click flips the data-attr on
`.home-mock` AND removes the `open` attribute from each
`<details>`. State persists in `localStorage["agnes_home_setup_minimized"]`
so the choice survives reloads but is per-device.
- "Show full setup view" (the same button when minimized) re-opens
both `<details>` and clears localStorage.
When minimized, each `<details>` still has its own native expand/
collapse — click the gray summary bar to peek at one section without
toggling the page-level minimize off.
Tests:
- test_step3_and_connectors_render_flat_when_onboarded_by_default —
asserts `<details class="setup-collapsible" ... open>` for both
sections post-onboarding and the absence of any server-rendered
`data-setup-minimized` attribute on the `.home-mock` root.
- test_minimize_toggle_visible_only_when_onboarded — toggle button
rendered only when onboarded.
Full pytest holds at 12 pre-existing failures (same set).
* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three endpoints:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
Adds end-to-end flow for installing and keeping the per-user filtered
Claude Code marketplace in sync with the user's Agnes stack
(admin RBAC grants \ MyAIStack opt-outs U /store installs).
Setup (one-liner in install prompt step 5):
`agnes refresh-marketplace --bootstrap` clones the per-user marketplace
bare repo to ~/.agnes/marketplace, strips PAT from the cloned origin
URL, registers the local path with Claude Code, and installs every
plugin in the served manifest at --scope project. Replaces a 15-line
inline shell sequence that tripped Claude Code's agent-driven `rm -rf`
permission gate.
Auto-refresh (SessionStart hook installed by `agnes init`):
`agnes refresh-marketplace --quiet` runs every Claude Code session,
fetches+resets the clone (server rebuilds as orphan commits, so
pull --ff-only is impossible), and version-aware reconciles:
- missing in workspace -> claude plugin install <name>@agnes --scope project
- version differs -> claude plugin update <name>@agnes
- matches -> skip
Don't auto-uninstall plugins that disappeared from the manifest --
a transient empty manifest from the server would wipe the stack.
Hook output: when --quiet AND something actually changed, emits Claude
Code hook JSON on stdout -- `systemMessage` (transient toast) and
`hookSpecificOutput.additionalContext` (model-side system reminder),
both carrying the change summary plus a "/exit + restart Claude Code"
instruction (Claude only scans plugins at session start).
Windows hook compatibility: the refresh-marketplace hook command is
wrapped in `bash -c "..."` because Claude Code on Windows runs hook
commands directly without invoking a shell, so `2>/dev/null || true`
would otherwise be passed as literal argv tokens.
Cross-cutting:
- cli/lib/marketplace.py: shared CLONE_DIR + MARKETPLACE_NAME constants.
- cli/lib/hooks.py: SessionStart now has two independent entries
(pull + refresh-marketplace) so a failure in one doesn't suppress
the other; legacy `da sync` and prior single-pull layouts upgrade
cleanly on re-init.
- PAT injection on every git fetch via per-invocation credential
helper (token in \$AGNES_TOKEN env, never in argv or .git/config).
- Pre-snapshot of installed plugins captured BEFORE
`claude plugin marketplace update` so silent auto-applied version
bumps still fire notifications.
- scripts/dev/agnes-client-reset.sh: cleans ~/.claude/plugins/marketplaces/agnes,
~/.claude/plugins/cache/agnes, drops uv build cache, documents
workspace-scoped residue that can't be enumerated from the script.
- app/web/setup_instructions.py: legacy AGNES_DEBUG_AUTH path also
uses clone (direct HTTPS marketplace add is broken end-to-end on
every Claude Code distribution -- stores response as single file,
plugin source paths then 404).
28 new tests (test_cli_refresh_marketplace.py) + extended hook + setup
template tests cover bootstrap, fetch+reset ordering, version-aware
reconcile, project-path filtering, hook JSON shape, and the bash-c
Windows wrapper invariant.
Introduces STATE_DIR as the single source of truth for the writable
state directory path, with backward-compatible default of
${DATA_DIR}/state. Pairs with a new docker-compose.flat-mount.yml
overlay that mounts the state disk in PARALLEL to the data disk
(rather than nested under it).
Why
---
The default deployment topology nests state under data: sdb at /data,
sdc at /data/state. That layout has known fragility documented in
docs/state-dir.md — bind-propagation gotchas, two-writer collisions
on the same prefix, mount-order coupling. The 2026-05-05 incident in
the Groupon FoundryAI deployment was a manifestation of the
propagation gotcha.
The flat layout (sdb at /data, sdc at /data-state — parallel, not
nested) eliminates the nested-mount class entirely. Each disk is its
own bind mount, recursive by default in modern Docker. No volume
options to forget. No two-writer collision (host scripts and
container app share /data-state at the same path, single namespace).
What changes
------------
App code (Python):
- src/db.py: new _get_state_dir() helper. get_system_db() and
schema migration snapshot use it.
- app/secrets.py: new _state_dir() helper. _load_or_generate() uses
it for .session_secret and .jwt_secret.
- app/main.py: .env_overlay loaded from _state_dir().
Host scripts:
- scripts/ops/agnes-auto-upgrade.sh: STATE_DIR drives mount-sanity
check and cert detection. Defaults preserve existing behavior.
- scripts/ops/agnes-tls-rotate.sh: STATE_DIR drives CERT_DIR.
New compose overlay:
- docker-compose.flat-mount.yml: parallel /data and /data-state binds
per service. Mutually exclusive with docker-compose.host-mount.yml;
pick one based on disk topology.
Documentation:
- docs/state-dir.md: layout choice (A nested vs B flat), pros/cons,
migration steps, and which code paths read STATE_DIR.
Backward compatibility
----------------------
STATE_DIR defaults to ${DATA_DIR}/state — current behavior. Existing
deployers that don't set the var see no behavior change. Migration
to flat layout is opt-in per the runbook in docs/state-dir.md.
Validation
----------
- bash -n on both host scripts: pass
- docker compose config -f docker-compose.flat-mount.yml: resolves
cleanly with all 6 services binding /data and /data-state directly
- python3 import + helper exercise: STATE_DIR override works,
default falls back to ${DATA_DIR}/state
Companion to PR #191 (drop named-volume driver_opts in host-mount.yml).
That PR fixes the immutability footgun for Layout A; this PR offers
Layout B as the architectural alternative.
Two bugs Devin caught:
1. Caddy `try_files A B C` rewrites the URI to its LAST entry when no
file matches (per Caddy docs). Without an explicit "back to original
URI" fallback, a parquet missing from all three known static paths
would get rewritten to `/jira/data/<id>.parquet`, and the
reverse_proxy below would forward THAT rewritten URI to app:8000 →
404. The PR's documented "missed → falls through to app handler"
promise didn't actually hold for legacy / future connectors. Append
`/api/data/<id>/download` as the final try_files entry so the
reverse_proxy receives the analyst-facing URI.
2. agnes-auto-upgrade.sh's TLS-overlay decision (which checks Caddyfile
existence) ran BEFORE the config re-fetch loop. If a tick's fetch
added a previously-missing Caddyfile, this tick's docker compose
would still omit `--profile tls` until the next 5-min tick — a
window where the recreate uses the wrong overlay set. Move the
COMPOSE_FILES tls extension AFTER the fetch.
Also strip the workspace prompt of table-list / metric-count
enumerations (per user feedback): those are dynamic snapshots that go
stale; replace with explicit "use `agnes catalog` / `agnes schema` /
`agnes describe` to discover" guidance plus a note about
`rough_size_hint` semantics. The Available Datasets `{% for t in tables %}`
loop is gone — analysts use the live CLI instead.
Sibling change to the Caddy file_server PR (#182). Without this,
existing long-uptime VMs would pull the new agnes image on auto-upgrade
but keep their stale Caddyfile + docker-compose.yml — leaving the
file_server route + the data:/srv:ro mount inert. Confirmed live
2026-05-05 when the file_server change merged in main but stayed
unreachable on a running dev VM until /opt/agnes/* was scp'd by hand.
agnes-auto-upgrade.sh now hashes the bind-mounted config files
(Caddyfile + every docker-compose overlay) on every 5 min tick and
triggers a `docker compose up -d` recreation when the hash drifts
— same trigger path as an image-digest change. Fail-soft via the
.new-then-mv pattern: a curl 404 / network blip leaves the existing
file untouched.
Self-update at the bottom of the script: re-fetch
/usr/local/bin/agnes-auto-upgrade.sh itself so the very fix that
watches config files lands on running VMs without a manual ssh-and-
curl cycle. Otherwise we'd have a self-perpetuating "old script
problem" — the watch-config logic never propagating to the VMs that
need it.
Operators no longer need to ssh + scp Caddyfile/compose changes.
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.
`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.
config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI
Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)
Tests:
- test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
* fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE
OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which
causes strict TLS stacks (rustls / webpki, used by uv, cargo, and
future versions of pip) to reject the cert with
'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9.
Browsers, curl, and OpenSSL-based clients tolerated the violation,
hiding the bug until a uv user hit it.
Affects every VM running on the self-signed fallback while the corp
PKI hasn't published the real chain yet. Fix lands on the next
agnes-tls-rotate.timer tick (or 'systemctl start
agnes-tls-rotate.service' for an immediate refresh). Existing CSR /
real-cert paths unaffected; only the bring-up fallback regenerates.
* chore(release): cut 0.29.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration
BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.
Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
→ drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
s historic FK constraints. INSERT picks intersection sloupců, takže
test fixtures s minimal pre-v19 schemou migrate cleanly.
Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
(TABLE je unconditionally enabled)
API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
gate přepnut na can_access(user_id, "table", dataset, conn)
Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
(admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()
CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
(back-compat se starými token.json soubory zachována — pole se ignoruje)
Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
(can_access_table over resource_grants), test_journey_rbac.py
(drop access-request flow), test_resource_types.py (drop env-gate
tests, drop is_public from helpers), test_v2_*.py (drop role-based
user dicts in favor of id-based + Admin group membership),
test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
test_can_access_table_no_implicit_public (test_rbac.py),
test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated
Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)
Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.
Plan: ~/.claude/plans/floofy-coalescing-parnas.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(release): cut 0.27.0
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted
Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.
## Why an app-side guard
agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.
The boot-time fixes in infra are preventive. This is the runtime backstop.
## Behavior
Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:
1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
- Mount the config disk if /data/state is not a mountpoint.
- Detect mismatch: if /data/state is mounted from the wrong source,
umount and retry.
2. After the loop, assert findmnt source matches the config disk.
- On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
marks the service failed; no docker compose action runs; existing
containers (if any) keep running on stale state, but no new write
lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
on every run. Idempotent. Guards against propagation regressions
sneaking back in via future docker / kernel changes.
VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.
## Tested
Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.
## Rollout
- This file is fetched by infra startup.sh from
raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
boot. Once merged to main, all VMs pick up the new script on their
next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
`scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
/usr/local/bin/agnes-auto-upgrade.sh` (already done on
foundryai-development).
* chore: vendor-agnostic comment + changelog text
Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.
* fix(ops): suppress mount stderr in retry loop
Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.
Devin BUG_0001 on PR #146.
* fix(changelog): move auto-upgrade entry to [Unreleased]
Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.
Devin BUG_0001 on PR #146.
* fix(infra): single-source agnes-auto-upgrade.sh via curl from main
Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).
Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling
The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.
Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.
Devin BUG_0001 + ANALYSIS_0001 on PR #146.
* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing
Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.
Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).
Devin BUG_0001 on PR #146.
* chore(release): cut 0.25.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
The script's `mkdir -p` left ownership of `/data/state/certs/` to whichever
process won the create race — root when systemd's timer fired before the
app container's first volume init, UID 999 when the container ran first.
With mode 700, a root-owned dir blocks the UID-999 agnes container from
reading its own fullchain.pem; `_read_agnes_ca_pem()` returns None, and
the cross-platform TLS trust block (Step 0 from PR #137) silently
disappears from the /install setup prompt. Operators on the unlucky-race
VMs got a setup prompt that couldn't bootstrap client trust against the
self-signed host. Existing VMs self-heal on next timer tick.
Three CI fixes triggered by the failed PR #137 deploy:
1. scripts/smoke-test.sh: assertion 8 was hitting /api/admin/tables (renamed to /api/admin/registry long ago). The 404 was treated as deployment regression and triggered the auto-rollback. Same stale URL also fixed in CLAUDE.md, README.md, dev_docs/server.md.
2. .github/workflows/release.yml smoke-test job: added Log in to GHCR step. The auto-rollback's docker push :stable was failing with 'unauthenticated' because the smoke-test job had no GHCR login of its own — leaving :stable pointing at the broken image.
3. Rollback step gained GH_TOKEN env, AND the workflow's permissions block gained issues:write. Both were needed for gh issue create to actually create the alert issue (was silently swallowed by the || echo fallback).
Manual cleanup outside this PR: :stable currently points at the broken PR #137 image — needs manual retag back to stable-2026.04.505.
Bootstraps the Agnes Claude Code marketplace + RBAC-allowed plugins from
the dashboard CTA, and inlines the server's TLS cert when the chain isn't
publicly trusted (self-signed / private CA). Cross-platform setup prompt
covers Windows Git Bash, macOS, Linux. Includes Bun-compiled `claude` fix
(macOS goes via git-clone fallback, same as Windows), PAT stripping after
clone, explicit error handling, and four rounds of Devin Review fixes
(phantom step references, $PLATFORM re-detection, heredoc/awk line-count
sync). Cuts 0.21.0.
See CHANGELOG.md [0.21.0] section for details.
Adds `scripts/run-local-dev.ps1` as a sibling of the bash script for Windows operators. Same compose stack (`docker-compose.yml` + `.dev.yml` + `.local-dev.yml`), same up/down/logs subcommands, same LOCAL_DEV_GROUPS default seeding. Restores caller's working directory and LOCAL_DEV_GROUPS on every exit path (success, error, Ctrl+C). Avoids advanced-script promotion so `up -d` / `down -v` reach docker compose instead of being eaten by -Debug/-Verbose.
* fix(security+ops): #82#85#87 — auth hardening, API validation, deploy posture
Security and operational hardening across three issue groups:
- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs
- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename
- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)
Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety
Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
warning log — prevents unhandled exception if DB write fails after
successful token consumption
- Add test for 169.254.169.254 SSRF rejection
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak
Address Devin Review findings on PR #104:
1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
ipaddress module checks. The old regex patterns like `fe80:` only
matched up to the first colon, missing real IPv6 addresses like
`fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
hostname via getaddrinfo and checks each resulting IP against
ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.
2. CLI commands broken: `da setup test-connection`, `da setup verify`,
`da diagnose`, `da status` all called /api/health expecting the old
format (status=="healthy", services dict). Now they call
/api/health/detailed for service-level checks (with graceful fallback
to the minimal endpoint when auth is not configured).
3. Temp file handle leak: _stream_to_temp returns an open
NamedTemporaryFile; callers now close it before shutil.move() to
prevent FD leaks until GC.
Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* fix(review): download regex blocks hyphenated IDs; document health split
Address Devin Review round-3 findings on PR #104:
1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
endpoint used the strict SQL-identifier regex which does not allow
dots or hyphens, but Keboola table IDs like in.c-crm.orders
contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
and hyphens while still blocking path-traversal chars (/, .., \)
and quote/control characters. Added test for hyphenated/dotted IDs.
2. Documented health endpoint split in DEPLOYMENT.md: Added Health
checks & external monitoring section explaining both endpoints
(minimal unauth /api/health vs authenticated /api/health/detailed)
and how to wire external monitoring tools to the detailed endpoint
with a PAT.
Generated with [Devin](https://cli.devin.ai/docs)
Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening
* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)
Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.
Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.
New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).
---------
Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This squashes 13 commits from ma/staging plus a small docstring translation
into a single coherent unit. Three workstreams.
== RBAC v13 redesign ==
- Drops core.viewer/analyst/km_admin/admin hierarchy and the
internal_roles / group_mappings / user_role_grants / plugin_access tables.
- Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill
wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry.
- Two authorization primitives in app.auth.access:
require_admin — Admin-group god-mode
require_resource_access(rt, "{path}") — entity-scoped grants
Single DB lookup per request; no session cache; no implies BFS.
- /admin/access UI (single page) replaces /admin/role-mapping +
/admin/plugin-access. CLI `da admin group/grant *` replaces
`da admin role/mapping/grant-role/revoke-role/effective-roles`.
- ResourceType.TABLE listing-only — admins can record table grants,
runtime enforcement still flows through legacy dataset_permissions
(migration plan in docs/TODO-rbac-data-enforcement.md).
== Claude Code marketplace ==
- Aggregated /marketplace.zip + /marketplace.git/* (PAT-gated,
RBAC-filtered, content-addressed cache via dulwich).
- Admin god-mode dropped on the marketplace surface — admins curate
their own view via grants like everyone else.
- Bare-repo cache materializes per RBAC-filtered ETag; stale entries
not pruned in this iteration (disclaimed in git_backend.py docstring).
== #81#83#44 security/ops hardening ==
- #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias).
- #81 Group B — Keboola extractor 3-state exit codes:
0 success / 1 total fail / 2 PARTIAL fail
Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary
alerting must teach it the new partial signal.
- #81 Group C — schema v10 view_ownership; rejects silent overwrite
of a prior connector's view name on collision.
- #81 Group D — extractor-side identifier validation.
- #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset
+ path-traversal fix.
- #44 — entire /api/scripts/* surface is admin-only (planted-script +
sandbox-bypass risk closed).
== Web UI polish + deploy fix ==
- /admin/access: live grant-count badges (no stale snapshot revert),
shared-header CSS link added to /catalog and /admin/{tables,permissions},
per-resource-type colored stripes.
- docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't
silently shadow sub-mounts and write state to the wrong disk.
== OSS vendor-neutralization (waves 1+2) ==
- scripts/grpn/ → scripts/ops/. Customer-specific identifiers
(project IDs, internal hostnames, dev/prod VM IPs, brand names)
replaced with placeholders across code, docs, Terraform, Caddyfile,
OAuth probe, and planning docs. Downstream infra repos that copied
scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must
update the path.
== Translation ==
- src/repositories/user_groups.py::ensure_system docstring translated
from Czech to English for codebase consistency.
Co-authored-by: Mina Rustamyan <mina@keboola.com>
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)
Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.
Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh
Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)
Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.
This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.
Refs #88.
* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)
PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.
Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
for infra/modules/customer-instance/, which is wrong (the TF module embeds
the script inline via heredoc, never sourced from scripts/grpn/). Now
flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
example → English description with `acme, example` placeholders.
Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
specific shared dev VM. Per-developer dev VMs are the supported pattern.
Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).
CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.
Refs review of #94, closes#88.
* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)
Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:
1. infra/modules/customer-instance/variables.tf had Czech descriptions on
8 more variables. Previous review only flagged line 19; this round
audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
78, 84 to English. Same review concern: a Terraform module that is
the customer-facing API surface in Czech is unfit for OSS distribution.
2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
four outputs. Same fix.
3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).
4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
Translated.
5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
but the actual code uses both MyAgent (in docstrings) and Example
(in test fixtures). Reworded to mention both targets.
Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.
157/157 OpenMetadata + DuckDB tests still pass.
* fix(oss): close#94 round-3 leaks (env.template, instance.yaml.example, padak typo)
Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):
1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
leak in a shipping config example. Replaced with '(optional)'.
2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
in operator-facing env-template comment. The script lives at
scripts/ops/ now (commit 16a85cc); this comment had been pointing
operators at a non-existent path.
3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
upstream' from a sloppy substitution in round-2. Trivial wording fix.
Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.
* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md
Devin Review caught two findings on the latest round-3 commit:
1. docs/QUICKSTART.md:67 still pointed users at the deleted
scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
would hit a missing-file error at the final step. Replaced with the
inline gcloud-ssh equivalent that the Removed bullet documents.
2. docs/padak-security.md filename retains the personal identifier
'padak'. The PR fixed the body content (replaced
padak/keboola_agent_cli#206 references with generic wording) but
missed the filename. Renamed to docs/security-audit-2026-04.md
(date-anchored, vendor-neutral). Updated the historical CHANGELOG
link to point at the new path with an inline note about the rename.
* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email
Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
personal-email default (zdenek.srotyr@keboola.com). Replaced with
':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
prod/dev IPs) that the round-1 IP-replace pattern missed (it only
targeted 34.77.x.x). Generic regex redaction across all five
planning docs replaces every public IP with <redacted-ip>,
preserving private/loopback/IAP ranges.
* feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS
LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups
empty, so group-aware UI/code paths can't be exercised on localhost without
a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array
matching the production {id, name} shape) populates the session on every
dev-bypass request — same structure the OAuth callback writes, so mock and
prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise
on PAT/CLI requests; malformed input falls back to [] with a WARNING so
the dev mock never breaks the dev flow.
* refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate
Three small follow-ups on the same dev-mock vector before merge:
- Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs
in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot
instead of silently logging on the first authenticated request, where
it's easy to miss.
- Cache the parsed result single-slot, keyed by the raw env-string. Avoids
re-parsing JSON on every authenticated request without test-isolation
surprises — when the env value changes, the key changes and the cache
transparently rebuilds.
- Stop mutating the parsed-input dicts (item.setdefault → spread-merge)
so the cached list stays a fresh value on every rebuild.
- Replace the try/except guard around request.session with hasattr —
SessionMiddleware is always registered, the silent except was paranoid.
Tests grow by a direct session-cookie inspection (decoupled from the
profile template) and three startup-banner log assertions.
* fix(auth): drop fragile session-decoder test + actually skip empty-target write
Two follow-ups on the LOCAL_DEV_GROUPS feature before merge:
- Drop test_session_holds_mocked_groups_directly. It manually decoded the
signed session cookie via TimestampSigner + base64, hardcoding both the
Starlette session-cookie format and the 14-day max_age. Starlette has
changed its session encoding before (URLSafeTimedSerializer pre-0.20)
and would do so again silently — the test would fail with a cryptic
BadSignature, not a clear "mock is broken" signal. The remaining
test_dev_user_sees_mocked_groups_on_profile already covers the same
observable signal (mocked groups in /profile body) without coupling to
Starlette internals.
- Actually skip the session write when target_groups is empty. The previous
comment claimed compare-then-write avoided spurious Set-Cookie noise on
PAT/CLI requests, but on those requests session.get("google_groups") is
None and target is [], so None != [] always evaluates True and the write
fired anyway, marking the session dirty and re-issuing Set-Cookie on
every request. Adding `target_groups and ...` to the guard makes the
comment honest: empty mock now genuinely no-ops, stable browser sessions
still skip via value-equality, and the only remaining write is the one
that actually changes state.
33 auth tests still pass locally.
* fix(auth): match production's always-write semantics for stale dev groups
Devin code-review finding on PR #70: my earlier `target_groups and ...`
short-circuit silently diverged from the production OAuth callback. In
app/auth/providers/google.py:189-194 the callback always writes
session.google_groups on each login — including [] on failure or empty
token — so the session always reflects authoritative current state. The
mock should match.
Failure mode the previous guard left open: a developer sets
LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed
cookie, then the developer unsets the env var and reloads. target → [],
session.get → [{...}], `if target_groups and ...` is False, no write,
stale groups stay in the browser session indefinitely. Mock now lies
about state until logout.
Fix splits the guard:
- target_groups truthy + value-changed → write the new mock (existing path)
- target_groups falsy + non-empty stored → write [] to clear stale state
- otherwise no-op (target [] + stored None/[]: no transition to record)
PAT/CLI requests with no prior session still take the no-op path
(target=[], session.get → None which is falsy), so the original goal of
suppressing spurious Set-Cookie noise on token traffic is preserved.
Tests already cover the populated and unset paths; the new clear-stale
branch is correct by construction (production has the same shape) and
the rare manual reset workflow.
* release(0.11.2): default mocked groups in make local-dev + docs/local-development.md
Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience
follow-up: every `make local-dev` now boots with two sensible default
mocked groups (Local Dev Engineers + Local Dev Admins on example.com),
so /profile and group-aware code paths render something realistic
without the operator having to discover and set LOCAL_DEV_GROUPS.
Layered so the default lives in the workflow, not the contract:
- scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":="
syntax — only sets the var when the operator hasn't already.
Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable:
LOCAL_DEV_GROUPS= make local-dev.
- docker-compose.local-dev.yml swaps the commented JSON example for
a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the
shell, the compose file just propagates it. Operators running
`docker compose up` directly without the wrapper script get an
empty mock (correct: they didn't opt into the make-driven defaults).
- Makefile help line mentions the mocked groups so the behavior is
visible without grepping.
New docs/local-development.md consolidates dev-onboarding instructions
that were previously scattered across docker-compose.local-dev.yml
inline comments, docs/auth-groups.md "Local-dev mock" section, the
Makefile help text, and CLAUDE.md "First-Time Setup". Single page now
covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking
controls + verification, what is *not* mocked (Cloud Identity, real
OAuth, admin Workspace permissions), and the safety rails that keep
the dev shortcuts off production.
Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts
[Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased]
skeleton.
* fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion
Reported by an operator running `make local-dev` against the freshly
released 0.11.2 — the LOCAL_DEV_MODE banner showed:
LOCAL_DEV_GROUPS is not valid JSON, ignoring:
Expecting ',' delimiter: line 1 column 70 (char 69)
LOCAL_DEV_GROUPS is set but produced no valid groups —
check the WARNING above for the parse error.
Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter
expansion. Bash matches `}` to close the expansion at the *first* `}`
encountered in the body, regardless of context — even one inside a
nested JSON object literal. The two-element JSON array was therefore
truncated to the first group's closing brace, leaving an unparseable
fragment:
[{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers"
There is no escaping syntax for `}` inside parameter expansion (the
backslash escapes I had only escaped the quotes — `}` reaches bash
literally). Fix: hold the default in a single-quoted variable and
reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`.
The variable's value is opaque to the expansion — no `}` matching
inside it — so the JSON survives intact. Verified with `python -m json`:
parsed OK: 2 groups: ['local-dev-engineers@example.com',
'local-dev-admins@example.com']
Operators on a running 0.11.2 stack: `make local-dev-down && make
local-dev` to pick up the corrected default.
* fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link
Two follow-ups from a Devin code-review pass on PR #70:
- run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to
${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form
substitutes the default when the variable is unset OR set-but-empty,
silently overwriting the documented disable knob. Three places
promise this works — docs/local-development.md, the CHANGELOG entry,
and the script's own comment — so the bug was an operator-facing
lie, not just an implementation detail. The bare - form only
substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now
reaches the Python parser as "" and short-circuits to []. Verified
with both empty and unset shells.
- CHANGELOG.md: add the [0.11.2] link reference at the bottom.
Keep-a-Changelog convention is to mirror every version heading
with a release-tag link in the footer; the 0.11.2 heading was
missing its counterpart, breaking the Markdown link rendering on
GitHub.
---------
Co-authored-by: Claude <noreply@anthropic.com>
* feat(auth): display Google Workspace groups on /profile
- Request cloud-identity.groups.readonly scope in Google OAuth
- Fetch groups via Cloud Identity API after callback; tolerate 4xx
(non-Workspace tenants) and network errors — never break login
- Store result in Starlette session as google_groups
- Replace /profile redirect with a real profile page rendering
account details (email, name, role) and the group list; show a
friendly empty state when no groups are available
- Tests: helper parsing + 403 + exception paths; profile page
smoke test; updated the old redirect test
* test: remove stale /profile redirect tests
Cherry-pick of Zdeněk's 4f7e4cd ("display Google Workspace groups on
/profile") replaces the /profile redirect with a real profile page —
but only updated one of three tests that expected the old behaviour.
These two tests in test_admin_tokens_ui.py and test_pat.py were left
asserting `/profile → 302 /tokens`, which now returns
`/profile → 302 /login?next=%2Fprofile` for unauth users (the standard
auth guard) or `/profile → 200 HTML` for authenticated users.
Removed both rather than patched — coverage for the new behaviour
already exists in tests/test_auth_providers.py (added by the same
commit). The /tokens render assertions in the deleted test_pat.py case
are redundant with test_admin_tokens_ui.py's own /tokens UI tests.
* fix(auth): Google groups search query needs parent + labels predicates
Cloud Identity Groups Search API returns 400 INVALID_ARGUMENT when the
CEL query lacks the required `parent == 'customers/<id>'` predicate AND
a `'<label>' in labels` membership predicate. Zdeněk's original 4f7e4cd
query had only `member_key_id == '<email>'` — every fetch silently
returned [] and the /profile groups list was always empty.
Fix: build the query with all three required pieces:
parent == 'customers/my_customer' (alias = caller's own Workspace
org; no need to look up customer ID)
member_key_id == '<email>' (filter to this user's memberships)
'cloudidentity.googleapis.com/groups.discussion_forum' in labels
(Workspace mailing-list groups —
the common case; security-group
coverage is a follow-up)
Also: log the full error body (not truncated to 200 chars) and the
query string so the next time Google rejects something we can diagnose
in one log line instead of a re-deploy.
Caught when first agnes-dev login completed normally (HTTP 302) but app
log showed `Google groups fetch returned 400 for petr@keboola.com:
{"error":{"code":400,"message":"Request contains an invalid argument."}}`
on the same VM (kids-ai-data-analysis / agnes-dev.keboola.com).
Reference: https://cloud.google.com/identity/docs/reference/rest/v1/groups/search
* feat(web): add Profile link to user dropdown menu
The /profile page (Zdeněk's 4f7e4cd cherry-pick) renders a real profile
view including Google Workspace groups, but had no entry point in the
UI — users could only reach it by typing the URL manually. Add a
"Profile" menu item between the user header (email + role) and
"My tokens" so the page is discoverable.
Side effect: cleaned up the leftover `or _path.startswith('/profile')`
condition on the "My tokens" active class, which dated from the old
/profile → /tokens redirect (removed in c789617). Now each menu item
owns its own active state.
* fix: profile-link tests + .env quoting for CADDY_TLS
Two issues caught by Keboola's first agnes-dev deploy + agnes-auto-upgrade
cron run:
1. tests/test_web_ui.py — two negative assertions ("href=/profile" NOT in
body) date from when /profile was a redirect-only stub. Now /profile
is a real page (groups display) AND has a dropdown menu link, so the
negative assertions flip to positive. Same for ">Profile<" text in
the non-admin nav test.
2. startup-script.sh.tpl — CADDY_TLS line must be QUOTED in .env, because
agnes-auto-upgrade.sh sources .env via `set -a; . .env; set +a` and
bash treats `KEY=value with spaces` as `KEY=value` followed by `with`
and `spaces` exec attempts. Symptom: cron log spam
`/opt/agnes/.env: line 14: petr@keboola.com: command not found`,
the cron exits non-zero, and no auto-upgrade ever happens. Caddy
itself reads the value fine because docker-compose env_file=.env
parses key=value properly without shell-evaluating the rest.
Fix: emit `CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`.
Both the cron source and docker-compose env_file accept the quoted
form; cron stops failing.
* fix(auth): use searchTransitiveGroups + security label for non-admin user
Three bugs in the original cherry-pick + my prior fix attempt, all caught
by a stdlib probe script (scripts/debug/probe_google_groups.py) run
locally with a Playground-issued OAuth token:
1. Wrong endpoint. `groups:search` is the admin "find groups in org"
endpoint and 400s for non-admin users regardless of query. Switched
to `groups/-/memberships:searchTransitiveGroups` which is the
user-perspective "what groups am I in" endpoint.
2. Wrong label. Querying with `cloudidentity.googleapis.com/groups.discussion_forum`
returns 403 "Insufficient permissions to retrieve memberships" even
on the new endpoint — Workspace policy denies non-admin reads of
discussion-forum groups. Switching to `groups.security` returns 200
with the actual membership list. Empirically every Workspace group
at Keboola carries BOTH labels, so the security filter sees the full
set anyway. Confirmed with the probe script.
3. Wrong response shape. `searchTransitiveGroups` returns
{"memberships": [...]}, not {"groups": [...]}. Parser updated
accordingly.
Also adds scripts/debug/probe_google_groups.py — stdlib-only standalone
probe that hits 6 candidate endpoints with a user OAuth token. Saved a
deploy cycle (~10 min) per query iteration; future API-syntax debugging
should start there.
Verified end-to-end: petr@keboola.com login on agnes-dev returns 5
groups (LIC-1PASSWORD, ROLE_ATLASSIAN_*, etc.) via the probe; once
deployed, the same will populate session["google_groups"] and render
on /profile.
* test(auth): update Google groups parser fixture to match searchTransitiveGroups shape
Mock payload was `{"groups": [...]}` (the shape `groups:search` returns).
After switching to `groups/-/memberships:searchTransitiveGroups` in the
prior commit, the actual response is `{"memberships": [...]}` and the
parser iterates that key. Test now mirrors the real shape.
The per-item structure (groupKey.id + displayName) is unchanged, so the
expected output dict stays the same: [{"id": "...", "name": "..."}].
* docs(auth): add docs/auth-groups.md — Google Workspace groups runbook
Captures the non-obvious bits: the GCP-side setup checklist (Cloud
Identity API + scope on consent screen + Internal user type), the
`security` vs `discussion_forum` label trap (the latter 403s for
non-admins, the former 200s — one of those is a 4-iteration debug
session and shouldn't have to be repeated), where groups are stored
(session, not DB) and how to refresh (re-login), plus how to use the
probe script for future API-syntax issues.
Deliberately stops short of explaining "what is Cloud Identity" or
"what is OAuth scope" — those belong in Google's own docs, not ours.
* docs(claude): document release workflows + module versioning + recreate trick
New "Release & deploy workflows" section in CLAUDE.md covers what didn't
exist anywhere in the repo before:
- Distinction between release.yml (auto-build per push) vs the new
keboola-deploy.yml (tag-triggered, explicit deploy only) — plus when
to use which (per-developer convenience vs shared dev VM safety)
- Module versioning (infra-vX.Y.Z) and the bump-after-merge dance
- The lifecycle.ignore_changes [metadata_startup_script] gotcha and how
to force a recreate via workflow_dispatch's recreate_targets input
All generic — no customer hostnames, project IDs, IPs. Customer-specific
deploy steps belong in the consuming infra repo's README.
Also: cross-reference docs/auth-groups.md from the Authentication
section so future Claude sessions find the Workspace-groups runbook
without grepping.
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(dev): add LOCAL_DEV_MODE for one-command local dev
When LOCAL_DEV_MODE=1, every protected route auto-authenticates as a seeded
admin user (default dev@localhost) — no login screen, no Google OAuth config,
no magic-link roundtrip. Startup logs a loud warning to make misuse obvious.
Also fixes two preexisting bugs in the magic-link flow that surfaced while
wiring up the dev fallback:
- /auth/email/verify only accepted POST, but the URL embedded in emails is
a GET link — clicking from any mail client returned 405. Added a GET
variant that consumes the token, sets the auth cookie, and redirects to
/dashboard.
- Token expiry check compared an offset-aware datetime.now(timezone.utc)
against an offset-naive value from DuckDB, raising TypeError on every
valid link. Normalize the stored timestamp to UTC before subtracting.
Dev-only fallback (scoped strictly to LOCAL_DEV_MODE to keep test and
production behavior identical): send-link logs the magic link to stderr
and returns it as dev_link in the JSON response when no SMTP is configured.
Usage:
./scripts/run-local-dev.sh
open http://localhost:8000 # lands on /dashboard as admin
* fix(dev): URL-encode magic-link email + avoid /login redirect loop
Two issues surfaced by Devin review on PR #32.
1. _build_magic_link interpolated email into the URL unescaped. For addresses
with '+' (e.g. user+tag@gmail.com) Starlette's query parser decoded '+'
as a space on the GET /verify side, so repo.get_by_email returned None
and every click yielded 401 "Invalid link". quote(email, safe='') fixes
both the email transport and the dev_link fallback.
2. /login in LOCAL_DEV_MODE unconditionally redirected to /dashboard. If
dev-user seeding failed at startup (main.py wraps seed in try/except),
/dashboard 401'd, the HTML redirect handler bounced to /login, and the
loop repeated until the browser aborted. Now /login checks the dev user
actually exists before short-circuiting; otherwise it falls through to
the normal login form so the missing seed is visible.
* fix: redirect unauthenticated HTML routes to /login (#10)
* docs(plan): user mgmt + PAT + CLI distribution implementation plan (#9#10#11#12)
* build(docker): produce wheel artifact for /cli/download (#9)
* feat(db): schema v5 — users.active + deactivated_at/by (#11)
* feat(api): /cli/download wheel + /cli/install.sh with baked server URL (#9)
* feat(users): repository supports active flag + count_admins (#11)
* feat(ui): /install page with per-deployment install instructions (#9)
* feat(api): user PATCH/reset-password/set-password/activate/deactivate (#11)
* fix(cli): da login prompts for password and sends it in body (#9)
* test(api): safeguard tests for self-deactivate and last admin (#11)
* feat(auth): reject requests from deactivated users (#11)
* fixup(#10): propagate next through /login buttons + lock down sanitizer tests
* feat(cli): da admin set-role/activate/deactivate/reset-password/set-password (#11)
* feat(ui): /admin/users management page (#11)
* feat(db): schema v6 — personal_access_tokens (#12)
* feat(users): access_tokens repository (#12)
* feat(auth): JWT carries typ (session|pat) and explicit jti (#12)
* feat(auth): reject revoked/expired PATs; update last_used_at (#12)
* feat(api): /auth/tokens CRUD + admin revoke; session-only guard (#12)
* feat(cli): da auth token create/list/revoke (#12)
* feat(ui): /profile page with PAT create/list/revoke (#12)
* docs: PAT usage and session/PAT TTL clarification (#12)
* feat(auth): PAT first-use-from-new-IP audit + last_used_ip (schema v7) (#12)
Closes remaining acceptance gap from issue #12: audit_log entry on first use
of a PAT from an IP that differs from the recorded last_used_ip.
- schema v7: personal_access_tokens.last_used_ip column
- AccessTokenRepository.mark_used now stores the client IP
- get_current_user extracts client IP (X-Forwarded-For first hop, fallback
to request.client.host) and emits a token.first_use_new_ip audit when the
IP changes on a subsequent use (not the very first use)
- tests: new-ip audit, same-ip no-op, first-ever-use no-op, schema v7 column
* fix: address Devin review findings on PR #28
- app/main.py: exclude /auth/* from HTML redirect handler so JSON
endpoints under /auth/ (PAT CRUD used by `da auth token` CLI) keep
their 401 JSON contract (Devin #1, bug)
- app/api/tokens.py: reject expires_in_days <= 0 explicitly; use
`is not None` so 0 no longer silently creates a non-expiring token
(Devin #2)
- app/api/users.py: validate role against Role enum in create_user
to match update_user and prevent 500 on role-protected requests
later (Devin #3)
- app/web/templates/admin_users.html: escape user-supplied strings
before innerHTML; move onclick handlers to addEventListener via
data attributes so emails with quotes / HTML no longer break the UI
or enable stored XSS (Devin #4)
- app/auth/router.py, app/auth/providers/{password,google}.py:
reject deactivated users at login instead of issuing a JWT that
would then fail on the next request — removes the confusing
redirect loop (Devin #5)
- CLAUDE.md: document schema v7 instead of stale v4 (Devin #6)
- tests/test_web_ui.py: regression test for the /auth/* JSON 401
* feat(web): add /profile and /admin/users links to dashboard nav
* feat(web): point setup banner at /install page
* chore(web): drop unused setup_instructions context
* fix: address Devin review round 2 on PR #28
- app/api/tokens.py: when expires_in_days is None (the "never" option),
use a ~100-year JWT expiry so the token doesn't silently die in 24h
via the session-default fallback in create_access_token. The real
expiry enforcement stays in verify_token's DB-level check (Devin 🔴)
- app/web/templates/profile.html: escape t.name and other user-supplied
strings via esc() helper before innerHTML, same pattern as
admin_users.html. Move revoke onclick to data-attribute +
addEventListener (Devin 🟡)
- app/api/cli_artifacts.py: use `mktemp -d` with X's at end of template
for GNU/BSD portability, place wheel inside the temp dir and
clean up with rm -rf (Devin 🚩)
* feat(web): redesign /install page; make curl one-liner primary, collapse manual
Rebuild the public /install page using the dashboard visual language
(shared header, card layout, gradient hero, design tokens from
style-custom.css). The page is now anchored on the one-liner install
path: curl -fsSL <server>/cli/install.sh | bash is rendered as the
primary, prominent step 1, while the old manual wheel-download flow
is tucked behind a closed-by-default <details> block for users in
restricted/offline environments.
Information architecture:
hero (server URL + version)
-> step 1: quick install (one-liner, big Copy button)
-> step 2: create PAT on /profile + export DA_TOKEN / da auth whoami
-> step 3: Claude Code / MCP via ~/.config/da/token.json
-> collapsed "Manual install" details for download-wheel flow
-> footer link to docs/HEADLESS_USAGE.md
Every shell snippet has a vanilla-JS "Copy" button that confirms
visually ("Copied!" for 1.5s) and falls back to textarea+execCommand
on non-secure contexts. No new dependencies, no bundler.
The route now also pulls an optional user so the header shows the
same nav (Dashboard / Profile / Logout) as dashboard.html when a
session exists, while staying fully public when signed out.
* fix(cli): use real wheel filename in install.sh (broken pip/uv install)
The installer wrote the downloaded wheel as agnes_cli.whl, which lacks a
PEP-427 version component — both pip and uv tool install reject it and
abort the one-liner.
Use curl -OJ so Content-Disposition determines the on-disk filename, then
resolve it via glob. Install an EXIT trap to remove the tmpdir even when
install fails.
* fix(web): correct manual install wheel glob and add PEP 668 / PATH hints
- Wheel glob is agnes_the_ai_analyst-*.whl (not agnes-*.whl) — the old
pattern never matched the real artefact name from the build.
- Add — or — separator between uv tool install and pip install.
- Warn that pip install --user is blocked on macOS Homebrew / modern
Debian (PEP 668) and recommend uv tool install as the default path.
- Both flows now show the ~/.local/bin PATH hint so a fresh shell can
find the da binary after install.
* fix(web): consistent session.user reference in install header
The avatar-letter fallback inside {% if session.user %} was reading
user.name / user.email directly, but the route dependency can pass
user=None — those references resolved to an empty FlexDict and produced
an empty avatar circle. Read everything through session.user to match
the guard and the dashboard pattern.
* fix(web): point headless usage link at GitHub source
/docs/HEADLESS_USAGE.md 404s — no static route serves repo docs. Point
the footer link at the rendered markdown on GitHub instead of adding a
dedicated docs serving route just for one file.
* feat(web): /install hero size, anon sign-in banner, step 2 copy polish
- Bump hero h1 from 26px to 30px to match dashboard primary scale.
- Anonymous visitors see a small sign-in banner above Step 2 (creating
a token requires auth; without the banner the flow appears stuck).
- Add an 'After generating your token' section label inside Step 2 so
the /profile CTA button no longer looks wedged mid-sentence between
adjacent paragraphs.
* chore(web): /install a11y + version pill polish
- aria-live='polite' on copy buttons so screen readers announce the
'Copied!' state change.
- Replace redundant INSTANCE_NAME eyebrow (already in the header logo)
with 'Getting started'.
- Hide the version pill when AGNES_VERSION is unset/'dev' — avoids the
misleading 'vdev' label in local/unbuilt runs.
- Manual summary focus-visible outline-offset +2px (was -2px which
clipped inside the card), and mark the chevron as decorative.
* fix(web): use session.user in dashboard avatar fallback
Inside {% if session.user %} guard, the avatar fallback referenced
(user.name or user.email). If user is None the block crashes when
the profile picture is absent. Align with the guard variable.
* fix: address Devin review round 3 on PR #28
- app/api/users.py: stop auto-sending email from reset_password. The
magic-link sender would deliver a "Login Link" that — when clicked —
consumes the reset_token via verify_magic_link and logs the user in
WITHOUT prompting for a new password. Admins now share the raw
reset_token from the API response manually, or use set-password
directly. email_sent is always False. Documented inline. (Devin 🟡)
- app/api/cli_artifacts.py: harden /cli/install.sh generation against
shell injection via Host header or AGNES_VERSION. base_url is
validated against a strict scheme+host+port regex; version against
an alnum + dot/dash/underscore allowlist. Both values are also
piped through shlex.quote() as defense in depth. (Devin 🟡)
The shared users.reset_token column between magic-link and password-
reset flows (Devin 🚩) remains an architectural gap; splitting into
separate columns needs schema v8 and is tracked for a follow-up PR.
* docs, chore(grpn): manual-deploy helpers + hackathon deploy learnings
Adds scripts/grpn/ — Makefile + agnes-auto-upgrade.sh + README for
operating Agnes on GRPN's existing foundryai-development VM when the
full Terraform flow is blocked by org policies:
- iam.disableServiceAccountKeyCreation (org constraint) forbids SA
JSON keys, so GCP_SA_KEY-based CI is unavailable
- No projectIamAdmin delegation → bootstrap-gcp.sh can't grant roles
- Secret Manager IAM bindings require setIamPolicy which editor lacks
Helper targets: deploy, deploy-tag, recreate, restart, stop, start,
status, version, logs, ps, env, ssh, tunnel, open, bootstrap-admin,
set-data-source, install-cron, uninstall-cron.
docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md — running
log of all org-policy constraints hit during the hackathon deploy,
with workarounds and derived follow-ups (WIF support, external_ip
variable, customer onboarding IAM checklist).
Not a replacement for the TF flow — stopgap until WIF lands.
* fix(web): make header logos clickable links to home
* feat(web): one-click "Setup a new Claude Code" button
Adds a single-button flow on the dashboard and /install page that
generates a fresh personal access token via POST /auth/tokens and
copies a complete, paste-ready setup script (server URL, token,
install/verify commands) to the clipboard. Falls back to a modal
textarea when the clipboard is blocked; redirects to /login on 401;
surfaces backend errors inline.
- dashboard.html: replaces the top "Set up your local environment"
anchor with a real button wired to setupNewClaude(). Removes the
duplicate bottom setup banner to keep a single entry point.
- install.html: for signed-in users, Step 1 leads with the one-click
button and demotes the curl one-liner into a collapsible "Or run
manually" aside. Anonymous visitors still see the curl flow plus a
sign-in hint.
- No new deps. Vanilla JS. Token lives in memory/clipboard only —
never rendered into persistent DOM.
* feat(cli): add "da auth import-token" for non-interactive PAT login
Writes a provided JWT into ~/.config/da/token.json using the canonical
{access_token, email, role} shape expected by save_token(). Decodes the
token locally to pull email/role claims, verifies it against the server
via GET /api/catalog/tables, and refuses to overwrite an existing token
file if the server returns 401. --email / --role overrides exist for
tokens missing those claims; --skip-verify bypasses the server round-trip
for offline / CI scenarios.
* test(cli): cover da auth import-token success + 401 + claim-fallback paths
Three new tests in TestAuthImportToken:
- valid JWT + 200 -> canonical token.json written
- 401 from /api/catalog/tables -> exit 1, existing token file untouched
- JWT without email/role claims -> refused without overrides, accepted
with --email / --role flags
* feat(web): update one-click Claude setup instructions — explicit uv install, import-token, skills question
Replaces the fragile `cat > token.json <<EOF` clipboard payload with an
explicit, auditable sequence:
1. `curl -fsSL /cli/download` + `uv tool install --force` (no opaque
`curl | bash`).
2. `da auth import-token --token ...` instead of hand-written JSON.
3. Explicit PATH persistence for zsh/bash.
4. A required question to the user about whether to copy the bundled
skills into ~/.claude/skills/agnes/ or pull them on-demand via
`da skills show`.
5. A final confirmation step with whoami + version output.
Factored both pages to include a shared partial
(app/web/templates/_claude_setup_instructions.jinja) so dashboard.html
and install.html can never drift apart again. {server_url} and {token}
stay as runtime placeholders substituted by renderSetupInstructions().
* feat(ui): modernize /admin/users + unify header nav across pages
- New shared partial app/web/templates/_app_header.html — single source
of truth for the top navigation. Used by base.html and dashboard.html
(which doesn't extend base.html). Active page highlighted via
request.url.path. Admin "Users" link gated by session.user.role.
- style-custom.css: add .app-header / .app-nav-link / .app-btn-logout /
.app-avatar styles (mirrors dashboard's previous inline copy under
app-* prefix). Mobile-friendly fallback at <720px.
- base.html: include the new partial so every page extending base
(admin_users, profile, login_email, error, …) gets the same chrome
the dashboard has.
- dashboard.html: replace its inline <header class="header"> markup
with the shared partial. Inline .header CSS left in place as
harmless dead code (separate cleanup PR).
- admin_users.html: rewritten with avatars, role pills (color-coded
per role), toggle switch for active, search/filter input, toast
notifications, modal dialogs replacing alert/confirm/prompt,
one-click copy for the reset token, empty / loading states.
All XSS-safe via the existing esc() helper + data-attribute
event delegation.
- tests/test_web_ui.py: smoke test that /admin/users renders the new
shared header chrome and the modernized markup.
* feat(api): serve CLI wheel at /cli/agnes.whl for direct uv install
uv tool install inspects the URL path suffix to recognise a wheel, so
/cli/download (which has no .whl suffix) cannot be installed directly.
Expose a stable /cli/agnes.whl alias over the same wheel lookup so users
can run: uv tool install --force https://<server>/cli/agnes.whl
* test(cli): cover da auth import-token --server persisting to config.yaml
The server persistence was already implemented in the import-token command
(save_config({server}) call) but not covered by tests. Add an explicit test
so the one-step setup contract — single import-token call writes both token
and server — cannot regress.
* feat(web): simpler Claude setup — single uv install URL, single import-token call
User feedback: the prior clipboard payload repeated the server URL and
token across multiple steps (curl + tmpfile + install + rm + separate
seed-config + import-token). Collapse to:
1. uv tool install --force {server_url}/cli/agnes.whl (single URL, direct)
2. da auth import-token --token ... --server ... (one call, persists both)
3. da auth whoami
4. skills (ask user first)
5. confirm
uv accepts HTTPS URLs that end in .whl and installs them directly, so
the tmpfile dance is unnecessary. import-token --server already persists
the server to config.yaml, so no separate printf > config.yaml step.
* fix(tests): update admin users heading assertion after template rename
The admin_users.html template now uses <h2 class="users-title">Users</h2>
instead of <h2>User management</h2>. Update the assertion to match.
* feat(ui): unify header across remaining 7 standalone pages
These 7 pages render their own full <html> and don't extend base.html,
so the previous unification commit only covered base + dashboard. Each
had its own ad-hoc <header> markup with inconsistent classes
(.top-header / .header / .page-header), inconsistent nav-link sets,
and inconsistent avatar/email styling.
Replace each inline <header>...</header> block with the shared
{% include '_app_header.html' %} so /activity-center, /admin/permissions,
/admin/tables, /catalog, /corporate-memory, /corporate-memory/admin,
and /install all show the same chrome (Dashboard / Install CLI /
Profile / Users / email + avatar / Logout) with the active page
highlighted via request.url.path.
Old inline header CSS (.header, .top-header, .page-header, .nav-link,
etc.) is left in place as harmless dead code; it can be cleaned up in
a follow-up sweep.
* feat(web): add readable preview of Claude setup payload on dashboard + /install
Move the line-by-line setup instructions into app/web/setup_instructions.py
as the single source of truth, then render them in two modes from the
existing _claude_setup_instructions.jinja partial:
- preview_mode=True → visible, read-only <pre><code> block with the real
server URL and a clearly-styled placeholder token (never a real one).
- preview_mode=False → the JS SETUP_INSTRUCTIONS_TEMPLATE used by the
one-click flow (unchanged behaviour).
Both /dashboard (env-setup-cta card) and /install (Step 1 card) now show
the preview directly under the 'Setup a new Claude Code' button so users
can see exactly what will land in their clipboard before they click.
* feat(web): update setup instructions — `da diagnose` step, explicit section titles
Rework the Claude Code setup payload to:
- Give every numbered step an unambiguous verb header ("1) Install the CLI",
"2) Log in", "3) Verify the login", "4) Run diagnostics", "5) Skills (ask
the user first)", "6) Confirm").
- Add step 4 `da diagnose` as the post-login health check. The CLI already
ships this command (cli/commands/diagnose.py); it prints "Overall:
healthy" and a list of green checks that map cleanly to next actions.
- Ask the skills copy-vs-on-demand question verbatim so Claude Code always
prompts the user the same way.
- Replace the terse "Confirm" line with a 4-bullet summary (version,
whoami, skills choice, diagnose status) so the return message is
structured and comparable across setups.
* chore(web): remove stale MCP card from /install (no MCP server today)
The 'Use with Claude Code / MCP' card (Step 3 on /install) referenced an
MCP integration Agnes does not ship. Remove the whole card. The one-click
'Setup a new Claude Code' flow in Step 1 already covers the long-lived
client use case and is less confusing than dangling persistence tips for
a non-existent integration.
* feat(api): include user_email + last_used_ip + user_id in admin tokens list response
Adds AdminTokenItem response model (superset of TokenListItem) and
AccessTokenRepository.list_all_with_user() joining personal_access_tokens
with users to denormalize user_email. Needed for /admin/tokens UI where
admins triage tokens across all users.
* feat(web): /admin/tokens page — list, filter, search, revoke across all users
Adds a new admin-only page with client-side filtering (status, user email,
last-used window), column sorting, counts bar (active/revoked/expired),
and an inline revoke action. Mirrors the /admin/users visual language.
* feat(web): add Tokens nav link for admins + deep-link from admin/users row
Admin-only nav entry to /admin/tokens, and a per-row Tokens button on
/admin/users that prefills the token page's user filter via ?user=<email>.
* test(admin): cover /admin/tokens rendering, filter state, non-admin denial, revoke
Verifies admin can render the page (title + JS hooks present), a non-admin
is blocked, unauthenticated users are redirected, the admin list response
includes user_email / user_id / last_used_ip, and admin can revoke another
user's token.
* feat(web): modern redesign of /admin/tokens — hero, stat strip, refined table, responsive cards, a11y
* feat(web): ditch the table — /admin/tokens as a card stack, modern GitHub-style list
Replaces the table-based layout with a stack of self-contained token cards
inside a <ul role=list>. Each card is a flex row: avatar + name/meta on the
left, last-used block in the middle, status pill + outlined 'Revoke' button
on the right. Status and sort controls are pill-shaped toggle chips; user
email search has an inline search icon. No <table>/<tr>/<th>/<td> anywhere.
Responsive below 720px (card stacks vertically) and 480px (stat chips 2x2).
Preserves filter IDs (flt-status, flt-user, flt-last-used) and data-revoke
for existing tests.
* feat(web): add /tokens (role-aware) — single page for both user PAT CRUD and admin overview
- Rename admin_tokens.html -> tokens.html with a new is_admin context flag.
- New route GET /tokens: renders the same card-stack UI for everyone.
* Admins: loads /auth/admin/tokens, shows owner column + stat strip, keeps
the owner-email search box and sort-by-owner chip.
* Non-admins: loads /auth/tokens (own tokens only), hides owner column +
stat chips, adds a 'New token' CTA in the hero that opens a modal
(name + expires_in_days) calling POST /auth/tokens. The raw token is
revealed once in a dismissable banner and cleared from the DOM on Hide.
- GET /admin/tokens now 302-redirects to /tokens, preserving query string
(so the /admin/users deep-link ?user=foo still works).
* feat(web): /tokens full-bleed layout to match dashboard width
The hero, toolbar, and card list used to sit inside base.html's .container
(max-width 800px). Break out with negative horizontal margins so the page
spans the viewport like /dashboard does, capped at 1440px for readability
on very wide screens with a 24px gutter on each side.
- No change to base.html itself. The override is scoped to .tokens-page.
- body { overflow-x: hidden; } guards against rare horizontal scrollbars.
- < 808px viewport: reset to natural flow (mobile already narrower).
- ≥ 1488px viewport: cap to 1440px and re-center.
* chore(web): remove /profile template + nav link (redirect /profile -> /tokens)
The old /profile PAT CRUD page is now redundant — the modern /tokens page
covers both user and admin flows. Delete the template; the router's
/profile handler already 302-redirects to /tokens.
Nav cleanup:
- Remove the 'Profile' link.
- Show a single 'Tokens' link to every signed-in user (previously only
admins saw it).
- Active-state matches /tokens, /admin/tokens, and /profile so the
highlight survives the redirect chain.
/install CTA now points at /tokens instead of /profile.
* test: cover /tokens for admin + non-admin flows, /profile redirect, nav update
tests/test_admin_tokens_ui.py
- Point admin rendering test at /tokens directly and tighten assertions
(admin-only stat strip + owner search, non-admin CTA absent).
- Add test_non_admin_can_render_tokens_page: personal body, New-token CTA,
create-modal, reveal banner; stat strip + owner search absent.
- Add test_admin_tokens_redirects_to_tokens: 302 to /tokens, query string
(?user=...) preserved for the /admin/users deep-link.
- Add test_profile_redirects_to_tokens: 302 to /tokens.
- Add test_non_admin_can_create_pat_via_tokens_page_api: exercises the
POST /auth/tokens call that the non-admin create-modal submits.
tests/test_pat.py
- test_profile_page_renders -> test_profile_page_redirects_to_tokens:
assert the 302 + that /tokens lands on the unified non-admin body.
tests/test_web_ui.py
- admin_users nav assertion: 'Tokens' link present, 'Profile' link absent.
- Add test_nav_shows_tokens_link_for_non_admin: non-admins see the same
'Tokens' link (previously only admins did).
- Add test_profile_redirects_to_tokens back-compat check.
* feat(web): collapse 'What Claude Code will receive' by default
The preview block on /dashboard and /install now uses <details>/<summary>
so it is hidden by default. Click the chevron/title to expand and review
the clipboard payload. Markup stays in the DOM so existing tests that
assert on content continue to pass.
* fix(web): /tokens width — override .container to 1280px like dashboard
The negative-margin full-bleed trick was fragile and pushed content past
the right edge on deployed viewports. Replace with a simple max-width
override of base.html's .container on this page only, matching
/dashboard's 1280px center-column layout.
* feat(web): split role-aware /tokens into my_tokens.html + admin_tokens.html
* feat(web): router — separate handlers for /tokens (own) and /admin/tokens (all)
* feat(web): nav — show Tokens for all, add All tokens for admins
* test: cover split token pages (own vs all) + admin access gating
* feat(web): move 'My tokens' into a user dropdown menu
Replaces the separate Tokens/email/Logout nav trio with a rounded
avatar trigger that opens a dropdown containing the user's email,
role, a 'My tokens' link, and Logout. Admin-only 'All tokens' stays
as a top-level nav item since it's an admin function, not a personal
one. Click-outside and Escape close the panel; chevron rotates on
open.
* fix(api): allow PATs to list/get/revoke their own tokens (CLI flow)
The documented 'da auth token list/revoke' CLI flow in
docs/HEADLESS_USAGE.md uses a PAT, but the previous dependency
(require_session_token) returned 403. Only create_token must be
session-only to prevent PAT-spawning-PAT chains; listing and
revoking your own tokens is safe with a PAT.
* fix(api): cap expires_in_days at 3650 to avoid datetime overflow (500 to 400)
Values above ~11 million days overflowed datetime.max in
datetime.now(utc) + timedelta(days=...) and surfaced as an
unhandled OverflowError → 500. Cap at 10 years with a clear
400 instead; the no-expiry code path is unaffected.
* fix(api): relax _SAFE_URL_RE to allow path prefixes, underscores, and IPv6
The previous regex rejected legitimate reverse-proxy base_url values
(https://host/agnes/), underscores in Docker Compose hostnames, and
IPv6 literals (http://[::1]:8000). Widen the charset and allow an
optional trailing path. shlex.quote continues to provide
defense-in-depth against any metacharacter that slips through.
* fix(web): /login/email and Google OAuth propagate next_path
Previously, /login/email silently dropped the ?next=<path> query
param so the hidden form field rendered empty and login always
landed on /dashboard. Google's button was hard-coded to
/auth/google/login, ignoring next entirely.
- /login page now appends ?next to the Google button URL
- /login/email reads + sanitizes next, passes as template context
- google_login stashes sanitized next_path in session['login_next']
- google_callback pops + re-sanitizes and redirects there
Sanitization factored into app/auth/_common.safe_next_path.
* fix(auth): differentiate argon2 VerifyMismatchError from internal errors in web login
The previous except (VerifyMismatchError, Exception) collapsed both
cases into the generic 'invalid credentials' redirect, silently
hiding corrupted-hash / library errors from ops. Split the two:
bad password still gets ?error=invalid; anything else logs via
logger.exception and redirects with ?err=auth_internal so ops have
a visible signal and users don't retry forever against a broken
password_hash column.
* docs: correct CLAUDE.md table name (personal_access_tokens)
v7 note referenced 'access_tokens.last_used_ip' but the real table
is personal_access_tokens (as mentioned two tokens earlier in the
same bullet). Same-file consistency fix.
* chore(web): clarify admin user-reset UI — encourage Set password over the unused reset_token
POST /api/users/{id}/reset-password stores and returns a token
but no endpoint consumes it — the magic-link sender would log the
user in without prompting for a new password, defeating the reset.
- Drop the 'Reset' row action from admin_users so admins aren't
pointed at a dead end.
- Rewrite the reveal-modal copy to tell admins to use Set password
and explicitly note that the magic-link flow isn't available
for reset tokens in this build.
The API endpoint stays for API-level future use.
* test: cover PAT CLI flow, expires_in_days overflow, proxy base_url, next propagation
- tests/test_pat.py: PAT can list own tokens (200, was 403);
PAT can revoke own tokens (204); create_token returns 400 for
expires_in_days > 3650 (was 500 via datetime overflow).
- tests/test_cli_artifacts.py: _SAFE_URL_RE accepts reverse-proxy
path prefixes, underscores, and IPv6 literals; end-to-end check
of cli_install_script with a stubbed base_url that includes
a path prefix (Agnes behind /agnes/).
- tests/test_web_ui.py: /login propagates ?next to the Google
button URL; /login/email renders next in the hidden form field
and strips hostile values; unit coverage of safe_next_path.
* fix(security): use \Z instead of $ in URL/version allowlists (trailing-\n bypass)
Python regex `$` also matches just before a trailing newline, so a Host
header or AGNES_VERSION value like "good.example.com\n$(rm -rf /)"
would slip past the allowlist. `\Z` anchors to strict end-of-string.
shlex.quote downstream remains as defense-in-depth, but the allowlist
is now the tight gate it claims to be.
* fix(auth): PAT with null expiry omits JWT exp claim (DB is the source of truth)
Previously a PAT created with `expires_in_days=null` (user-requested
"never expires") set the DB `expires_at` to NULL (correct) but still
baked a ~100y `exp` claim into the JWT. That is misleading: the PAT
silently did expire eventually, despite the UI and API promising
"no expiry".
`create_access_token` now accepts `omit_exp=True` to skip the `exp`
claim entirely. `app/api/tokens.py` passes that when `expires_in_days
is None`. The authoritative expiry check lives in
`app/auth/dependencies.py`, which reads `expires_at` from the DB row —
unchanged. PyJWT accepts claim-less JWTs indefinitely.
* test: cover trailing-newline regex bypass + no-exp JWT for unbounded PAT
- test_safe_url_re_rejects_trailing_newline_bypass: asserts both
`_SAFE_URL_RE` and `_SAFE_VERSION_RE` reject values with a trailing
`\n` (previously accepted because Python `$` matches before `\n`).
- test_pat_null_expiry_jwt_has_no_exp_claim: POST /auth/tokens with
`expires_in_days=null`, decode the returned JWT, assert `exp` is
absent while `typ=pat`, `sub`, and `jti` are still present.
- test_pat_with_null_expiry_is_accepted_by_verify_token: verify_token
round-trips a claim-less JWT without ExpiredSignatureError.
- test_pat_null_expiry_end_to_end_allows_authenticated_request: use
the null-expiry PAT against /auth/tokens and confirm it authenticates.
* docs(auth): document X-Forwarded-For trust model in _client_ip
Deployment runs behind Caddy which strips incoming X-Forwarded-For
and sets its own, so the leftmost hop is trustworthy. Clarify that
the stored last_used_ip is audit-only and never used for access
control — if the app is ever exposed directly, this value becomes
client-settable.
* docs: /profile → /tokens in install.sh next-steps, CLI error, HEADLESS_USAGE, security skill
After splitting PAT management to /tokens (with /profile as a back-compat
302), stale references remained in user-facing text. Update them to the
canonical /tokens URL so shell scripts, CLI error hints, docs, and the
bundled security skill are all consistent.
v1.3.0 added google_monitoring_uptime_check_config + alert policies to the
module, but bootstrap-gcp.sh was not updated. Fresh customers (and the
first apply after upgrading existing customers) hit 403 on
monitoring.uptimeCheckConfigs.create.
Fix: enable monitoring.googleapis.com + grant roles/monitoring.editor to
the deploy SA. Idempotent (safe to re-run on existing projects).
Reads JWT_SECRET_KEY and KEBOOLA_STORAGE_TOKEN from Secret Manager,
combines with non-secret config, writes .env with chmod 600.
Run as part of VM startup or manually for rotation.
Creates agnes-deploy SA with Terraform-scoped roles, GCS tfstate bucket,
and generates a JSON key. Idempotent — safe to re-run.
Expanded .gitignore to block *-key.json files from ever being committed.
- smoke-test.sh: replace ((PASS++)) with PASS=$((PASS + 1)) to avoid
set -e abort when counter is 0 (bash returns exit 1 for ((0)))
- CalVer: use max(N) from existing tags instead of count, safe when
tags are deleted (e.g. deprecated version cleanup)
- CLAUDE.md: update schema version from v2 to v3
663 tests pass.
Makefile simplified to four targets (test, dev, docker, lint) aligned
with the current FastAPI/Docker architecture. scripts/README.md rewritten
to list only the active and migration scripts that still exist.
- QUICKSTART.md: replace data_description.md.example copy step with
note that tables are registered via the admin API or web UI
- NOTIFICATIONS.md: replace examples/ section with planned-feature note
- telegram_bot.md: remove examples/notifications/ rows from deployment
table and example scripts section; note feature is planned
- dev_docs/README.md: remove plan-corporate-memory.md entry
- duckdb_manager.py: update comment from remote_query.py to query API endpoint
migrate_registry_to_duckdb.py: imports tables from data_description.md
or table_registry.json into DuckDB table_registry with source columns.
migrate_parquets_to_extracts.py: copies parquets to /data/extracts/
and creates extract.duckdb with _meta + views.
GCP OS Login doesn't honor /etc/group changes for SSH sessions,
so analyst can't read /opt/data-analyst/.env even after usermod.
Wrapper now reads .remote_query.env from scripts dir (dataread group),
falls back to .env for admin users. The env file contains only
non-secret BQ config (project ID, location, data dir).
Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/.
Added to data-ops group on server.
New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH,
CONFIG_DIR, .env) so agents use simple:
ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table'
Updated claude_md_template.txt to use wrapper instead of raw commands.
find_project_root() and parse_data_description() now check CONFIG_DIR
env var first when looking for data_description.md. On server deployment,
data_description.md lives in instance/config/ (CONFIG_DIR), not in the
OSS repo's docs/ directory.
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.
Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).
Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).