agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	438ac78905	fix(admin/users): explain empty group dropdown instead of silent placeholder The 'Add to group' dropdown on /admin/users/{id} silently filtered out every Google-Workspace-managed group (rightly — the API would 409 on POST). On deployments where Admin and Everyone are both Workspace-mapped via AGNES_GROUP_{ADMIN,EVERYONE}_EMAIL and no custom Agnes groups exist yet (FoundryAI prod + dev today), the picker showed only the literal '— Pick a group —' option with the 'Add' button disabled. Operator had no indication that they needed to create a custom group first. Three states surface a hint below the picker now: - user is already in every group (literally nothing left) - every remaining group is Google-Workspace-managed (link to /admin/groups + admin.google.com explainer) - no groups exist at all The skip-google-managed logic stays — POST would still 409 on those rows, this just stops the empty-state from being a silent dead end.	2026-05-07 09:09:45 +02:00
Minas Arustamyan	50e0463501	feat(marketplace): clone-based plugin setup + auto-refresh SessionStart hook Adds end-to-end flow for installing and keeping the per-user filtered Claude Code marketplace in sync with the user's Agnes stack (admin RBAC grants \ MyAIStack opt-outs U /store installs). Setup (one-liner in install prompt step 5): `agnes refresh-marketplace --bootstrap` clones the per-user marketplace bare repo to ~/.agnes/marketplace, strips PAT from the cloned origin URL, registers the local path with Claude Code, and installs every plugin in the served manifest at --scope project. Replaces a 15-line inline shell sequence that tripped Claude Code's agent-driven `rm -rf` permission gate. Auto-refresh (SessionStart hook installed by `agnes init`): `agnes refresh-marketplace --quiet` runs every Claude Code session, fetches+resets the clone (server rebuilds as orphan commits, so pull --ff-only is impossible), and version-aware reconciles: - missing in workspace -> claude plugin install <name>@agnes --scope project - version differs -> claude plugin update <name>@agnes - matches -> skip Don't auto-uninstall plugins that disappeared from the manifest -- a transient empty manifest from the server would wipe the stack. Hook output: when --quiet AND something actually changed, emits Claude Code hook JSON on stdout -- `systemMessage` (transient toast) and `hookSpecificOutput.additionalContext` (model-side system reminder), both carrying the change summary plus a "/exit + restart Claude Code" instruction (Claude only scans plugins at session start). Windows hook compatibility: the refresh-marketplace hook command is wrapped in `bash -c "..."` because Claude Code on Windows runs hook commands directly without invoking a shell, so `2>/dev/null || true` would otherwise be passed as literal argv tokens. Cross-cutting: - cli/lib/marketplace.py: shared CLONE_DIR + MARKETPLACE_NAME constants. 
- cli/lib/hooks.py: SessionStart now has two independent entries (pull + refresh-marketplace) so a failure in one doesn't suppress the other; legacy `da sync` and prior single-pull layouts upgrade cleanly on re-init. - PAT injection on every git fetch via per-invocation credential helper (token in \$AGNES_TOKEN env, never in argv or .git/config). - Pre-snapshot of installed plugins captured BEFORE `claude plugin marketplace update` so silent auto-applied version bumps still fire notifications. - scripts/dev/agnes-client-reset.sh: cleans ~/.claude/plugins/marketplaces/agnes, ~/.claude/plugins/cache/agnes, drops uv build cache, documents workspace-scoped residue that can't be enumerated from the script. - app/web/setup_instructions.py: legacy AGNES_DEBUG_AUTH path also uses clone (direct HTTPS marketplace add is broken end-to-end on every Claude Code distribution -- stores response as single file, plugin source paths then 404). 28 new tests (test_cli_refresh_marketplace.py) + extended hook + setup template tests cover bootstrap, fetch+reset ordering, version-aware reconcile, project-path filtering, hook JSON shape, and the bash-c Windows wrapper invariant.	2026-05-07 06:59:13 +02:00
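The version-aware reconcile rules in commit 50e0463501 can be sketched as a pure planning function. This is only an illustration under assumptions: the function name `plan_reconcile` and the name-to-version dict shape are hypothetical, and only the command strings are taken from the commit message.

```python
from typing import Dict, List


def plan_reconcile(manifest: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Return the commands needed to bring `installed` in line with `manifest`.

    Both arguments map plugin name -> version (a hypothetical shape).
    """
    commands: List[str] = []
    for name, version in sorted(manifest.items()):
        if name not in installed:
            # Missing in workspace -> install at project scope.
            commands.append(f"claude plugin install {name}@agnes --scope project")
        elif installed[name] != version:
            # Version differs -> update.
            commands.append(f"claude plugin update {name}@agnes")
        # Matching versions need no action.
    # Plugins installed locally but absent from the manifest are deliberately
    # left alone: a transient empty manifest must not wipe the stack.
    return commands
```

Keeping the plan separate from execution makes the "never auto-uninstall" rule easy to test without invoking the real CLI.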
ZdenekSrotyr	df896816d8	chore: rename stale 'da' references to 'agnes' + CHANGELOG Drive-by docstring/comment cleanup in cli_artifacts.py and update_check.py. CHANGELOG entry for the auto-upgrade feature shipped in this branch.	2026-05-06 23:23:59 +02:00
ZdenekSrotyr	af2b866961	docs(version): clarify APP_VERSION scope + middleware /api prefix rationale	2026-05-06 23:23:23 +02:00
ZdenekSrotyr	57170bc556	feat(server): expose APP_VERSION + MIN_COMPAT_CLI_VERSION on /api/* response headers Adds X-Agnes-Latest-Version and X-Agnes-Min-Version headers to every /api/* response. CLI consumes these to hard-stop on incompatible drift. MIN_COMPAT_CLI_VERSION ships at 0.0.0 — no enforcement until a deliberate wire-protocol break bumps it. Also dedupes app version logic: app/main.py:_app_version() helper deleted, replaced by app/version.py:APP_VERSION as the single source of truth. test_app_version.py rewritten to target app.version.	2026-05-06 23:23:23 +02:00
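A client-side check against the headers introduced in 57170bc556 might look like the sketch below. The header name comes from the commit message; the helper names and the assumption that versions are plain dotted integers are hypothetical, and a real HTTP client would normally offer case-insensitive header lookup, which a plain dict does not.

```python
from typing import Mapping, Tuple


def _parse(version: str) -> Tuple[int, ...]:
    # Assumes plain dotted integers such as "0.39.0"; pre-release tags are not handled.
    return tuple(int(part) for part in version.split("."))


def is_cli_compatible(headers: Mapping[str, str], cli_version: str) -> bool:
    """True when the CLI version meets the server's advertised minimum."""
    # A missing header is treated like the shipped default of 0.0.0 (no enforcement).
    minimum = headers.get("X-Agnes-Min-Version", "0.0.0")
    return _parse(cli_version) >= _parse(minimum)
```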
ZdenekSrotyr	f4bc04958d	fix: Devin Review #1 — apply backtick mask to wrapping rewriter `_rewrite_user_sql_for_bigquery_query` does its own bare-name detection (mirroring the non-RBAC parts of `_bq_guardrail_inputs`). The backtick masking from #201 was applied to `_bq_guardrail_inputs` and the forbidden-table loop, but missed this third site — so a registered local-mode table name appearing as the table segment of a user-supplied full backtick path (e.g. ``\`prj.ds.orders\`` matching registered local ``orders``) tripped the cross-source guard and forced every backtick-path query into the 50-100× slower ATTACH-catalog fallback. Mask once at the top of the function, route both the BQ-name detection (line ~830) and the cross-source check (line ~867) through the masked copy. New regression test `test_local_name_inside_backtick_path_does_not_trip_cross_source` proves the wrapper now wraps when it should.	2026-05-06 21:06:21 +02:00
ZdenekSrotyr	824e3cb636	feat(query): registry-gate full backtick BigQuery paths (#201) Adds Pass 3 to `_bq_guardrail_inputs` that scans user SQL for full backtick paths `<project>.<dataset>.<table>` and gates them identically to the `bq."<dataset>"."<table>"` pass: - Project must match the configured BigQuery data project (`get_bq_access().projects.data`). Mismatch → HTTP 403 `bq_path_cross_project`. - Path must point at a registered row. Unregistered → HTTP 403 `bq_path_not_registered`. - Non-admin caller must hold a grant on the registered row's id. Missing grant → HTTP 403 `bq_path_access_denied`. Pre-fix, full backtick paths bypassed Agnes RBAC entirely — only the service account scope limited reach. Post-fix the boundary matches what `agnes catalog`-driven flows already enforce. Admin still bypasses the per-id grant check but cannot bypass registration or project match. Pass 3 also seeds `dry_run_set` for resolved registered paths so the cost-cap dry-run runs against the same physical table the user named — composing cleanly with the Layer 2 fail-fast fallback.	2026-05-06 18:02:53 +02:00
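The three gates that 824e3cb636 describes can be sketched as one ordered check. Everything here except the three error kinds is an assumption for illustration: the function name, the registry shape (`(dataset, table)` mapped to a row id), and the expectation that the path is a well-formed three-segment string.

```python
from typing import Dict, Optional, Set, Tuple


def gate_backtick_path(
    path: str,
    data_project: str,
    registry: Dict[Tuple[str, str], str],
    grants: Set[str],
    is_admin: bool,
) -> Optional[str]:
    """Return the error kind for a denied path, or None when access is allowed."""
    # Split from the right so the last two segments are dataset and table.
    project, dataset, table = path.rsplit(".", 2)
    if project != data_project:
        return "bq_path_cross_project"
    row_id = registry.get((dataset, table))
    if row_id is None:
        return "bq_path_not_registered"
    # Admin skips only the per-id grant check, never the two gates above.
    if not is_admin and row_id not in grants:
        return "bq_path_access_denied"
    return None
```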
ZdenekSrotyr	c32be3fe96	fix(query): cap-guard fallback retries original SQL, fails fast (#201) When BQ rejects the rewritten dry-run SQL with `bq_bad_request`, the cap-guard now retries with the user's ORIGINAL SQL instead of building a synthetic `SELECT * FROM <table>` per registered table. The synthetic path threw away user filters / projections / partition predicates and routinely ballooned the estimate to "full table size", falsely tripping `remote_scan_too_large` on legitimate narrow queries (typical issue #201 trace: rewriter corrupts a backtick path → BQ parse error → synthetic over-estimate → 400). Behaviour: - Rewritten SQL succeeds: same as before (issue #171 single-dry-run). - Rewritten SQL parse-errors, original SQL succeeds: use original estimate. Common case for users submitting BQ-native input. - Both fail with `bq_bad_request`: HTTP 400 `remote_estimate_failed` with a hint pointing at `agnes catalog` / BQ-native syntax. No silent over-estimate. - Non-parse BQ error (forbidden, upstream): still 502 as before. This is a behaviour change for clients matching error kinds — failure to estimate scan size now surfaces as `remote_estimate_failed` instead of being masked behind `remote_scan_too_large` from the synthetic path. Replaces the existing `test_guardrail_falls_back_to_per_table_estimate_on_bq_parse_error` (which pinned the old contract) with `test_fallback_tries_original_sql_first` and `test_fallback_fails_fast_on_pure_duckdb_syntax`.	2026-05-06 18:02:53 +02:00
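The fallback order in c32be3fe96 (rewritten SQL first, then the original, then fail fast) can be sketched as follows. The exception classes and the `dry_run` callable are hypothetical stand-ins; only the error kind `remote_estimate_failed` comes from the commit message.

```python
from typing import Callable


class BqBadRequest(Exception):
    """Stand-in for a BigQuery parse/validation rejection (hypothetical)."""


class RemoteEstimateFailed(Exception):
    """Stand-in for the HTTP 400 `remote_estimate_failed` response (hypothetical)."""


def estimate_scan_bytes(
    rewritten_sql: str, original_sql: str, dry_run: Callable[[str], int]
) -> int:
    try:
        return dry_run(rewritten_sql)
    except BqBadRequest:
        pass  # The rewrite may have produced SQL that BQ cannot parse.
    try:
        # Retry with exactly what the user wrote, keeping filters and projections.
        return dry_run(original_sql)
    except BqBadRequest as exc:
        # No synthetic full-table estimate: surface the failure instead.
        raise RemoteEstimateFailed("remote_estimate_failed") from exc
```

Any other exception type raised by `dry_run` propagates unchanged, which corresponds to the "still 502 as before" case.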
ZdenekSrotyr	720a2180c0	fix(query): rewriter respects backtick segments (#201) `agnes query --remote` corrupted user SQL when the request contained a full BigQuery backtick path (`<project>.<dataset>.<table>`) whose table segment matched a registered bare-name alias. The bare-name rewriter used `\b` word-boundary matching against the lower-cased SQL; both `.` and `` ` `` are non-word characters, so the regex fired INSIDE the user's backtick path and produced malformed nested-backtick SQL that BigQuery rejected at parse time. Fix: - Add `_mask_backticks(sql)` helper: replace each `…` segment with spaces of equal length, preserving offsets so word-boundary searches find positions only outside backticks. - `_bq_guardrail_inputs` (bare-name pass + forbidden-table pass) searches against the masked SQL. - `_rewrite_bq_table_refs_to_native` Pass 1 splits the SQL on `(\`[^\`]*\`)` and rewrites only the outside-backtick chunks. Pass 2 (`bq."ds"."tbl"` → backtick form) is unchanged — its prefix can't appear inside backticks. Adds three regressions covering the rewrite + guardrail paths.	2026-05-06 18:02:53 +02:00
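The offset-preserving mask described in 720a2180c0 can be illustrated with a short sketch. This is not the repository's `_mask_backticks`; it is a minimal version of the same idea, with a hypothetical name.

```python
import re


def mask_backticks(sql: str) -> str:
    """Blank out every `...` segment with spaces of the same length."""
    # Same-length replacement keeps every character offset stable, so match
    # positions found in the masked text are valid in the original text.
    return re.sub(r"`[^`]*`", lambda m: " " * len(m.group(0)), sql)


sql = "SELECT * FROM `prj.ds.orders` JOIN orders o"
masked = mask_backticks(sql)
# Only the bare `orders` outside the backtick path is still visible to \b matching.
hits = [m.start() for m in re.finditer(r"\borders\b", masked)]
```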
ZdenekSrotyr	81d065b1ea	fix: Devin Review #1 — bigquery_query() first arg uses billing project, not data In cross-project BQ setups (where billing != data), the SA typically has serviceusage.services.use on the billing project but not on the data project. The rewriter passed bq.projects.data as the first arg to bigquery_query(), which BQ uses as the execution + billing project → 403 USER_PROJECT_DENIED. Match the convention used everywhere else in the codebase (app/api/v2_scan.py, app/api/v2_sample.py, app/api/v2_schema.py, connectors/bigquery/extractor.py): backtick paths inside the inner SQL use the data project (resolves the actual table location), the bigquery_query() first arg uses the billing project (decides who pays + which project the job runs under). For single-project deploys the two are identical so the fix is a no-op there. Test pins the cross-project case: data-prj for backticks, billing-prj for the bigquery_query() first arg.	2026-05-06 14:07:38 +02:00
ZdenekSrotyr	aee585fac6	fix: devil's advocate R2 — narrow shared-client try, PID tmp suffix, Syntax error anchor R2 adversarial review surfaced 3 issues, all addressed: #1 cli/client.py:572-577 outer try/except wrapped both _get_shared_client() AND the actual download. A 401/403/404/5xx from the server triggered a full second download attempt with a fresh client — wasted bandwidth on hard failures, no fail-fast on revoked PAT. Narrowed the try to only the shared-client construction; the download itself is no longer retried under the fallback except. #2 concurrent agnes pull invocations (e.g. SessionStart hook + manual run) collided on bare <target>.tmp / <target>.partN paths — one process's in-progress write got yanked by the other's cleanup, manifest hash check then failed spuriously. Per-process suffix (<target>.{pid}.tmp, <target>.{pid}.partN) makes intermediate files disjoint; the final os.replace to the bare target is atomic so last-writer-wins. #3 _looks_like_bq_rewrite_parse_error patterns 'Syntax error' could false-positive on a query like WHERE log_msg = 'Syntax error in foo' that fails for an unrelated reason (quota, network) and has the literal substring echoed in the error text. Anchored to 'Syntax error: ' (with trailing colon) — BQ always emits the colon in this error format, user SQL string literals normally don't.	2026-05-06 13:57:29 +02:00
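The per-process temp suffix from item #2 of aee585fac6 can be sketched like this. The function name is hypothetical and the sketch covers only the `.tmp` file, not the `.partN` chunk files.

```python
import os
from pathlib import Path


def atomic_write(target: Path, data: bytes) -> None:
    """Write `data` to `target` through a temp file private to this process."""
    # Including the PID keeps concurrent writers off each other's temp files.
    tmp = target.with_name(f"{target.name}.{os.getpid()}.tmp")
    try:
        tmp.write_bytes(data)
        # os.replace is atomic on the same filesystem, so readers see either
        # the old file or the new one; the last writer wins.
        os.replace(tmp, target)
    finally:
        # Clean up only our own temp file (already gone after a successful replace).
        tmp.unlink(missing_ok=True)
```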
ZdenekSrotyr	e5645fd280	fix: devil's advocate R1 — chunked probe, parse-error heuristic narrow, pool settings refresh, content-length sanity, multi-project skip R1 adversarial review surfaced 5 issues, all addressed: #1 chunked download silently disabled in non-Caddy deployments (HEAD on GET-only FastAPI route returns 405). _probe_range_support now falls back to GET with Range: bytes=0-0 when HEAD fails — works against both Caddy file_server (HEAD-friendly) and dev FastAPI direct (GET-only). #2 parse-error fallback heuristic too broad — matched on Unrecognized name / Function not found / No matching signature / Invalid cast, which BQ surfaces for ordinary user-column typos. That triggered slow ATTACH-catalog retry on every typo (2× latency tax). Narrowed to just 'Syntax error' / 'syntax error' which are the genuine DuckDB-vs-BQ dialect mismatch markers. #3 apply_bq_session_settings was only run on fresh-built pool entries, not on reuse. An operator's /admin/server-config change to bq_query_timeout_ms wouldn't propagate to long-lived pooled sessions until restart. Fixed: re-apply on every pool acquire (idempotent + fail-soft). #4 content-length sanity bound — a misconfigured proxy returning a wildly inflated Content-Length would cause overlapping chunked Range requests against the actual file → corrupt assembled output (caught by manifest hash check, but only after wasted bandwidth). Cap at 100 GiB; above that, drop to single-stream. #5 rewriter assumed every BQ row resolves under the single bq.projects.data project. Bucket containing '.' suggests a project-qualified bucket (multi-project deployment); rewriter would silently target the wrong project. Conservative skip with regression test.	2026-05-06 13:50:46 +02:00
ZdenekSrotyr	8e56d45c68	fix(query): code-review fixes — outer LIMIT wrap, dollar-quoting, parse-error fallback Address code-reviewer findings on the bigquery_query() rewrite path: 1. Outer LIMIT wrap — bigquery_query() materialises BQ result into DuckDB before fetchmany sees it (vs ATTACH-catalog Storage Read API streaming). A user 'SELECT *' against a billion-row remote table would buffer the entire result before request.limit applied. Wrap rewritten SQL in an outer 'LIMIT N+1' so the cap pushes into the BQ job itself. 2. Dollar-quoted inner SQL — naive replace("'", "''") doubling missed DuckDB backslash-escape sequences (\\, \\n, \\t, …). A predicate like 'WHERE name = ''O\\'Brien''' was unsafe under the doubling path. DuckDB $bqq_inner$ … $bqq_inner$ form takes the inner SQL verbatim with no escapes whatsoever. Falls back to legacy doubling if user SQL improbably contains the literal tag. 3. Parse-error fallback — when the rewritten path fails with a BQ-side parse / validation error (DuckDB-only syntax like ::INT cast that survives identifier rewrite but BQ refuses), retry the user's original SQL via the legacy ATTACH-catalog path so the request still succeeds. Mirrors the existing dry-run fallback contract. 4. CHANGELOG — delete duplicate CLI bullets that landed under already-released [0.38.1] (file corruption from merge — entries are correctly under [0.39.0]).	2026-05-06 13:29:45 +02:00
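Items 1 and 2 of 8e56d45c68 (outer `LIMIT N+1` and dollar-quoted inner SQL) can be sketched together. The function names and the exact outer `SELECT * FROM bigquery_query(...)` shape are assumptions for illustration; the `$bqq_inner$` tag, the quote-doubling fallback, and the `N+1` limit come from the commit message. The sketch also assumes the project id is already validated and contains no quotes.

```python
TAG = "$bqq_inner$"


def quote_inner_sql(inner: str) -> str:
    """Quote `inner` as a DuckDB string literal, preferring dollar-quoting."""
    if TAG in inner:
        # Improbable case: the tag itself appears in the user SQL, so fall
        # back to classic single-quote doubling.
        return "'" + inner.replace("'", "''") + "'"
    # Dollar-quoted strings take their content verbatim, with no escapes.
    return f"{TAG}{inner}{TAG}"


def wrap_for_bigquery(project: str, inner: str, limit: int) -> str:
    # The outer LIMIT caps what DuckDB materialises; the extra row lets the
    # caller detect truncation.
    return (
        f"SELECT * FROM bigquery_query('{project}', {quote_inner_sql(inner)}) "
        f"LIMIT {limit + 1}"
    )
```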
ZdenekSrotyr	b2c1ff143c	fix(query): rewrite BQ-backed user SQL via bigquery_query() to enable predicate pushdown User SQL hitting query_mode='remote' BigQuery rows was 50-100x slower than the equivalent direct bigquery_query() call because DuckDB's master view (CREATE VIEW … AS SELECT * FROM bigquery.<ds>.<tbl>) does not push WHERE/SELECT/LIMIT into BQ in ATTACH-catalog mode. The BQ extension opens a Storage Read API session over the entire upstream table; on >100M-row sources this was 70-150s and frequently failed with 'Response too large to return'. Extract the existing dry-run rewriter's core (table-name → BQ-native backtick path) into a shared helper. Add an execution-path rewriter that wraps the whole user SQL in bigquery_query('<project>', '<inner>') so the BQ planner sees the full query and engages partition pruning + projection pushdown server-side. Conservative fall-through: cross-source JOINs (BQ ↔ Keboola/Jira local), queries already containing bigquery_query(, and unconfigured BQ project all skip the rewrite and run the original SQL via ATTACH-catalog so behavior degrades gracefully.	2026-05-06 13:02:34 +02:00
ZdenekSrotyr	6bc8739010	feat(admin/tables): show source, schedule, folder, registered, and sync-error in row	2026-05-06 11:09:02 +02:00
ZdenekSrotyr	b230d44687	docs(admin/tables): clarify NUL sentinel in unescapeShellQuoting	2026-05-06 10:15:56 +02:00
ZdenekSrotyr	05e535d743	fix(admin/tables): unescape shell-quoting backslashes in descriptions	2026-05-06 10:13:49 +02:00
ZdenekSrotyr	e369d0ed7b	fix(admin/tables): clamp long description to 2 lines so Actions stay reachable	2026-05-06 10:06:57 +02:00
ZdenekSrotyr	6c94d2cbce	Merge remote-tracking branch 'origin/main' into pr180-review # Conflicts: # CHANGELOG.md # pyproject.toml	2026-05-06 07:27:25 +02:00
ZdenekSrotyr	df2c33147c	fix: Devin Review on #194 round 2 — 3 BUG-class findings 1. instance.yaml overlay path now matches read site under STATE_DIR. Three sites updated: - app/api/admin.py:1005 (server-config endpoint writer) - app/api/admin.py:2610 (configure endpoint writer) - app/instance_config.py:106 (overlay reader) All three now go through _state_dir() so under flat-mount layout (STATE_DIR=/data-state) the irreplaceable instance.yaml overlay lands on the state disk (sdc) instead of the regenerable data disk (sdb). Without this fix, .env_overlay correctly went to the state disk while instance.yaml went to the data disk — config would be lost if an operator wiped sdb. 2. Strip customer-specific tokens from OSS repo per CLAUDE.md vendor-agnostic rule: - docker-compose.host-mount.yml: 'a deployer (Groupon FoundryAI)' → 'a deployer in production' - docker-compose.flat-mount.yml: 'caused 2026-05-05 in the Groupon FoundryAI deployment' → generic 'production failure mode' - docs/state-dir.md: rewrote the incident reference to describe the failure mode abstractly without naming the deployment; updated the recommendation table to say 'shadow-mount class' instead of dating the specific incident. 3. Updated docs/state-dir.md 'What reads STATE_DIR' to list all read/write sites including the three migrated in this round (admin.py, instance_config.py, marketplaces.py). ANALYSIS finding (tls-rotate.sh hardcoded host-mount.yml) deferred — same operator-side class as auto-upgrade.sh hardcoded host-mount, documented limitation per the PR body.	2026-05-05 20:02:50 +02:00
ZdenekSrotyr	b6543c9c55	fix: Devin Review on #194 — 2 BUG-class findings 1. .env_overlay write paths now match read path under STATE_DIR. app/main.py:343 reads via _state_dir() (post-PR #194), but two write sites still hardcoded ${DATA_DIR}/state/.env_overlay: - app/api/admin.py:2687 — configure endpoint secrets persistence - app/api/marketplaces.py:152 — marketplace PAT persistence Under flat-mount layout (STATE_DIR=/data-state) the admin UI wrote secrets to /data/state/.env_overlay while the app read from /data-state/.env_overlay, silently dropping the value on next restart. Both write sites now go through _state_dir(). 2. host-mount.yml: caddy inherits data:/srv:ro from base, but with no service populating the data: named volume (other services switched to direct /data binds), the inherited mount points at an empty Docker volume — try_files finds nothing, every parquet download falls through to uvicorn, defeating the v0.36.0 file_server bypass under the host-mount layout. Added a caddy override that restates all mounts including a direct /data:/srv:ro bind. Mirrors the comment + treatment already in flat-mount.yml.	2026-05-05 19:47:12 +02:00
Vojtech Rysanek	a303de0372	feat: STATE_DIR env var + flat-mount overlay (parallel disks) Introduces STATE_DIR as the single source of truth for the writable state directory path, with backward-compatible default of ${DATA_DIR}/state. Pairs with a new docker-compose.flat-mount.yml overlay that mounts the state disk in PARALLEL to the data disk (rather than nested under it). Why --- The default deployment topology nests state under data: sdb at /data, sdc at /data/state. That layout has known fragility documented in docs/state-dir.md — bind-propagation gotchas, two-writer collisions on the same prefix, mount-order coupling. The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. The flat layout (sdb at /data, sdc at /data-state — parallel, not nested) eliminates the nested-mount class entirely. Each disk is its own bind mount, recursive by default in modern Docker. No volume options to forget. No two-writer collision (host scripts and container app share /data-state at the same path, single namespace). What changes ------------ App code (Python): - src/db.py: new _get_state_dir() helper. get_system_db() and schema migration snapshot use it. - app/secrets.py: new _state_dir() helper. _load_or_generate() uses it for .session_secret and .jwt_secret. - app/main.py: .env_overlay loaded from _state_dir(). Host scripts: - scripts/ops/agnes-auto-upgrade.sh: STATE_DIR drives mount-sanity check and cert detection. Defaults preserve existing behavior. - scripts/ops/agnes-tls-rotate.sh: STATE_DIR drives CERT_DIR. New compose overlay: - docker-compose.flat-mount.yml: parallel /data and /data-state binds per service. Mutually exclusive with docker-compose.host-mount.yml; pick one based on disk topology. Documentation: - docs/state-dir.md: layout choice (A nested vs B flat), pros/cons, migration steps, and which code paths read STATE_DIR. 
Backward compatibility ---------------------- STATE_DIR defaults to ${DATA_DIR}/state — current behavior. Existing deployers that don't set the var see no behavior change. Migration to flat layout is opt-in per the runbook in docs/state-dir.md. Validation ---------- - bash -n on both host scripts: pass - docker compose config -f docker-compose.flat-mount.yml: resolves cleanly with all 6 services binding /data and /data-state directly - python3 import + helper exercise: STATE_DIR override works, default falls back to ${DATA_DIR}/state Companion to PR #191 (drop named-volume driver_opts in host-mount.yml). That PR fixes the immutability footgun for Layout A; this PR offers Layout B as the architectural alternative.	2026-05-05 19:28:07 +02:00
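The backward-compatible default in a303de0372 (`STATE_DIR` falls back to `${DATA_DIR}/state`) could be resolved roughly as below. The function name and the `/data` default for `DATA_DIR` are assumptions for this sketch, not a copy of the helpers named in the commit.

```python
import os
from pathlib import Path


def state_dir() -> Path:
    """Resolve the writable state directory, honouring a STATE_DIR override."""
    override = os.environ.get("STATE_DIR")
    if override:
        return Path(override)
    # Default layout: state nested under the data directory.
    # The "/data" default is an assumption made for this sketch.
    return Path(os.environ.get("DATA_DIR", "/data")) / "state"
```

Routing every read and write site through one such helper is what keeps writers and readers of files like `.env_overlay` pointing at the same path.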
ZdenekSrotyr	f2ce915458	fix: Devin Review on #188 commit `28423907` — 2 bugs 🚩 /api/v2/catalog still async def while now calling sync stat() `/api/v2/catalog` was left as `async def` when the rest of Tier 1 was converted, on the assumption it was lightweight. The new `_materialized_size_hint` populator added in this PR calls `Path.stat()` / `Path.exists()` for every visible row to bucket the parquet size — on a local FS that's microseconds, but on a network-mounted DATA_DIR (NFS / CIFS / GCS-FUSE) those syscalls can block the event loop. Convert to plain `def` so FastAPI auto-offloads to the thread pool, mirroring /api/query etc. 🔴 stream_download translates HTTPStatusError as generic transport error `response.raise_for_status()` inside the retry loop raises `httpx.HTTPStatusError` on 4xx/5xx. After retries exhaust, the new `isinstance(last_exc, httpx.HTTPError)` check at line 219 was eating the status code: HTTPStatusError is a subclass of HTTPError, so the generic transport translation produced "Unexpected error: HTTPStatusError" instead of the informative "Client error '401 Unauthorized' for url …" that callers expect. Fix: short-circuit HTTPStatusError before the HTTPError branch — it re-raises verbatim so the caller's status-code handling + the rich server error body (e.g. 401 expired token, 403 cross_project_forbidden) reach the analyst. api_get / api_post / api_delete / api_patch don't have the same bug: httpx Client.get/etc. don't raise HTTPStatusError unless the caller explicitly calls .raise_for_status(), and our wrappers don't. Only stream_download does, hence the targeted fix there.	2026-05-05 18:29:44 +02:00
ZdenekSrotyr	e5fb913cec	perf: Tier 1 event-loop unblocking — async def → def on BQ-bound handlers Five hottest BQ-touching endpoints were `async def` but invoked synchronous DuckDB / BQ-extension calls inside the body. Under uvicorn's single event loop that meant a single heavy `agnes query --remote` (waiting up to ~200 s for BQ's jobs.query) froze EVERY other request — /api/health, dashboard, auth, even another query — for the full BQ wait. Operators saw "VM idle, app frozen" during PR #188's testing. Convert to plain `def` so FastAPI auto-offloads the body to the anyio thread pool. Event loop stays free for non-BQ requests. - app/api/query.py:execute_query - app/api/v2_scan.py:scan_estimate_endpoint, scan_endpoint - app/api/v2_sample.py:sample - app/api/v2_schema.py:schema Audit: 0 `await` statements in any converted handler (verified file-by-file), so the rename is safe. Tests in tests/test_v2_*.py called the handlers via `asyncio.run(...)` which now fails on a non-coroutine return; swapped for direct calls (asyncio.run( -> ( ) — keeps paren balance). Plus AGNES_THREADPOOL_SIZE env var (default 200, was anyio's stock 40) in app/main.py:lifespan. Set via anyio.to_thread.current_default_thread_limiter().total_tokens. 200 is comfortable headroom for <50 concurrent analysts; bump for more. 480/480 impacted tests pass (the 2 remaining errors are a pre-existing fixture setup issue in test_reader_smoke_matrix.py unrelated to this change).	2026-05-05 17:44:08 +02:00
ZdenekSrotyr	30e81a15b9	feat(workspace-prompt): decision tree + size-hint so analyst Claude gets it right first try Three concrete changes addressing the "analyst Claude misuses the CLI" class of bugs (image.png table — issues #3, #5, plus the recurrent "how big is this table" guesswork): 1. config/claude_md_template.txt — the template agnes init writes to <workspace>/CLAUDE.md. Surfaces every catalog-row field with a why, adds a query_mode-based decision tree, explicit --estimate scoping (snapshot create ONLY — was the #1 first-try error), an agnes fetch → agnes snapshot create rename note, and a 6-row failure-mode table that maps each common error wording to its right next step. 2. app/api/v2_catalog.py — populate rough_size_hint for local + materialized rows from the on-disk parquet size, bucketed small/medium/large/very_large. Was hardcoded null with a TODO; AI couldn't tell "is this 6.8 GB" without a failed --remote round-trip. 3. cli/update_check.py — the [update] banner survived the da→agnes rename and printed "[update] da X is out of date" on every command, training analysts to associate the binary with the old name. Verified by rendering the template against representative contexts (33/33 tests pass) and running every use case from the original screenshot through the real CLI against a dev VM.	2026-05-05 16:44:24 +02:00
ZdenekSrotyr	1be997f6d4	feat(caddy): file_server for parquet downloads — bypass uvicorn A single analyst's multi-GB `agnes pull` held the only uvicorn worker for the duration of the stream, starving UI / /api/health / every other API endpoint. Container flipped to `unhealthy`. Triggered while a 6.8 GB `order_economics` pull was in-flight on prod 2026-05-05. Caddy now intercepts `GET /api/data/{table_id}/download` and serves the parquet directly via sendfile from the data volume (mounted r-o at /srv inside the caddy container). RBAC enforced by `forward_auth` to a new lightweight `GET /api/data/{table_id}/check-access` endpoint (returns 204 / 403) — the bulk transfer never reaches uvicorn. Path discovery via `try_files` over the known extract.duckdb v2 source subdirs. Anything not at a static path falls through to the existing app handler so legacy `src_data/parquet` and future connectors still work without a Caddyfile change. Non-Caddy deployments are unchanged. Stage 1 (multi-worker uvicorn) was considered but blocked by the single-writer DuckDB lock on system.duckdb — workers > 1 would crash at startup on "Could not set lock on file", the same race that pushed the scheduler from in-process writes to HTTP-via-app. Multi-reader workers + single-writer coordination is out of scope for this PR.	2026-05-05 16:41:33 +02:00
ZdenekSrotyr	4f04235502	feat(bigquery): bq_query_timeout_ms knob; default 600s (was 90s) DuckDB BigQuery extension defaults `bq_query_timeout_ms` to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. `agnes query --remote` HTTP 400'd with `Binder Error: Query execution exceeded the timeout. Job ID: ...` whenever the underlying BQ job ran longer than 90 s, even though the job itself was healthy. Add `data_source.bigquery.query_timeout_ms` (default 600 000 ms = 10 min, sentinel 0 falls through to the extension default). Applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session: orchestrator's `_remote_attach` ATTACH path, BqAccess session factory, and the standalone extractor. Configurable via `/admin/server-config` UI. Fail-soft: extension versions that don't recognise the setting silently keep the default rather than poisoning the session.	2026-05-05 16:40:40 +02:00
ZdenekSrotyr	8d8d2c219e	refactor(cli-store): pull/info → agnes admin store; add agnes store mine Backup-orchestration commands were split across two namespaces (pull in agnes store, push in agnes admin store), which broke the operator mental model — pull/push are a paired operation and should sit together. Move pull + info into agnes admin store so all bulk operations share one help screen. Add agnes store mine as the user-facing equivalent — calls the same /api/store/bundle.zip endpoint with ?owner=me, which the server resolves to the caller's user_id. Authors can archive their own uploads without admin role; whole-Store bulk reads stay admin-flavored as a discoverability hint. Server: 3-line addition to export_bundle handles owner='me' as a magic alias for the caller. No new endpoint. Tests updated: pull/info expectations move from agnes store to agnes admin store; new tests cover agnes store mine and the ?owner=me server resolution. 69/69 store tests green locally.	2026-05-05 13:49:18 +02:00
ZdenekSrotyr	3d63965a67	Merge remote-tracking branch 'origin/main' into pr180-review # Conflicts: # CHANGELOG.md # app/web/templates/_app_header.html	2026-05-05 12:05:50 +02:00
ZdenekSrotyr	a8f9d065c8	feat(store): bundle export/import + agnes store update + agnes admin store push Adds whole-Store backup/restore primitives so an external CI/CD job can mirror the Store to a git repo (and restore back from one). REST: - GET /api/store/bundle.zip — deterministic ZIP of all (filtered) Store entities. Layout: manifest.json + entities/<id>/{plugin,assets}/. Manifest carries owner_email for cross-instance restore. Auth: any authenticated user (Store is community-open). - POST /api/store/import-bundle — admin-only restore. Modes merge|replace|skip; owner resolution by email with stub-disabled-user fallback when the email is unknown on the target instance. CLI: - agnes store update <id> [--description X] [--zip PATH] ... — in-place edit (server PUT permits owner OR admin per F4). Closes the missing edit affordance for analysts who want to fix a typo or push a new ZIP without losing install_count. - agnes store pull [-o store.zip] [--unpack DIR] — download the bundle. --unpack streams + extracts so an external git-backup workflow can drop the tree straight into a repo and `git add .`. - agnes store info [--json] — counts + size summary. - agnes admin store push <zip-or-dir> [--mode ...] — admin-only restore. Auto-zips a directory client-side so a working-tree → server round-trip is one command. cli/v2_client.py gains api_get_stream helper for binary downloads. Tests: 5 new server tests (bundle shape + filters + round-trip + stub user creation + skip mode + admin-only gate) + 11 new CLI tests (update, pull/unpack, info, admin push). 66/66 store-related tests green locally.	2026-05-05 11:51:31 +02:00
ZdenekSrotyr	952dc9e74d	fix(profile-sessions): tolerate stat() failures on individual jsonl (Devin Review on #179 ) The previous gather used `sorted(glob, key=lambda p: p.stat().st_mtime)`. A transient OSError (race with delete, permission flicker, EBADF on a weird filesystem) on any single file raised through the lambda and 500-ed the whole page. Reworked: stat each path under try/except into a (path, stat) list, sort the already-statted entries. Bad files drop silently from the listing. Regression test test_profile_sessions_page_tolerates_stat_failures patches Path.stat to raise on one of two files, asserts the page returns 200 with the good row rendered and the bad row dropped.	2026-05-05 09:53:06 +02:00
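The stat-then-sort rework described in this commit could look roughly like the sketch below. The function name `gather_statted` is hypothetical and this is not the repository's actual code; it only illustrates statting each path under try/except before sorting.

```python
import tempfile
from pathlib import Path


def gather_statted(paths):
    """Stat each path once, drop the ones that fail, sort the rest by mtime."""
    statted = []
    for path in paths:
        try:
            statted.append((path, path.stat()))
        except OSError:
            # Raced with a delete, permission flicker, etc.: skip this row only.
            continue
    # Sorting uses the stat results already collected, so it cannot raise OSError.
    statted.sort(key=lambda entry: entry[1].st_mtime)
    return statted


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "a.jsonl"
    good.write_text("{}")
    missing = Path(tmp) / "gone.jsonl"  # never created, so stat() raises
    rows = gather_statted([good, missing])
    listed_names = [path.name for path, _ in rows]
```

Because the sort key reads from the collected `stat` results, one bad file can no longer fail the whole listing.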
ZdenekSrotyr	d878764ac1	fix(session-collector-api): mirror sibling endpoints' audit-on-exception (Devin Review on #179 ) Devin flagged that run_session_collector still had the same audit-skip gap I fixed in run_verification_detector and run_corporate_memory in the previous two rounds — a PermissionError walking /home, an OSError on /data/user_sessions mkdir, or any other unhandled exception from collector.run() would skip the audit_log row and only show in docker logs. Same try/except + unhandled_error pattern as the sibling endpoints. All three LLM-pipeline run-* endpoints now record their failures the same way; /admin/scheduler-runs sees them. Regression test in tests/test_admin_run_endpoints.py::TestRunSessionCollector::test_unhandled_exception_still_audits.	2026-05-05 09:31:33 +02:00
ZdenekSrotyr	9ebe991b55	feat(profile): per-session jsonl download from /profile/sessions User feedback during e2e of #179: the listing page is nice but I want to grab the raw jsonl and look at what's inside. Adds GET /profile/sessions/<filename>: - Auth via get_current_user (owner-only). - Path safety: rejects "/", "\", "..", leading ".", and any non-".jsonl" filename. The served path resolves under ${DATA_DIR}/user_sessions/<caller.id>/; if resolution escapes that base directory, returns 404 (never 403, so existence of other users' files isn't leaked). - FileResponse with Content-Disposition: attachment. UI: Download button per row in profile_sessions.html. Tests in test_web_ui.py: path-traversal / nested / dotfile / non-jsonl all 404 for owner; unauthenticated 302/401/403; authenticated owner gets 200 + correct Content-Disposition.	2026-05-05 09:15:12 +02:00
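A minimal sketch of the filename checks listed in this commit, assuming a hypothetical helper `resolve_session_file` that returns `None` wherever the endpoint would answer 404. The real route's code is not shown in the log, and the existence check is left to the caller.

```python
from pathlib import Path


def resolve_session_file(base_dir, filename):
    """Return the resolved path under base_dir, or None if the name is unsafe."""
    if "/" in filename or "\\" in filename or ".." in filename:
        return None
    if filename.startswith(".") or not filename.endswith(".jsonl"):
        return None
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    # A symlink could still point outside the caller's directory; reject that too.
    if not candidate.is_relative_to(base):
        return None
    return candidate


base = Path("user_sessions") / "user-123"  # illustrative per-user directory
ok = resolve_session_file(base, "session.jsonl")
```

Returning the same "not found" result for every rejected name matches the commit's point that a 403 would leak the existence of other users' files.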
ZdenekSrotyr	e86da72997	fix(corporate-memory-api): mirror verification-detector audit-on-exception (Devin Review on #179 ) Devin flagged that run_corporate_memory still had the same audit-skip gap I just fixed in run_verification_detector — if collect_all() throws anything other than the already-translated ValueError (DuckDB lock, network blip, unexpected SDK error), the audit_log row was never written and /admin/scheduler-runs missed the failure. Same try/except + unhandled_error pattern as the verification_detector fix from `4c4dfee8`. Regression test in tests/test_admin_run_endpoints.py::TestRunCorporateMemory::test_unhandled_exception_still_audits.	2026-05-05 09:11:13 +02:00
ZdenekSrotyr	4c4dfee8e6	feat(profile): /profile/sessions page + audit on detector exception + correct SCHEDULER_AUDIT_ACTIONS Three changes addressing user feedback during e2e test of #179 + Devin Review on `e86dd5ed`. 1) /profile/sessions — new self-service user page in the user menu. Lists all session jsonls the caller uploaded via `agnes push` joined against session_extraction_state. Each row shows uploaded_at, file size, status badge (pending/processed/extracted), processed_at, and items_extracted. The page docstring + help text explicitly call out that items_extracted=0 means the verification detector ran fine but the LLM found no claims to track — that's the documented "no items" outcome, not a broken pipeline. Closes the gap surfaced during the e2e test of #176 where a user could see their sessions on disk and process them through the LLM but had no UI to inspect what happened. 2) run_verification_detector audits unhandled exceptions (Devin #1). If detector.run() threw anything other than the already-translated ValueError, the audit_log row was never written. The endpoint now wraps detector.run in try/except, records the exception in audit_params["unhandled_error"], then re-raises as 500 after audit. The /admin/scheduler-runs page surfaces the failure row with the error type + message. 3) SCHEDULER_AUDIT_ACTIONS list corrected (Devin #2). Previous list had "marketplaces_sync_all" (wrong — actual is "marketplace.sync_all") plus "data_refresh" and "scripts_run_due" which app/api/sync.py and app/api/scripts.py don't write to audit_log. Fixed to the four actually-logged strings; comment points at the missing audit calls as a follow-up. Tests: tests/test_web_ui.py adds TestAdminRoleGuards::test_profile_sessions_page_no_admin_required and tightens test_admin_scheduler_runs_page_admin_only to assert the correct marketplace.sync_all string.	2026-05-05 08:57:35 +02:00
ZdenekSrotyr	f0d091f721	fix(store): scratch dir leak on ZIP validation failure (Devin Review) create_entity + update_entity created the `scratch` temp dir inside one try/finally but cleaned it up in a separate one. Validation HTTPExceptions raised by _safe_zip_extract (zip_unsafe_path, zip_too_large_uncompressed) or the BadZipFile→422 conversion exited the first scope, and the second finally was never entered → temp dir leaked on every failed upload. Devin flagged this on the F2 commit. The leak pre-existed (zip_unsafe_path was the original vector); F2 added zip_too_large_uncompressed to the same broken cleanup path. Fixed by collapsing scratch creation + cleanup into one outer try/finally that covers both extraction AND metadata/bake; the inner try/except/finally still handles BadZipFile→422 + tmp file cleanup. Same restructure in update_entity. Regression test `test_scratch_dir_cleaned_up_after_failed_extraction` triggers a zip_unsafe_path 422 and asserts tmp/agnes_store_* contains no leaked dirs.	2026-05-05 08:52:15 +02:00
ZdenekSrotyr	fd3c76d21b	fix(store): security + correctness blockers found in PR review (F1, F2, F4, F5) Three independent reviews of PR #180 surfaced four real defects in the new Store / my-ai-stack surface. CHANGELOG entries detail each; one-liners: - F1 video_url XSS: any authenticated user could upload a Store entity with `video_url=javascript:...` and pop XSS in any viewer's session via the `<a href=...>` "Watch video" link in store_detail.html. Jinja2 autoescape doesn't block URI schemes inside attribute values. Fixed by scheme-validating to http(s) only on create + update; 400 invalid_video_url. - F2 ZIP decompression bomb: _safe_zip_extract checked path-traversal but not declared file_size totals — a 50 MB compressed upload at 1:1000 ratio decompresses to 50 GB and DOS the host disk. Fixed by summing zinfo.file_size across infolist() and refusing > 200 MB before extractall touches disk. 413 zip_too_large_uncompressed. - F4 admin authz parity: PUT /api/store/entities/{id} was owner-only while DELETE allowed owner OR admin; the store-detail page hid Edit/Delete buttons from admin even though DELETE was permitted. Fixed by allowing admin on PUT and passing is_admin to the template; gate is now is_owner OR is_admin everywhere. - F5 cross-owner suffix collision: sanitize_username is many-to-one (alice.smith / alice_smith both → alice-smith). Two such users uploading entities with the same display name produced identical `<name>-by-<username>` suffixes, silently colliding in the served agnes-store-bundle on-disk paths AND the manifest catalog (Claude Code dedupes by plugin.json `name`). Fixed by enforcing global uniqueness on the suffixed value at create_entity; 409 conflict_global_suffix. F3 (ZIP symlink members) was investigated and confirmed to be a false-positive — Python's stdlib ZipFile.extractall does not honor symlink mode bits, so no exploit exists. 9 new regression tests in tests/test_store_api.py::TestStoreSecurityFixes covering all four. 
Test run locally: 60/60 store-related tests pass.	2026-05-05 08:18:02 +02:00
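The F1 and F2 checks can be sketched as follows. The helper names are hypothetical and the 200 MB cap is taken from the commit text; the size check reads only the sizes declared in the ZIP directory, as the commit describes, before anything is extracted.

```python
import io
import zipfile
from urllib.parse import urlparse

MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024  # cap named in the commit message


def declared_uncompressed_size(zf):
    """Sum the uncompressed sizes declared for every member of the archive."""
    return sum(info.file_size for info in zf.infolist())


def is_zip_too_large(zf, limit=MAX_UNCOMPRESSED_BYTES):
    """True when the declared total exceeds the limit (caller answers 413)."""
    return declared_uncompressed_size(zf) > limit


def is_allowed_video_url(url):
    """Accept only http(s) URLs; javascript: and other schemes are refused."""
    return urlparse(url).scheme in ("http", "https")


# Build a small archive in memory to exercise the size check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"x" * 1000)
archive = zipfile.ZipFile(buffer)
total = declared_uncompressed_size(archive)
```

Checking the scheme server-side is what closes F1, since template autoescaping does not inspect URI schemes inside attribute values.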
ZdenekSrotyr	e86dd5edc5	fix(anthropic): strict json_schema (additionalProperties=false) + add /admin/scheduler-runs UI E2E test on a real BQ deploy showed every verification-extraction call fails with HTTP 400 invalid_request_error: "output_config.format.schema: For 'object' type, 'additionalProperties' must be explicitly set to false". The Anthropic structured-output API now requires the field on every object node in the json_schema. Fix: connectors/llm/anthropic_provider.py wraps the caller-supplied schema through a recursive _strict_json_schema() walker that adds the field where missing (preserving any explicit override), then passes the strict variant to the API. Six unit tests in TestStrictJsonSchema pin the recursion across nested objects, array items, and the no-mutation invariant. Adds /admin/scheduler-runs — a read-only admin page that surfaces the last 200 audit-log entries from scheduler-driven actions. New AuditRepository.query_actions(actions, limit) helper, new admin nav entry. Failed scheduler ticks (HTTP 401, network errors) don't reach the audit_log; the page calls that out with a hint to set SCHEDULER_API_TOKEN if no rows show up.	2026-05-05 08:00:57 +02:00
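A recursive walker of the kind this commit describes might be sketched as below. This is an illustration under assumptions, not the body of `_strict_json_schema()` in connectors/llm/anthropic_provider.py; in particular it walks every nested value generically, so it would also touch object-shaped literals under keys such as `default`.

```python
import copy


def strict_json_schema(schema):
    """Return a copy with additionalProperties=False on object nodes lacking it."""

    def walk(node):
        if isinstance(node, dict):
            if node.get("type") == "object":
                # setdefault keeps any explicit override supplied by the caller.
                node.setdefault("additionalProperties", False)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    result = copy.deepcopy(schema)  # never mutate the caller's schema
    walk(result)
    return result


original = {
    "type": "object",
    "properties": {
        "explicit": {"type": "object", "additionalProperties": True},
        "items": {"type": "array", "items": {"type": "object"}},
    },
}
strict = strict_json_schema(original)
```

The deep copy is what gives the no-mutation invariant mentioned in the commit's test description.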
ZdenekSrotyr	e68c2d3f0f	fix(session-collector): argv-free run() helper, drop SystemExit footgun (Devin Review on #179 ) run_session_collector called collector.main() which did argparse.parse_args() on uvicorn's sys.argv (['app.main:app', '--host', ...]) → sys.exit(2) → SystemExit(2), which inherits from BaseException, escapes FastAPI handlers, and propagates through the thread pool. Every scheduler tick that fired the endpoint either 500-ed or risked killing the uvicorn worker. services/session_collector/collector.py now exposes run(dry_run, verbose) that returns (rc, stats); main() is a thin CLI shim that parses argv and delegates. The admin endpoint calls run() directly and audit-logs the per-run stats (users_processed, files_copied, files_skipped) instead of just the rc. Three regression tests in TestRunHelper. Closes Devin Review finding on app/api/admin.py:2819 (#179).	2026-05-05 06:31:55 +02:00
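The run()/main() split can be illustrated with the sketch below. The stats keys come from the commit message, while the function bodies are placeholders rather than the collector's real logic.

```python
import argparse


def run(dry_run=False, verbose=False):
    """Argv-free entry point: never touches sys.argv, so a web handler can call it."""
    stats = {"users_processed": 0, "files_copied": 0, "files_skipped": 0}
    return 0, stats


def main(argv=None):
    """Thin CLI shim: parse argv, then delegate to run()."""
    parser = argparse.ArgumentParser(prog="collector")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args(argv)
    rc, _stats = run(dry_run=args.dry_run, verbose=args.verbose)
    return rc


# Parsing a server-style argv reproduces the failure: argparse exits with code 2.
try:
    main(["--host", "0.0.0.0"])
    exit_code = None
except SystemExit as exc:
    exit_code = exc.code
```

`SystemExit` derives from `BaseException`, not `Exception`, which is why an ordinary `except Exception` handler does not catch it.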
ZdenekSrotyr	9f33e24bf9	fix(config): overlay-aware LLM consumers + env-ref resolution (#179 review) Devin BUG: /api/admin/configure seeds an ai: block to the writable overlay at DATA_DIR/state/instance.yaml, but the three LLM consumers imported from config.loader.load_instance_config — which reads the static config dir only. Even if they had read the overlay, the loader ran yaml.safe_load directly without passing through _resolve_env_refs, so '${ANTHROPIC_API_KEY}' would have stayed a literal placeholder. The pipeline appeared to work because the factory falls back to the env var directly, but the overlay path itself was dead code. Two fixes, both required: 1. Switched the three LLM consumers to app.instance_config.load_instance_config: - services/corporate_memory/collector.py:collect_all - services/verification_detector/__main__.py:main - app/api/admin.py:run_verification_detector 2. app/instance_config.py runs the loaded overlay through config.loader._resolve_env_refs before the deep-merge, so '${ANTHROPIC_API_KEY}' resolves at config-load time. New regression suite tests/test_instance_config_overlay.py pins: - env-ref resolution against the overlay (resolved when env set, empty when env missing — never the literal placeholder) - deep-merge still preserves static-only sections - the three consumers reach app.instance_config (inspected via inspect.getsource so a future refactor that reverts the import fails the test) - end-to-end: a seeded overlay + ANTHROPIC_API_KEY env reaches the factory with a resolved api_key	2026-05-05 05:57:22 +02:00
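A minimal sketch of `${VAR}` resolution over a loaded config tree, assuming the behaviour the regression suite pins: resolved when the variable is set, empty when it is missing. The name `resolve_env_refs` and the regex are hypothetical stand-ins, not the actual `config.loader._resolve_env_refs`.

```python
import os
import re

_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value, env=None):
    """Recursively replace ${VAR} in strings; missing variables become ''."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        return _ENV_REF.sub(lambda match: env.get(match.group(1), ""), value)
    if isinstance(value, dict):
        return {key: resolve_env_refs(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_env_refs(item, env) for item in value]
    return value  # numbers, booleans, None pass through unchanged


overlay = {"ai": {"api_key": "${ANTHROPIC_API_KEY}", "max_tokens": 1024}}
resolved = resolve_env_refs(overlay, {"ANTHROPIC_API_KEY": "sk-test"})
unset = resolve_env_refs(overlay, {})
```

Running this pass before the deep-merge means consumers never see the literal placeholder string.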
ZdenekSrotyr	98a8aba3be	fix(tests): align test_llm_connector with new factory + fail-fast (#179 review) The PR rewrote collect_all() to call the new create_extractor_from_env_or_config() helper, but the existing tests still mocked the old direct create_extractor() symbol and the old silent-skip-on-missing-config behavior. Five tests in TestCorporateMemoryCollector and one in TestCollectorExtractorIntegration were red on the PR branch. Changes: - Tests now mock connectors.llm.create_extractor_from_env_or_config (the symbol the collector imports lazily). - Renamed test_collect_all_no_ai_config_skips -> test_collect_all_no_ai_config_or_env_raises and test_collector_handles_invalid_config -> test_collector_raises_on_invalid_config. Both assert pytest.raises(ValueError) — the explicit fail-fast semantics defect 5 of #176 was supposed to enforce. - collect_all() no longer swallows the factory's ValueError into stats["errors"]; it propagates so the scheduler / admin endpoint surface the actionable misconfiguration message instead of pretending the run was a no-op. - /api/admin/run-corporate-memory translates the propagated ValueError into a 500 with the factory's message, matching /api/admin/run-verification-detector.	2026-05-05 05:55:01 +02:00
Minas Arustamyan	af72c5d259	fix(setup): walk TLS chain for trust-store match — Let's Encrypt cleanup `_read_agnes_ca_pem()` decides whether the served fullchain.pem needs trust-bootstrapping in the rendered setup prompt. Pre-fix it only checked the leaf's immediate issuer against `certifi`'s trust store. For Let's Encrypt that's the intermediate (R13), which `certifi` does not ship — only roots are in trust stores. So a publicly-trusted LE chain still tripped the "needs bootstrap" path and the setup prompt emitted a step-0 TLS trust block + clone-fallback marketplace block that no client actually needs (Bun-compiled `claude.exe`, system git, Python via certifi all validate the chain through the bundled ISRG Root X1). Now we walk every cert in the fullchain (leaf + intermediates) and return None the first time any cert's issuer is in the certifi trust store — that captures the standard "leaf signed by intermediate signed by publicly-trusted root" shape. Trusted subjects are read once into a set for O(1) lookup. Self-signed (leaf.issuer == leaf.subject) and private-CA chains (no chain link's issuer in certifi) keep their previous "return PEM" behavior, so deployments that genuinely need the bootstrap still get it. Validated end-to-end against the live VM at agnes-marustamyan.groupondev.com (LE R13 → ISRG Root X1): - Let's Encrypt fullchain → has_ca=False (was True) - Self-signed cert → has_ca=True - Corporate-CA chain (private root) → has_ca=True - Missing fullchain.pem → has_ca=False	2026-05-05 04:55:06 +02:00
Minas Arustamyan	d5a7c9ad79	feat(store): /store + /my-ai-stack — community marketplace + per-user composition Adds a community-driven Store where any authenticated user uploads skills/agents/plugins as ZIPs, plus /my-ai-stack as the per-user composition view. The served Claude Code marketplace is now: (admin_granted ∖ opt_outs) ∪ store_installs Skill + agent installs are merged into a single `agnes-store-bundle` plugin in the served marketplace; type=plugin uploads stay standalone. Names are suffixed with `-by-<owner-username>` at upload time so two owners can use the same display name without colliding in Claude Code's flat skill/agent namespace. Schema v23 → v24 adds three tables: - store_entities — community-uploaded skills/agents/plugins - user_store_installs — what each user has chosen to install - user_plugin_optouts — opt-out overlay on top of admin grants Admin grant-delete drops every user's opt-out for that plugin so re-grant resets cleanly to enabled (no sticky personal preference). UI: - /store — e-commerce-style listing with type/category/owner filters, search, pagination, owner-aware [Install] buttons, clickable cards - /store/new — 2-step upload wizard with drag & drop, preview validation (POST /api/store/entities/preview), docs multi-upload, photo + video URL - /store/{id} — detail page with hero, file list, docs, owner actions (Edit/Delete) for the uploader - /my-ai-stack — Granted plugins (toggle opt-out) + From the Store (uninstall) sections - Admin nav: Marketplaces moved into Admin dropdown, renamed to "Curated Marketplaces" Validation hardening: type-mismatch guards reject skill ZIP uploaded as agent (or vice versa), and plugin ZIPs masquerading as skills/agents. Human-readable error messages mapped client-side from machine codes. Cross-source naming: Store entity-id-prefixed dirs (`plugins/store-<id>/`) plus the bundle (`plugins/store-bundle/`) avoid collisions with admin marketplaces (whose `store` slug is reserved by `is_valid_slug`). 
Bundle composition is content-hashed at serve time — install/uninstall or owner re-upload bumps the bundle's plugin.json `version`, so Claude Code's auto-update toggle picks up changes. Tests: 50+ new tests across naming, repositories, filter (admin ∪ store ∪ bundle), API (upload/install/uninstall/delete/preview/docs), end-to-end marketplace.zip with bundle merging.	2026-05-05 02:53:49 +02:00
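The composition rule `(admin_granted ∖ opt_outs) ∪ store_installs` maps directly onto set operations. The function and plugin names below are illustrative only.

```python
def served_plugins(admin_granted, opt_outs, store_installs):
    """Admin grants minus personal opt-outs, plus whatever the user installed."""
    return (set(admin_granted) - set(opt_outs)) | set(store_installs)


stack = served_plugins(
    admin_granted={"sql-helper", "dashboards"},
    opt_outs={"dashboards"},
    store_installs={"agnes-store-bundle"},
)
```

An opt-out only subtracts from admin grants; a Store install of the same name would still be served, because the union is applied last.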
ZdenekSrotyr	a621a415cc	fix(health): session-pipeline staleness check (#176 ) GET /api/health/detailed now returns a session_pipeline service entry. Heuristic: max(mtime of /data/user_sessions/*/.jsonl) <= max(processed_at in session_extraction_state) + grace_seconds grace_seconds = 2 × verification-detector cadence (default 30 min; configurable via SCHEDULER_VERIFICATION_DETECTOR_INTERVAL). When the assert fails, status='warning' (never 'error') with an actionable detail pointing at the verification-detector scheduler job. A warning bubbles up to the existing overall='degraded' aggregation — operators querying /api/health/detailed (or /agnes diagnose system) get a clear breadcrumb instead of a silently-broken pipeline. Cold-start case (no session files, or files newer than the grace window with empty state table) is handled explicitly to avoid noise on a fresh deploy. Tests: tests/test_health_session_pipeline.py.	2026-05-05 00:04:28 +02:00
ZdenekSrotyr	c53c1e1572	fix(ui): admin pending-review banner on /corporate-memory (#176 ) The /corporate-memory page filters status IN ('approved','mandatory') and showed no hint that pending items exist. With approval_mode set to 'review_queue' (the default in instance.yaml.example), every collection run would silently funnel new items into the pending bucket where no operator ever saw them. For admins (is_km_admin), the page now renders a banner above the stats bar: N pending items awaiting review — review them at /corporate-memory/admin Non-admins see no change (the route zeroes the count server-side before passing to the template, so the hint is never leaked). Tests: tests/test_corporate_memory_page.py.	2026-05-05 00:01:22 +02:00
ZdenekSrotyr	45de71e8ab	fix(scheduler): wire LLM pipeline into scheduler-v2 (#176 ) The session-collector, verification-detector, and corporate-memory services now run on the same scheduler-v2 model that already drives data-refresh, health-check, script-runner, and marketplaces: - New admin endpoints in app/api/admin.py: POST /api/admin/run-session-collector POST /api/admin/run-verification-detector POST /api/admin/run-corporate-memory All admin-gated, sync-def (FastAPI thread pool), with one audit row per invocation. Same single-writer-of-system.duckdb pattern as the existing /api/marketplaces/sync-all job. - services/scheduler/__main__.py JOBS gains three entries with offset cadences (10m / 15m / 17m, all coprime modulo the 30s tick) so the three LLM-backed jobs don't fire on the same tick and stack their API + DB load. - The verification-detector endpoint surfaces the LLM factory's fail-fast ValueError as HTTP 500 with the actionable message, preserving the no-silent-skip contract from the previous commit. Tests: - tests/test_admin_run_endpoints.py covers admin gating + scheduler registration + endpoint contract. - tests/test_scheduler_sidecar.py existing tests continue to pass.	2026-05-04 23:57:43 +02:00
ZdenekSrotyr	bbb04ac041	fix(setup): seed default ai: block + env-var fallback (#176 ) POST /api/admin/configure now writes a default ai: block into the instance.yaml overlay when the request leaves it untouched and either ANTHROPIC_API_KEY or LLM_API_KEY is set in the environment. The block references the env var via ${VAR} syntax — secrets never land in YAML. connectors.llm.factory grows create_extractor_from_env_or_config which falls back to ANTHROPIC_API_KEY / LLM_API_KEY when ai_config is empty and raises a clear ValueError when neither is available. Both services/corporate_memory and services/verification_detector switch to the new helper, replacing the old 'silently skip when ai: missing' path that was the silent-failure root cause. Tests: - tests/test_setup_ai_block.py — overlay seeding contract. - tests/test_llm_provider_env_fallback.py — fallback + fail-fast.	2026-05-04 23:55:19 +02:00
ZdenekSrotyr	5915f92eaa	fix(query-guardrail): single-pass alternation regex (Devin Review on query.py:464) The iterative bare-name rewriter (one re.sub per name, longest-first) was vulnerable to cross-contamination when the GCP project ID contained a registered table name as a hyphen-delimited word. Concrete repro: project = 'my-ue-project' registered = ['orders', 'ue'] user SQL = 'SELECT * FROM orders JOIN ue ON ...' iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...' iter 2 (ue): '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen creates word boundary on both sides) — corrupts the iter-1 path Fallback at query.py:576 caught the resulting BQ parse error and fell back to per-table SELECT * estimate, so impact was over-estimation, not fail-open — but the #171 partition-pruning fix silently degraded to pre-fix behavior whenever a project name shared a hyphen-segment with a registered table. Fix: single re.sub call with an alternation regex sorted longest-first. Single-pass means each source position is processed exactly once, so freshly-inserted backticked text from one match isn't re-scanned by later names in the alternation. Regression test test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name covers the exact Devin repro.	2026-05-04 22:51:33 +02:00
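The single-pass fix can be sketched with the repro from this commit. The helper name `rewrite_bare_names` and the `fin.ue` path are assumptions for illustration; the real rewriter also handles `bq."<ds>"."<tbl>"` forms, which this sketch omits.

```python
import re


def rewrite_bare_names(sql, mapping):
    """Replace registered bare names with backticked paths in one re.sub pass."""
    if not mapping:
        return sql
    # Longest-first so a longer name wins over any shorter name it starts with.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lowered = {name.lower(): path for name, path in mapping.items()}
    # re.sub never rescans replacement text, so inserted paths stay untouched.
    return pattern.sub(lambda m: "`" + lowered[m.group(0).lower()] + "`", sql)


mapping = {
    "orders": "my-ue-project.fin.orders",
    "ue": "my-ue-project.fin.ue",
}
rewritten = rewrite_bare_names("SELECT * FROM orders JOIN ue USING (id)", mapping)
```

With one `re.sub` call per name, the second pass would match `ue` inside the already-inserted `my-ue-project`; the alternation avoids that because each source position is visited exactly once.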
ZdenekSrotyr	424ec9b0f4	refactor(install.html): single tile, single PAT-mint body shape Drops the `<nav class="role-tiles">` block (Analyst / Admin tiles), the `_show_admin_tile` flag, the `const ROLE = {{ role \| tojson }};` JS line, and the role-aware PAT-mint ternary. The setupNewClaude button now mints a uniform PAT for everyone: { name: defaultTokenName(), expires_in_days: 90 } …against the existing `POST /auth/tokens` endpoint. No new endpoint, no role-locked TTL clamp. The `bootstrap-analyst` 1-hour scope is no longer used from /setup (it broke the install flow anyway — saved PATs expired before the user opened Claude Code; tracked as a separate cleanup issue). Also removes the now-unused `.role-tiles` / `.role-tile` CSS rules so the stylesheet doesn't carry dead selectors. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 6.	2026-05-04 22:18:00 +02:00
ZdenekSrotyr	2ee529533f	refactor(setup-page): drop role query param The `/setup` route no longer accepts `?role=analyst\|admin`. The route signature drops the `Literal[...] = Query(...)` parameter and the silent admin-downgrade block (`if role == "admin" and not is_admin: role = "analyst"`). The `role` ctx variable threaded into install.html also goes away — Task 6 cleans up the template's role-tile UI and the JS PAT-mint ternary. `?role=` is silently ignored by FastAPI for unknown query params, so existing bookmarks (none in production — the param was added in this PR and never shipped) just degrade to the unified layout. No RedirectResponse shim needed. Tests: drop the entire `tests/test_setup_page_roles.py` file (eight role-branching tests that no longer apply) and add `tests/test_setup_page_unified.py` with four tests: - `test_setup_page_renders_unified_layout` - `test_setup_page_ignores_role_query_param` - `test_setup_page_renders_marketplace_for_user_with_grants` - `test_install_legacy_path_redirects_to_setup` Also replace the role-aware `test_install_preview_*` tests in test_web_ui.py with unified-layout assertions. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 5.	2026-05-04 22:16:59 +02:00
ZdenekSrotyr	291079b1d2	refactor(welcome-template): drop role param; resolve plugins per-user unconditionally Removes the `role: Literal["analyst", "admin"] = "admin"` parameter from `compute_default_agent_prompt`. The same RBAC pass (`marketplace_filter.resolve_allowed_plugins`) now runs for every user — admin or not. Users with no `resource_grants` rows get the no-marketplace layout; users with grants get the marketplace block inserted. Admin-vs-analyst is no longer a layout branch. `render_agent_prompt_banner` no longer derives a `role` from `user.is_admin`; it just delegates to `compute_default_agent_prompt`. Two `compute_default_agent_prompt(...role=role)` call sites in `app/web/router.py::setup_page` are updated to drop the keyword so the route keeps rendering — Task 5 will remove the `?role=` query parameter and the silent admin-downgrade block from the route signature itself. Tests: drop role-aware assertions from test_welcome_template_renderer and test_welcome_template_api. Both files now assert the unified default contains `agnes init` + `uv tool install` and bans the legacy `agnes auth import-token` / `agnes auth whoami` verbs. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 4.	2026-05-04 22:13:46 +02:00
ZdenekSrotyr	74b7f6e254	feat(setup-instructions): preflight checks both git and claude Renames `_git_check_block` to `_preflight_block` and adds a `claude --version` check beside `git --version`. Both binaries are required by the marketplace step — git for the clone fallback, claude for `claude plugin marketplace add` / `claude plugin install` — so checking them together gives one clear failure instead of two confusing downstream errors. Install hints: `npm i -g @anthropic-ai/claude-code` for Linux / WSL plus a doc URL (https://docs.claude.com/claude-code) for the native macOS / Windows installers. We don't try to one-line a native installer; the canonical instructions live upstream. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 3.	2026-05-04 22:11:38 +02:00
ZdenekSrotyr	e16698c3cc	refactor(setup-instructions): unified layout with mandatory agnes init Adds `_step_numbers(*, has_marketplace, has_skills)` so step numbering lives in one place instead of being split across three branches in `resolve_lines`. Pins the unified layout in the tests: No plugins: 1 install, 2 init, 3 catalog, 4 diagnose, 5 skills, 6 confirm With plugins: 1, 2, 3, 4 preflight, 5 marketplace, 6 diagnose, 7 skills, 8 confirm `agnes auth import-token` / `agnes auth whoami` are now banned from the rendered prompt — `agnes init` subsumes them. The renamed `test_resolve_lines_no_plugins_unified_six_step_layout` asserts those strings are absent and that the new step headers (`Bootstrap your Agnes workspace`, `Verify the data is queryable`) are present. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 2.	2026-05-04 22:10:05 +02:00
ZdenekSrotyr	9334beed15	refactor(setup-instructions): drop role param; collapse analyst/admin into one layout Removes the `role: Literal["analyst", "admin"]` parameter from `resolve_lines` / `render_setup_instructions` and deletes the `_resolve_analyst_lines`, `_analyst_init_lines`, `_analyst_finale_lines` helpers. The unified flow now always emits `agnes init` (the workspace-rails delivery mechanism) in place of the legacy `agnes auth import-token` + `agnes auth whoami` pair, and uses `agnes catalog` as the smoke-verify step. `agnes init` already verifies the PAT internally, and `agnes catalog` doubles as a data-plane smoke check, so dropping `agnes auth whoami` costs no signal. Drops the now-redundant `tests/test_setup_instructions_analyst.py` and patches the one ordering test in `tests/test_setup_instructions.py` that referenced the old "Log in" / "Verify the login" headers. Also strips the `role=role` kwarg from `compute_default_agent_prompt`'s call into `resolve_lines` so the welcome-template render path keeps working; welcome_template.py's own role param is removed in a follow-up task. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 1.	2026-05-04 22:08:48 +02:00
ZdenekSrotyr	103efb69f0	chore(cli-rename): replace stale `da` verbs in active code paths Bring admin UI, audit-log messages, code comments, and analyst-facing skill docs in line with the post-bootstrap CLI surface (`agnes pull`, `agnes push`, `agnes init`, `agnes snapshot create`). The legacy `_LEGACY_STRINGS` detection tuple in `app/api/claude_md.py` and the hook upgrade markers in `cli/lib/hooks.py` are intentionally left as-is — they exist precisely to flag pre-rewrite content for re-authoring. Strip "(folded from `da metrics list`)" / "(lifted from `da metrics show`)" / "Replaces the old `da analyst status`" docstring noise — the rename history is in CHANGELOG.md, not in module docstrings.	2026-05-04 21:10:43 +02:00
ZdenekSrotyr	500db8cd3c	fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171 ) Closes #171. The /api/query cost guardrail used to dry-run a synthetic `SELECT * FROM <table>` for each registered remote-BQ row referenced by the user SQL — which made BigQuery estimate a full table scan, with column projection, predicate pushdown, and partition pruning all disabled. Narrow queries on big partitioned/clustered tables (the documented happy path for `agnes query --remote`) hit ~30,000× over-estimates and got rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Pavel's report on #171 traced the root cause and proposed the fix: rewrite the user SQL to BQ-native syntax and dry-run it as a single job, exactly the way `bq query --dry_run` works. Implementation: - New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered names (word-boundary, case-insensitive, longest-first to avoid prefix collisions) + bq."<ds>"."<tbl>" forms to backticked `<project>.<ds>.<tbl>` paths. - _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap check uses the real estimate. - Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only syntax like ::INT casts), the guard falls back to the pre-fix per-table SELECT * approach so non-portable queries still get a (loose) cap estimate instead of fail-opening. Non-parse BQ errors (forbidden, upstream) still propagate as 502. - _bq_guardrail_inputs now also returns name_lookups so the rewriter has the (registered_name, bucket, source_table) mapping it needs. - Per-table breakdown is unavailable from a composite dry-run; total bytes are pinned to dry_run_set[0] for the post-flight record_bytes(sum(...)) call to keep returning the right total. 
Tests (7 new, 3 existing still pass): - dry-run receives rewritten user SQL with WHERE clause intact (the load-bearing assertion for #171) - single dry-run per request even with multiple registered tables (JOIN, UNION) referenced - fallback to per-table SELECT * on bq_bad_request - non-parse BQ errors (forbidden) still 502 - rewriter unit tests: bare + bq.path in same SQL, longest-name-wins on prefix collision, case-insensitive bare-name match	2026-05-04 21:08:21 +02:00
ZdenekSrotyr	e438170ade	merge: pull #174 (BQ materialize view fix + concurrency, 0.33.0) into bootstrap branch Brings in zs/materialize-sync-fix (PR #174): - BigQuery view materialize works (wrap admin SQL in bigquery_query()) - Per-table mutex + fcntl.flock for concurrent COPY corruption - Cost guardrail dry-run engages on materialized rows - Schema v23 -> v24 migration: rewrite source_query to BQ-native - Server-generated trivial source_query from bucket+source_table - Validator backtick relaxation for materialized rows - 0.33.0 release cut Conflict resolution: - CHANGELOG.md: keep our [Unreleased] (bootstrap rewrite content) ABOVE the new [0.33.0] section from #174. The bootstrap rewrite remains unreleased; it'll cut 0.34.0 (or later) when this PR merges to main. - tests/conftest.py: union — keep our analyst-bootstrap fixture re-export AND #174's bq_instance / stub_bq_extractor fixtures. - pyproject.toml auto-merged to 0.33.0 (matches the cut), correct. - src/db.py auto-merged: SCHEMA_VERSION = 24, _v23_to_v24_finalize added — no overlap with our work which left schema at v23. - CLAUDE.md auto-merged: schema-history paragraph extended with v24. Verified: 79/79 across CLI bootstrap suite + materialize suite + schema v24 migration tests pass locally on Python 3.13/macOS.	2026-05-04 20:53:00 +02:00
ZdenekSrotyr	92d477e422	fix(setup): default /setup to analyst, hide admin tile from non-admins Three coupled UX fixes for the analyst-onboarding flow: 1. Dashboard "Setup a new Claude Code" CTA was rendering admin paste prompt for everyone (analysts couldn't actually execute the marketplace plugin install / skills setup steps). render_agent_prompt_banner now picks role based on user.is_admin — analysts get the analyst flow. 2. /setup default role changed from admin to analyst. Most visitors are analysts; admin layout is opt-in via the admin tile or ?role=admin. 3. Admin tile is admin-only on the role-tile nav. Non-admins see only the analyst tile. Server-side: non-admin requesting ?role=admin is silently downgraded to analyst (otherwise they'd see admin paste prompt despite no tile). Tests: - New: test_setup_page_admin_tile_hidden_for_non_admin (anonymous client can't see "Admin CLI" or role=admin link) - New: test_setup_page_admin_role_downgraded_for_non_admin (anonymous ?role=admin → analyst layout, no marketplace step in clipboard) - New: test_install_preview_default_role_is_analyst (admin signing in to bare /setup gets analyst clipboard by default) - Renamed: test_setup_page_default_role_is_admin → ..._is_analyst - Updated: test_setup_page_admin_clipboard_renders_admin_layout uses FastAPI dependency_overrides to inject admin user (admin layout is now admin-gated) - Updated: test_install_preview_visible_for_signed_in_user explicitly passes ?role=admin to exercise admin layout	2026-05-04 20:20:37 +02:00
ZdenekSrotyr	3d58768143	fix: address Devin Review findings — incomplete renames + estimate guard 13 Devin findings across 10 files: 🔴 Critical: - app/api/v2_catalog.py:42 — `_fetch_hint` returns `da fetch` in /api/v2/catalog responses (user-visible in every catalog list) - cli/skills/agnes-data-querying.md — 11 stale `da fetch`/`da sync` refs in the bundled skill markdown - config/claude_md_template.txt:38 — referenced `agnes pull --docs-only` flag that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/ --dry-run) 🟡 Important: - app/api/admin.py:252 — `da fetch` in bq_max_scan_bytes hint - cli/commands/auth.py:119 — `da sync` in import-token docstring (--help text) - cli/commands/tokens.py:48 — "Export it so `da` can use it" prose - ARCHITECTURE.md — 4 stale rows in CLI commands table - README.md — stale paragraphs for analysts (da sync, da analyst setup) 🚩 Substantive observations addressed: - app/api/query.py:249,302,489 — server-side error/help strings still said `da sync`/`da fetch` (returned in API responses to clients) - cli/commands/snapshot.py:235-241 — DuckDB existence guard incorrectly blocked `--estimate` (server-side dry-run that never opens local DB). Added test ensuring estimate path skips the guard. Skipped (intentionally historical): - app/api/admin.py:2377,2429,2437 — historical comments describing past manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.	2026-05-04 20:05:06 +02:00
ZdenekSrotyr	5bffec641f	chore(lint): final ruff fixes	2026-05-04 19:32:52 +02:00
ZdenekSrotyr	6c0846fd17	feat(config): expose materialize.lock_ttl_seconds in server-config New top-level 'materialize' section, single field (lock_ttl_seconds). Default 86400 (24h). Backs the file-lock TTL reclaim added in the per-table-mutex change. Editable via PUT /api/admin/server-config and the /admin/server-config UI.	2026-05-04 18:52:54 +02:00
ZdenekSrotyr	3871d5320a	feat(admin): server-generate materialized source_query, allow BQ backticks When admin registers a materialized BQ row with bucket+source_table but no source_query, the server generates 'SELECT * FROM `<project>.<ds>.<tbl>`' from instance.yaml's configured BQ project. Same fallback fires on PUT when flipping to materialized. The backtick rejection guard, which was appropriate for DuckDB-flavor source_query, is relaxed for materialized rows since the new wrapping path (Task 2) runs admin SQL through BQ jobs API which uses BQ-native syntax (backticks for dashed identifiers).	2026-05-04 18:37:27 +02:00
ZdenekSrotyr	c7c42de0f0	feat(sync): treat MaterializeInFlightError as 'skipped, in_flight' _run_materialized_pass distinguishes due-check skips from in-flight skips and never calls state.set_error for either. summary['skipped'] becomes a list of {table, reason} dicts; the end-of-pass log line breaks out the in_flight subcount. Hoists is_table_due to module-level import so test monkeypatching of the symbol intercepts the call (the previous local import made patches a no-op).	2026-05-04 18:11:38 +02:00
ZdenekSrotyr	a92c624dba	feat(admin): yellow banner for legacy CLI verbs in workspace-prompt override	2026-05-04 17:46:50 +02:00
ZdenekSrotyr	8091620d33	fix(setup): role-aware clipboard render + JSON-escape ROLE injection Two Task 4 review fixes for app/web/templates/install.html: 1. JSON-escape `ROLE` JS const via `{{ role | tojson }}` (defense in depth: removes the dependency on Jinja autoescape semantics for JS contexts; FastAPI's Literal validator already constrains role values). 2. Verify the analyst tile's clipboard payload is the analyst layout. The pre-existing role-aware plumbing (compute_default_agent_prompt threading role into setup_instructions_lines, picked up by the JS SETUP_INSTRUCTIONS_TEMPLATE array) was correct; adding regression tests that pin to the JS clipboard block specifically so a future inversion would fail loudly. Tests: analyst clipboard contains `agnes init` + `agnes catalog` and NOT `agnes auth import-token` / `agnes skills`; admin clipboard is the inverse. Plus an explicit assertion that ROLE is rendered via tojson.	2026-05-04 17:43:46 +02:00
ZdenekSrotyr	7965f8021d	fix(setup): role-aware PAT scope+TTL in setupNewClaude JS (Task 4 spec fix)	2026-05-04 17:34:30 +02:00
ZdenekSrotyr	f731ee7897	feat(setup): /setup?role=analyst|admin branching with role tiles	2026-05-04 17:28:47 +02:00
ZdenekSrotyr	54f83c281c	test(setup): I1+I2 review fixes — AGNES_WORKSPACE.md alignment + step-number pin	2026-05-04 17:23:15 +02:00
ZdenekSrotyr	ae00945cbf	fix(setup): clean stale 'da' refs in setup_instructions.py (Task 0.5 missed sweep)	2026-05-04 17:19:55 +02:00
ZdenekSrotyr	29e28ccbd3	feat(setup): add analyst role to install-prompt renderer	2026-05-04 17:17:59 +02:00
ZdenekSrotyr	59324f9361	feat(admin): scan CLAUDE.md override for legacy strings	2026-05-04 17:10:58 +02:00
ZdenekSrotyr	4ee7323436	feat(tokens): add scope + ttl_seconds fields with bootstrap-analyst clamp	2026-05-04 17:00:54 +02:00
ZdenekSrotyr	1563b05f2e	refactor(cli): hard-cutover env vars + config dir to AGNES_* Task 0.5 of clean-analyst-bootstrap. Greenfield rewrite — no fallback, no aliases. Existing dev environments lose their cached PAT and must re-authenticate. Env var renames (hard cutover): - DA_CONFIG_DIR -> AGNES_CONFIG_DIR - DA_SERVER -> AGNES_SERVER - DA_SERVER_URL -> AGNES_SERVER_URL (test-only stale ref, not in spec) - DA_NO_UPDATE_CHECK -> AGNES_NO_UPDATE_CHECK - DA_LOCAL_DIR -> AGNES_LOCAL_DIR - DA_TOKEN -> AGNES_TOKEN - DA_STREAM_RETRIES -> AGNES_STREAM_RETRIES Config dir rename: ~/.config/da/ -> ~/.config/agnes/ (across code, comments, docstrings, error messages, install templates, dev scripts). Stale `da X` references in CLI source (and adjacent app/, tests/): swept docstrings, comments, help text, and error messages where the verb survives the rewrite (init, pull, push, catalog, status, diagnose, auth, admin, skills, query, schema, describe, explore, disk-info, snapshot, login, logout, whoami, server, setup) and replaced `da X` with `agnes X`. Intentionally kept `da sync`, `da fetch`, `da analyst`, `da metrics` — those verbs are removed in later tasks; the legacy strings will be detected by `_LEGACY_STRINGS` (added in Task 2). Test fixes: - TestCLIVersion now asserts output starts with `agnes ` (was `da `). Test results: 2675 passed, 25 skipped (full pytest run, excluding 9 pre-existing test_db.py / test_user_management.py / test_e2e_extract.py / test_cli_binary_rename.py failures unrelated to this rename).	2026-05-04 16:35:44 +02:00
ZdenekSrotyr	4bd1919f77	fix(query): #168 review iter 5 — forbidden-table check uses registry IDs Devin Review iter #5 flagged a pre-existing class of name/id mismatch in app/api/query.py:131-136 — the SAME root cause as the bq.* RBAC issue I fixed in iter #3 (line 332/362). Devin called it out as "NOT introduced by this PR" / "might merit follow-up", but it's exactly the same security-boundary pattern this PR is hardening, so fixing here keeps the RBAC story consistent across the handler. The `forbidden = all_views - set(allowed)` comparison mixed types: - `all_views` carries DuckDB master view names (= registry display `name` from the orchestrator's CREATE VIEW) - `set(allowed)` carries registry IDs (resource_grants.resource_id) When `id != name` (e.g. id="bq.finance.ue", name="ue"), authorized users got spurious 403s — the view name landed in `forbidden` even though the caller had a valid grant on the registry id. Build a name->id map from the registry, then the forbidden check compares apples to apples: allowed_view_names = {r["name"] for r in registry_rows if r.get("name") and r.get("id") in allowed_ids} forbidden = all_views - allowed_view_names 107 affected tests pass; 487 pass in wider RBAC/query/access/admin domain — no regressions.	2026-05-04 14:18:43 +02:00
ZdenekSrotyr	28aba4c1f9	fix(query): #168 review iter 3 — RBAC name-vs-id, placeholder dead code Devin Review iter #3 found 3 new real bugs after iter #2's fixes landed. 🔴 RBAC check at app/api/query.py:362 used `row["name"]` against `accessible_set`, but `accessible_set` is keyed by registry IDs (`get_accessible_tables` returns `resource_grants.resource_id` — table IDs, not display names). Confirmed by `_table_blocks` projection at `app/resource_types.py:157-158`. When `id != name` (e.g. `id="bq.finance.ue", name="ue"`), non-admin users with valid grants got 403 `bq_path_access_denied`. Switch to `row["id"]`. 🚩 Bare-name pass at app/api/query.py:332 had the same name-vs-id mismatch (different impact): legitimate accessible rows were skipped from `dry_run_set`, so the cost guardrail under-counted scan bytes for non-admin users. Could let an over-cap query through and under-bill quota. Switch to `row_id` comparison. 🟡 `placeholder_from` for billing_project was dead code. `_BQ_OPTIONAL_FIELD_DEFAULTS["billing_project"] = ""` seeded an empty string into every GET payload via `_ensure_bq_optional_fields`. JS `isUnset = (value === undefined)` evaluated False, so the `(defaults to <project>)` placeholder NEVER rendered. Drop the seed — field stays in `known_fields` (UI sees it) but routes through the unset rendering path on GET, where placeholder_from fires. Tests: test_get_surfaces_bq_fields_even_when_unset assertion flipped from "billing_project IS present" to "billing_project NOT auto-seeded" to lock in the new shape. 67 affected tests pass.	2026-05-04 13:51:36 +02:00
ZdenekSrotyr	5eaa449fcc	fix(query): #168 review iter 2 — quota user_id parity + concurrent-slot 429 Devin Review iter #2 found 2 new issues (after iter #1's 5 fixes landed). Both real, both addressed. 🔴 Quota user_id key mismatch defeated shared daily budget. /api/query computed `user.get("id") or user.get("email")` while /api/v2/scan uses `user.get("email") or "anon"` (app/api/v2_scan.py:327). Same user → two different keys in the singleton QuotaTracker. BQ bytes consumed via /api/query were tracked under UUID; via /api/v2/scan under email; the `check_daily_budget` pre-flight on either endpoint never saw the other's recorded bytes — per-user cap was effectively doubled. Match v2/scan's email-first ordering. 🟡 QuotaExceededError(KIND_CONCURRENT) → 400 instead of 429. `quota.acquire(user_id)` raises this from __enter__ when the per-user concurrent-scan slot is at cap. The exception propagated through the @contextlib.contextmanager generator, the caller's `with guard:` block, and was caught by execute_query's generic `except Exception` handler → mapped to 400 with a flattened "Query error: concurrent_scans: N/M" string, dropping the typed retry_after_seconds field. Wrap the `with quota.acquire(...)` in a try/except QuotaExceededError that maps to 429 with the same typed-detail shape used for the daily-budget rejection — consistent with /api/v2/scan:392-402. Tests: test_api_query_quota.py user_id strings updated to "admin@test.com" (the seeded_app admin's email) to match the new email-first ordering. 40 affected tests pass.	2026-05-04 13:38:31 +02:00
ZdenekSrotyr	1263b80726	fix(query): #168 review — concurrent-slot wraps execute, doc/JS fixes Devin Review on PR #168 found 5 issues — all real, all addressed. 🚩 ANALYSIS_001 (architectural): concurrent-slot guard didn't protect actual BQ query execution. Earlier `_enforce_remote_bq_quota_and_cap` ran dry-run + cap check inside `with quota.acquire(user_id):`, then returned — releasing the slot BEFORE `analytics.execute(...)` ran. Spec §4.3.3 explicitly designs the slot to wrap execute so the per-user concurrent cap limits BQ scans, not just dry-runs. Refactor to a context manager `_bq_quota_and_cap_guard`. Caller's `with` block now holds the slot through dry-run, cap check, the actual `analytics.execute(...)` (which is what triggers the BQ scan when DuckDB resolves the master view), AND the post-flight record_bytes. Slot released only when caller's `with` body exits. 🟡 BUG_001: placeholder JS walked `original` (full GET payload root) instead of `original.sections`. `placeholder_from: ["data_source", "bigquery", "project"]` is a section-relative path, so billing_project placeholder NEVER rendered. Fix: walk `original.sections` (with fallback to `original` for safety). 🟡 BUG_002 + BUG_003: admin_tables.html register and edit modals' operator help text referenced `max_bytes_per_remote_query` (the old name from the spec) but the actual config key is `bq_max_scan_bytes` after the fix-up commit `6423888d` moved it. Replace both occurrences. 🟡 BUG_004: CHANGELOG entry said `api.query.bq_max_scan_bytes` (the old path) but the read at app/api/query.py:53 is `get_value("data_source", "bigquery", "bq_max_scan_bytes", ...)`. An operator who set it under `api.query` in their yaml would have no effect. Correct path in CHANGELOG. All 95 #160-affected tests pass after the changes.	2026-05-04 13:28:03 +02:00
ZdenekSrotyr	6423888d02	fix(query): #160 move bq_max_scan_bytes to data_source.bigquery (UI editable) E2E test on dev VM revealed: spec said "configurable via /admin/server-config" for the cost guardrail cap, but the underlying read path was `api.query.bq_max_scan_bytes` and `api` is NOT in `_EDITABLE_SECTIONS`. POST to /admin/server-config rejected `{"sections":{"api":...}}` as "unknown section(s): api" — the cap was only adjustable via direct YAML edit. Move to `data_source.bigquery.bq_max_scan_bytes`: - `_default_remote_query_cap_bytes()` reads from the new path. - Add to `_OPTIONAL_FIELDS["data_source"]["bigquery"]["fields"]` with the same shape as `max_bytes_per_materialize` (kind=int, default 5 GiB, hint). - Add to `_BQ_OPTIONAL_FIELD_DEFAULTS` so it surfaces in the GET payload even when YAML omits it. Convention now mirrors `max_bytes_per_materialize` — both BQ cost guardrails live under `data_source.bigquery`, both editable in the UI.	2026-05-04 12:46:38 +02:00
ZdenekSrotyr	39bdc1ff45	feat(admin): #160 BQ test-connection endpoint + billing_project placeholder UI Closes the operator-side half of the reporter's loop. The CLI fix in the previous commit makes USER_PROJECT_DENIED errors readable to analysts; this commit lets admins verify reachability proactively from /admin/server-config without waiting for analyst reports. New endpoint POST /api/admin/bigquery/test-connection (app/api/admin_bigquery_test.py, ~110 LOC): - Depends(require_admin); registered in app/main.py. - Builds BqAccess via existing get_bq_access(), runs `SELECT 1 AS ok` with a 10s polling timeout. - 200 with {ok, billing_project, data_project, elapsed_ms} on success. - 400 for `BqAccessError(not_configured)` (operator config issue). - 502 for any other typed BqAccessError or unknown upstream exception. - 504 for concurrent.futures.TimeoutError; best-effort cancel_job invoked (BQ-side cancel may still run; documented caveat). Server-config placeholder (app/api/admin.py + admin_server_config.html): - `data_source.bigquery.billing_project` field-spec gains `placeholder_from: ["data_source", "bigquery", "project"]`. - renderLeafInput's text branch reads `opts.spec.placeholder_from`, walks the loaded `original` config dict, injects `placeholder="(defaults to <project>)"` into the input HTML at construction time. Admin sees the access.py:339-340 fallback rule visible directly in the UI without reading source. UI button: - "Test BigQuery connection" button next to data_source's Save button. - onTestBigQuery() POSTs to the endpoint, renders structured result inline (green check + elapsed_ms on success; red kind + hint on failure). Tests: 6 endpoint cases + 1 placeholder payload test = 7 GREEN. 62 total across the affected admin server-config test files.	2026-05-04 10:31:35 +02:00
ZdenekSrotyr	77cdb65f76	sec(query): #160 BQ_PATH catches quoted "bq" catalog token (Phase 3 review) Phase 3 review identified an RBAC + cost-cap bypass: `SELECT * FROM "bq"."ds"."tbl"` (catalog token quoted as a DuckDB identifier) was NOT matched by the BQ_PATH regex, so direct quoted-form references skipped both the registry check and the cost-cap dry-run. DuckDB resolves `"bq"` to the same ATTACHed BQ catalog, so the bypass is real. Widen the catalog-token alternation: `(?:"bq"|bq)` matches both forms. Negative lookbehind `(?<![\w.])` still rejects look-alike prefixes (`other_bq`, `my_bq`); the new "my_bq".ds.tbl negative test locks that in alongside `other_bq.ds.tbl`. Tests: - 2 new positive cases in tests/test_query_bq_regex.py for the quoted form (`"bq"."finance"."ue"` and uppercase `"BQ"."ds"."tbl"`). - 1 new negative case rejecting `"my_bq".ds.tbl` so the quoted-form widening doesn't open a different evasion. - 1 new RBAC test in tests/test_api_query_rbac_bq_path.py: admin hitting an unregistered quoted path returns the same bq_path_not_registered 403 as the unquoted form. All 33 Phase 3 tests pass after the fix.	2026-05-04 10:31:35 +02:00
ZdenekSrotyr	896c43c7a2	feat(query): #160 cost guardrail + bq.* RBAC + quota integration on /api/query The headline implementation for issue #160. POST /api/query now gates direct `bq."<dataset>"."<source_table>"` references behind the registry and bounds the BQ scan cost behind a configurable cap. Wired through the same singleton QuotaTracker as /api/v2/scan so daily-byte budgets are shared across both BQ-touching paths. Changes in app/api/query.py: - Add module-level `BQ_PATH` regex matching the 16 syntax variants verified empirically (fully-quoted, unquoted, mixed quoting, case-insensitive, inside CTE bodies, multi-path, …). - Add `bigquery_query` to the SQL keyword blocklist. Closes the pre-existing function-call backdoor where a user could run an arbitrary BQ jobs API call against any reachable dataset, bypassing the registry and RBAC. Wrap views internal to the BQ extractor still use bigquery_query() — but those run via DuckDB view resolution at query time, not via user-submitted SQL, so the blocklist doesn't break them. - Add `_bq_guardrail_inputs` helper: walks user SQL twice — once for bare-name matches against accessible registered remote-BQ names (contributes to dry_run_set), once for direct `bq.X.Y` matches (gated against `find_by_bq_path` lookups, returns 403 with structured detail on miss or grant violation). - Add `_enforce_remote_bq_quota_and_cap` helper: pre-flight `check_daily_budget` (over-cap → 429), then `with quota.acquire(...)` wraps a per-path BQ dry-run, sums bytes, raises 400 `remote_scan_too_large` when total > cap. - Cap default 5 GiB; configurable via `api.query.bq_max_scan_bytes` in /admin/server-config (next phase wires the UI). - Post-flight `record_bytes` against the user's daily counter. - Module-level imports of `_bq_dry_run_bytes`, `_build_quota_tracker`, `get_bq_access` so tests can monkeypatch via `app.api.query.<name>`. 
Tests: - All 23 RED tests from the previous commit now pass (regex matrix, blocklist with detail-string assertion, RBAC unregistered/admin-bypass, guardrail dry-run-called/over-cap-rejected, quota pre-flight 429). - mock_dry_run fixture stubs both `_bq_dry_run_bytes` and `get_bq_access` so guardrail tests don't require a live BQ project. - Quota test uses `admin1` (the seeded_app fixture's actual user id, not `admin`). Smoke: 887 passed across query/bq/admin/extractor/registry/quota domains. No regressions.	2026-05-04 10:31:35 +02:00
ZdenekSrotyr	e44d2280e5	refactor(quota): #160 relocate _build_quota_tracker to v2_quota.py The /api/query cost guardrail (next phase) needs the same singleton QuotaTracker so its daily-byte and concurrent-slot caps accumulate across both /api/v2/scan and /api/query BQ-touching paths. Move `_build_quota_tracker`, `_quota_singleton`, and `_quota_init_lock` from app/api/v2_scan.py to app/api/v2_quota.py (the natural home; the factory uses QuotaTracker which already lives there). Re-export the function from v2_scan.py so the 7 test sites at tests/test_v2_scan.py (lines 77, 118, 143, 160, 186, 208, 250) keep working without edits. Crucially do NOT re-export `_quota_singleton` from v2_scan.py — Python `from X import var` copies the binding at import time, so a re-exported singleton would freeze at the initial None and never observe the in-place mutation done inside `_build_quota_tracker()`. Re-export only the function (which always reads the live module-global through `global`). Mechanical refactor; no behavior change. 30 quota-related tests pass.	2026-05-04 10:31:35 +02:00
ZdenekSrotyr	9d0e4e687d	refactor(bq): #160 remove legacy_wrap_views config knob (always-wrap) Now that VIEW/MATERIALIZED_VIEW always wrap via bigquery_query() (the prior `legacy_wrap_views=True` branch behavior, made unconditional in the previous commit), the toggle has no semantic meaning and is removed across the codebase. Production code: - app/api/admin.py: drop the field from _OPTIONAL_FIELDS["data_source"] ["bigquery"]["fields"] and from _BQ_OPTIONAL_FIELD_DEFAULTS, plus the comment block above the defaults dict. - config/instance.yaml.example: drop the example snippet. - src/orchestrator.py: update the inner-objects skip-branch comment to reflect the new BQ behavior (the skip itself stays — keboola use_extension=False still inserts _meta rows without inner views). - app/web/templates/admin_tables.html: rewrite operator copy in the register and edit forms to reflect always-wrap. Tests: - tests/test_admin_server_config.py (TestServerConfigBigQueryFields): flip assertions from "field IS present" to "field NOT present" on legacy_wrap_views. Drop the test_post_persists_legacy_wrap_views test since the field no longer exists. - tests/test_admin_server_config_known_fields.py: same flip on the known-fields registry assertion. - tests/test_bigquery_extractor.py: drop the obsolete test_view_entity_does_not_create_master_view_by_default (asserted the bug we fixed) and test_legacy_wrap_views_toggle_restores_old_behavior (toggle no longer meaningful). Update remaining test docstrings. Operators with `legacy_wrap_views: true` set in their overlay get the new (equivalent) behavior automatically — the unrecognized key is silently ignored by the YAML loader. Operators with `false` get the issue-#160 fix as a behavior change, not a regression. Spec gate updated: production code grep gate grep -rn 'legacy_wrap_views' connectors app src config cli must return zero. 
tests/ excluded — historical "removed in #160" breadcrumbs and `assert "X" not in fields` regression guards retained as anti-regression signals.	2026-05-04 10:31:35 +02:00
ZdenekSrotyr	955b56608d	feat(api,web,cli): /admin/workspace-prompt + /api/welcome restored + da analyst writes CLAUDE.md - app/api/claude_md.py: GET /api/welcome (analyst, auth required); GET/PUT/DELETE /api/admin/workspace-prompt-template; POST …/preview; two-pass Jinja2 validation on PUT; validation stub mirrors build_claude_md_context() shape - app/main.py: register claude_md_router - app/web/router.py: GET /admin/workspace-prompt → admin_workspace_prompt.html - app/web/templates/admin_workspace_prompt.html: CodeMirror editor + live preview + status chip + reset modal; mirrors admin_welcome.html for Agent Setup Prompt - app/web/templates/_app_header.html: add "Agent Workspace Prompt" nav item next to "Agent Setup Prompt"; extend _admin_active to cover /admin/workspace-prompt - cli/commands/analyst.py: _init_claude_workspace now accepts server_url + token; _write_claude_md fetches GET /api/welcome, writes CLAUDE.md, graceful 404/5xx; setup command adds --no-claude-md flag to opt out; default = write CLAUDE.md - tests: test_claude_md_api.py (16 tests); test_analyst_bootstrap.py updated with 4 new CLAUDE.md bootstrap tests; test_welcome_template_api.py: update stale assertion about /api/welcome being removed (endpoint restored) - tests/snapshots/openapi.json: regenerated	2026-05-03 22:44:14 +02:00
ZdenekSrotyr	9ad7856f72	fix(devin-review): dashboard CTA respects override; PUT validates anon path Finding #1: _build_context now routes through render_agent_prompt_banner when a DB connection is available, so both /setup and the /dashboard clipboard CTA always reflect the admin override (or the live default when no override is set). Previously _build_context unconditionally used resolve_lines(), ignoring the welcome_template override for the dashboard JS array. Finding #2: PUT /api/admin/welcome-template now performs a second render pass with user=None (anonymous stub) after the authenticated-user pass. Templates that reference user.* fields without an {% if user %} guard are rejected with a clear 400 error explaining the anon-visitor breakage.	2026-05-03 21:45:32 +02:00
ZdenekSrotyr	d18bc4c8f7	fix(api): align PUT validation autoescape with runtime (False); docs match	2026-05-03 21:30:24 +02:00
ZdenekSrotyr	61ef0d0eed	fix(devin-review): address 4 findings on PR #167 - Fix #1: _detect_existing_project now checks .claude/settings.json for "da sync" marker instead of deleted CLAUDE.md; update tests accordingly. - Fix #2: preview endpoint uses autoescape=False to match /setup rendering; align render_agent_prompt_banner in welcome_template.py to the same. - Fix #3: apply _sanitize_banner_html to override render path in setup_page so all render paths sanitize consistently. - Fix #4: move .setup-link-banner into the existing-user branch where account_details.last_sync_display is reachable; remove dead copy from new-user branch.	2026-05-03 21:15:01 +02:00
ZdenekSrotyr	bcb62ff4e2	fix(ui): tighten dashboard token row gap; lift editor/preview labels above panes	2026-05-03 19:51:34 +02:00
ZdenekSrotyr	05f12b416d	fix(ui): dashboard token row alignment + match editor/preview heights	2026-05-03 19:23:50 +02:00
ZdenekSrotyr	dc931a6556	feat(admin-prompt): default = live setup script; override replaces /setup content The /admin/agent-prompt editor now pre-fills with the full bash bootstrap script from setup_instructions.resolve_lines() instead of being empty. When an admin saves an override it replaces the default everywhere — the /setup page display and the dashboard clipboard CTA — rather than adding a banner above the auto-generated commands. GET /api/admin/welcome-template now returns a `default` field with the live computed script so the editor always shows meaningful starting content. {server_url} and {token} single-brace placeholders survive Jinja2 rendering and are substituted by JavaScript at clipboard-copy time as before. Preview pane switches to textContent (not innerHTML) since content is bash.	2026-05-03 16:31:35 +02:00
ZdenekSrotyr	c4d23cf235	feat(admin-prompt): update editor UX + docs for banner context - admin_welcome.html: update subtitle, description, placeholder cheatsheet (drop tables/metrics/marketplaces/sync_interval; add user-null note and security note). Textarea initial value is now empty (no default template to show). Preview pane uses innerHTML (HTML output). refreshStatus sets editor to empty when no override. Preview pane styled as light surface. Reset modal copy updated (no banner shown, not "OSS-shipped template"). - config/claude_md_template.txt: deleted (markdown template is gone; default is now no banner). - docs/agent-setup-prompt.md: rewritten for variant C — describes the /setup banner, smaller placeholder table, security/sanitization notes, anonymous-user guard, example HTML snippet.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	8db4c1645b	feat(admin-prompt): variant C — banner on /setup, drop CLAUDE.md generation - src/welcome_template.py: rewrite as HTML banner renderer (render_agent_prompt_banner); drop _list_tables, _metrics_summary, _marketplaces_for_user, render_welcome, _load_default_template. build_context now exposes only instance/server/user/now/today. _sanitize_banner_html strips script/iframe/on*/javascript: post-render. - app/api/welcome.py: drop get_welcome handler, WelcomeResponse, old _VALIDATION_STUB_CONTEXT. Admin endpoints stay at same URLs; validation stub updated to match new slim context. Preview now uses autoescape=True. - app/web/router.py: setup_page calls render_agent_prompt_banner and passes banner_html to install.html; admin_agent_prompt_page drops _load_default_template. - app/web/templates/install.html: add .setup-banner CSS + banner block above hero. - cli/commands/analyst.py: replace _generate_claude_md with _init_claude_workspace; no CLAUDE.md written, only .claude/CLAUDE.local.md placeholder + settings.json hooks. - tests: delete test_cli_analyst_welcome.py (tests deleted endpoint/function); rewrite TestGenerateClaudeMd → TestInitClaudeWorkspace; update api test to assert /api/welcome returns 404 and remove welcome-fetch tests.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	60386b9c3c	polish: drop dead CSS, fix docstring drift, add agent-prompt route test	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	ecb6c35ad5	feat(admin): rename /admin/welcome to /admin/agent-prompt (Agent Setup Prompt) Rename the welcome prompt editor from /admin/welcome to /admin/agent-prompt and update all UI labels to "Agent Setup Prompt". API endpoint URLs are unchanged (PUT/GET/DELETE /api/admin/welcome-template, GET /api/welcome). - Nav menu: "Welcome prompt" → "Agent Setup Prompt", href updated - Page title and h2 updated in admin_welcome.html - Error message hint in app/api/welcome.py updated to /admin/agent-prompt - Dashboard: replace inline <details> preview of _claude_setup_instructions with a simple link to /setup (Task C) - docs/welcome-template.md renamed to docs/agent-setup-prompt.md; internal references to /admin/welcome updated - OpenAPI snapshot path updated - Tests updated to reflect new route and removed inline preview	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	c7b14fb120	feat(admin): drop setup_banner feature; consolidate into single editor Remove the setup_banner feature (admin-editable /setup page banner) and all associated code: API router, repository, renderer, admin template, tests, and docs. The setup_page handler no longer calls render_setup_banner; the install.html template no longer renders banner_html. The setup_banner DuckDB table (v22) is kept intact for forward-compat with already-migrated instances — only the application code is removed. CHANGELOG updated: setup_banner bullets removed; Agent Setup Prompt (welcome-template feature) now stands alone as the single editable prompt.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	b0ec842804	feat(admin-ui): SRI + CDN fallback for CodeMirror, 301→302 on /install, error sanitization - Add integrity= + crossorigin= to all 4 cdnjs tags in admin_welcome.html and admin_setup_banner.html (I-1) - Add graceful CDN fallback: when CodeMirror is undefined (SRI mismatch or CDN down), degrade to styled plain textarea with polyfill editor interface so save/reset/preview still work (I-1) - Replace fixed 480px editor height with calc(100vh - 320px) for viewport-relative sizing; add min-height: 480px to .welcome-editor-col (M-8) - Change /install redirect from 301 to 302 to prevent indefinite browser caching (I-5) - Sanitize Jinja2 error detail in /api/welcome 500 response: log full error server-side, return generic detail pointing at /admin/welcome (M-7) - Hoist build_context import to module level in app/api/welcome.py (M-11)	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	39146288e1	feat: admin-editable setup_banner on /setup page (schema v22) Adds an optional Jinja2/HTML banner displayed above the bootstrap commands on /setup. Empty by default; admin authors it at /admin/setup-banner. autoescape=True — safe for HTML context. Render failures return "" so a broken banner never breaks /setup. Schema v22: setup_banner singleton table, auto-migration v21→v22.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	40d221f20a	feat(admin-welcome): CodeMirror editor + live preview pane	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	4bcdc4e7d7	feat(dashboard): link Claude Code setup CTA to /setup page	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	85967e14ca	feat(web): rename /install → /setup; nav label 'Setup local agent' - Add GET /setup serving install.html (CLI + Claude Code setup page) - Add GET /install → 301 redirect to /setup for backwards compat - Move first-time setup wizard from /setup to /first-time-setup - Update nav link: href=/setup, label 'Setup local agent', active on both /setup and /install paths - Update page <title> to 'Setup local agent — …' - Update /dashboard and /setup comment in _claude_setup_instructions.jinja - Update tests and OpenAPI snapshot accordingly	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	92fd78cfb4	fix(admin-welcome): redesign with peer chrome, toast, btn-copy	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	ecaa113c68	fix(admin-welcome): credentials: include, real-content preview, refresh after mutate	2026-05-03 16:10:48 +02:00
ZdenekSrotyr	2b3048f77f	feat(web): /admin/welcome editor page	2026-05-03 16:10:48 +02:00
ZdenekSrotyr	93b713900b	fix(api): validate template render on PUT; broaden render-time catch	2026-05-03 16:10:48 +02:00
ZdenekSrotyr	0d1ecd235d	feat(api): /api/welcome + /api/admin/welcome-template endpoints	2026-05-03 16:10:48 +02:00
ZdenekSrotyr	d055417377	feat(config): default welcome template in jinja2 + sync_interval	2026-05-03 16:10:48 +02:00
ZdenekSrotyr	91caefaca9	security(auth): per-IP rate limit + last-admin guard (#165 ) * security(auth): per-IP rate limit on auth endpoints + generalize last-admin guard Closes #45 and #151. #45 — every auth endpoint was unthrottled (login, magic-link, token, bootstrap), leaving us open to password brute-force and SMTP email-bombing. Wires slowapi (new dep) into the middleware chain with per-route limits: 10/min on login + token, 5/min on send-link, 3/min on bootstrap. Returns 429 with Retry-After: 60 once exceeded. Per-IP key respects the leftmost X-Forwarded-For hop (Caddy in front of the app strips client-supplied XFF). Operator escape hatch: AGNES_AUTH_RATELIMIT_ENABLED=0. Test suite disables the limiter via autouse conftest fixture so existing auth tests that hammer endpoints in tight loops are unaffected. #151 — DELETE /api/admin/users/{id}/memberships/{group_id} and the mirror DELETE /api/admin/groups/{group_id}/members/{user_id} only guarded against self-removal as last admin. Generalizes to refuse removing anyone from the seeded Admin group when they are the only remaining active admin (mirrors the existing count_admins(active_only=True) <= 1 check on delete_user / update_user). Recovery from zero admins requires direct DB access, so this closes a path where a scheduler/bootstrap actor that bypasses normal admin checks could otherwise empty the group. * security(auth): throttle remaining email-bombing + token-confirm endpoints Address code-review gap on PR #165 — the first commit covered /send-link but missed two endpoints with the IDENTICAL email-bombing surface: - POST /auth/password/reset — sends reset mail, anti-enum response - POST /auth/password/setup/request — sends setup mail, anti-enum response Both now share the 5/min limit with /send-link. 
Also add 10/min to the token-confirm surfaces — high-entropy tokens but partial leaks via logs / referer have surfaced before, and unbounded guess rate would let an attacker exhaust the keyspace adjacent to a leaked prefix: - POST /auth/email/verify - GET /auth/email/verify — closes the click-through bypass - POST /auth/password/reset/confirm - POST /auth/password/setup/confirm Doc fix: rate_limit.py module docstring + CHANGELOG entry no longer claim "disable without a redeploy" (misleading). The Limiter constructor freezes `enabled` from env at import time, matching every other Agnes env knob — operators set the flag and bounce the container. Tests: 4 new cases in test_auth_rate_limit.py covering /reset, /setup/request, /reset/confirm, GET /verify. Full suite: 2583 passed, 32 skipped, 0 failed. * security(auth): throttle JSON /auth/password/setup — closes form-throttle bypass Second code-review pass on PR #165 caught a fifth gap: POST /auth/password/setup (JSON variant, kept for backward compat) consumes the same setup_token as the web form /setup/confirm but was unthrottled — an attacker brute-forcing the token just switches from the form path to the JSON path and resumes at unbounded RPS. Apply the same 10/min limit and signature shape used on /setup/confirm. Also extend CHANGELOG note about the JSON-variant bypass for future operators reading the security entry. Test: 1 new case (test_password_setup_json_rate_limited_after_10_requests), 9 rate-limit tests + 28 password-flow tests + 41 auth-provider tests pass, no regressions. * chore(release): cut 0.30.1 — auth security hardening (rate limit + last-admin guard)	2026-05-02 21:08:33 +02:00
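The per-IP throttling described in 91caefaca9 can be illustrated with a small stdlib-only sketch. The commit itself wires in slowapi; the `client_ip` helper and `SlidingWindowLimiter` class below are hypothetical stand-ins written for illustration, not slowapi's API and not the code in `rate_limit.py`. Taking the leftmost `X-Forwarded-For` hop is only safe under the commit's stated assumption that the reverse proxy strips client-supplied values.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Mapping


def client_ip(headers: Mapping[str, str], peer_addr: str) -> str:
    """Return the leftmost X-Forwarded-For hop, else the socket peer address."""
    forwarded = headers.get("x-forwarded-for", "")
    first_hop = forwarded.split(",")[0].strip()
    return first_hop or peer_addr


class SlidingWindowLimiter:
    """Hypothetical in-memory limiter: at most `limit` hits per key per window."""

    def __init__(
        self,
        limit: int,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self._clock()
        hits = self._hits[key]
        # Forget hits that have aged out of the window.
        while hits and now - hits[0] >= self._window:
            hits.popleft()
        if len(hits) >= self._limit:
            return False  # the caller would answer 429 with Retry-After
        hits.append(now)
        return True
```

A login route would call `limiter.allow(client_ip(request_headers, peer))` with `limit=10`, and the send-link and reset routes would use a second instance with `limit=5`.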
ZdenekSrotyr	dc03837a7b	feat(query-api): better error message when --remote query references a materialized-but-not-rebuilt id E2E sub-agent finding: `da query --remote "SELECT * FROM <id>"` against a materialized table that hasn't yet been rebuilt in the server's analytics.duckdb returns a confusing DuckDB "Table does not exist" message even though the table is in the registry. Materialized rows produce parquets at `${DATA_DIR}/extracts/<source>/data/<id>.parquet`, but the orchestrator's master-view creation is `_meta`-driven — fresh instances or pre-tick states have the registry row without a corresponding view, so analysts hit the bare "does not exist" with no path forward. Improve the error rendering in `app/api/query.py:execute_query`. When DuckDB raises a "table does not exist" error, scan the registry for any `query_mode='materialized'` row whose id or name appears in the failed SQL. On a hit, return a 400 whose detail names the table, explains the materialize state, and offers two concrete next steps: 1. Run `da sync` (or wait for the scheduler tick / hit POST /api/sync/trigger) to materialize the parquet, OR 2. Query the source directly via the catalog alias when the registry row carries bucket+source_table (e.g. `bq."dataset"."table"` for BigQuery, `kbc."bucket"."table"` for Keboola). Detection is bounded — the registry round-trip only fires when DuckDB's error mentions a missing table, so happy-path queries pay no cost. Non-materialized unknowns fall through to DuckDB's raw error. 2 new tests: materialized id surfaces the hint with the bucket+source_table payload; unknown table falls back to the generic error path with no false positive on the new hint.	2026-05-01 23:09:52 +02:00
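The detection step in dc03837a7b can be sketched roughly as follows. The function names and the dict-shaped registry rows are assumptions made for illustration; the real logic lives in `app/api/query.py:execute_query` and may differ in how it matches names.

```python
import re
from typing import Iterable, Mapping, Optional


def is_missing_table_error(message: str) -> bool:
    """Only a 'table ... does not exist' error should trigger the registry scan."""
    lowered = message.lower()
    return "table" in lowered and "does not exist" in lowered


def find_materialized_row(sql: str, registry: Iterable[Mapping]) -> Optional[Mapping]:
    """Return the first materialized registry row whose id or name appears in the SQL."""
    # Compare whole identifiers so 'orders' does not match 'orders_v2'.
    identifiers = set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))
    for row in registry:
        if row.get("query_mode") != "materialized":
            continue
        candidates = {str(row.get("id", "")).lower(), str(row.get("name", "")).lower()}
        if candidates & identifiers:
            return row
    return None
```

Because the scan runs only after `is_missing_table_error` returns true, successful queries never touch the registry, which matches the "happy-path queries pay no cost" note in the commit.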
ZdenekSrotyr	8030a867ec	fix(admin-api): keep source_type validator permissive when primary is 'local' (bootstrap) The strict source_type-availability validator from the prior commit broke ~12 existing tests that register tables on the default test instance (where `data_source.type` resolves to 'local' since no instance.yaml is loaded). The intent of the validator is to catch explicit misconfig: `type=bigquery` instance + `source_type=keboola` payload with no `data_source.keboola.*` block. The bootstrap workflow — admin sets up a fresh instance and registers a few tables before pointing at a real source — should not be gated here. Loosen the check: when `get_data_source_type()` returns 'local' (the fallback when no `data_source.type` is set), skip the rejection. The explicit mismatch case still 422s because that path resolves `configured_primary` to a real source type. Also adds an autouse keboola_instance fixture to test_journey_sync_query.py which exercises Keboola registrations through the full sync→query flow — the fixture documents the test's data-source assumption rather than relying on the bootstrap escape hatch.	2026-05-01 23:09:15 +02:00
ZdenekSrotyr	bc3ba0d43d	feat(admin-api): reject register-table for source_type not configured on instance E2E sub-agent finding: instance configured with `data_source.type='bigquery'` and no `data_source.keboola.` block. Admin POSTs `{source_type: 'keboola'}` to /api/admin/register-table → returns 201, row lands in the registry, but never syncs because the scheduler has no Keboola URL/token to ATTACH against. Operator only notices the gap when `da catalog` keeps showing nothing. The new `_validate_source_type_configured` helper runs immediately after the id/view-name collision checks in `register_table`. A source_type is considered configured when: - it matches `get_data_source_type()` (the instance's primary), OR - a non-empty `data_source.<source_type>` block exists in the effective `instance.yaml` (multi-source instance), OR - it's in `_SOURCE_TYPES_INDEPENDENT_OF_DATA_SOURCE` (Jira / local — both get data through paths that don't involve `data_source.`). Returns 422 with a message that names the configured primary source and points at `/admin/server-config` for enabling a secondary one. None / empty source_type is still tolerated for backward compat with legacy CLI scripts that don't set the field — the route resolves it later. 5 new tests cover: keboola-on-bq rejected, bq-on-keboola rejected, matching source_type still works, jira allowed regardless, omitted source_type passes through. Existing tests that registered Keboola rows on the unconfigured default test instance now opt into a `keboola_instance` fixture to satisfy the new validator (tests/test_admin_bq_register.py + .keboola_materialized + .unregister_cleanup; the multi-source PUT test in test_admin_bq_register adds a `keboola` block to its synthetic config). Pre-existing test_missing_project_returns_error failure in TestRebuildFromRegistry is unrelated (config-cache leakage from a previous test in the same class) — confirmed pre-existing on the prior commit via `git stash` reproduction.	
2026-05-01 23:04:51 +02:00
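A minimal sketch of the "is this source_type configured" decision from bc3ba0d43d, including the bootstrap loosening added by the follow-up 8030a867ec. The function name, argument shapes, and constant name are hypothetical; the actual helper is `_validate_source_type_configured` in the admin API and reads the instance config itself.

```python
from typing import Mapping, Optional

# Assumption: mirrors the commit's list of types that do not depend on data_source.
SOURCE_TYPES_INDEPENDENT = frozenset({"jira", "local"})


def is_source_type_configured(
    source_type: Optional[str],
    primary: str,
    data_source_cfg: Mapping[str, Mapping],
) -> bool:
    """Decide whether a register-table request should pass the source_type check."""
    if not source_type:
        return True  # legacy clients omit the field; the route resolves it later
    if primary == "local":
        return True  # bootstrap: no data_source.type set yet, stay permissive
    if source_type == primary:
        return True
    if data_source_cfg.get(source_type):
        return True  # non-empty secondary block on a multi-source instance
    return source_type in SOURCE_TYPES_INDEPENDENT
```

A `False` result corresponds to the 422 response described in the commit message.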
ZdenekSrotyr	dd46461c6c	fix(admin+orchestrator): DELETE registry drops parquet + sync_state; rebuild skips orphan parquets E2E sub-agent finding: register a materialized BQ row → sync to materialize the parquet at `/data/extracts/bigquery/data/<id>.parquet` → DELETE the registry row. The DB row goes away but: - the parquet file stays on disk forever, AND - the sync_state row stays, so `/api/sync/manifest` keeps advertising the dropped table to `da sync`, AND - the orchestrator's next rebuild can resurrect a master view by picking up the leftover parquet. Two-part fix in `unregister_table`: 1. For materialized rows on bigquery/keboola, remove `${DATA_DIR}/extracts/<source_type>/data/<name>.parquet` (and any stale `<name>.parquet.tmp` from a crashed prior materialize). Filename is keyed on `table_registry.name` to match sync_state bookkeeping. File-removal errors are logged but don't fail the DELETE — the registry row is already gone, and an orphan parquet won't get a master view at next rebuild because the orchestrator's _meta-driven scan never picks up bare parquet files. 2. Always clear `sync_state` + `sync_history` rows for the dropped table_id so the manifest stops advertising the table — applies to all source types and modes, not just materialized, since any synced row had a sync_state entry. Orchestrator-side defensive guard (Finding 2b) is a no-op in the current implementation: `_attach_and_create_views` only creates master views from `_meta` rows in each connector's `extract.duckdb`, so a parquet without a matching `_meta` entry is already invisible to the rebuild. The new test `test_orchestrator_skips_orphan_parquet_in_extracts` is kept as a regression guard for that contract. 5 tests cover: BQ + Keboola materialized DELETE removes parquet, remote DELETE doesn't error trying to remove a non-existent file, sync_state cleared on DELETE, orchestrator orphan-skip invariant.	2026-05-01 22:54:11 +02:00
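The file cleanup in part 1 of dd46461c6c might look like the sketch below. The function name and return value are assumptions; only the path layout (`extracts/<source_type>/data/<name>.parquet` plus a stale `.parquet.tmp`) and the "log but do not fail" behavior come from the commit message.

```python
import logging
from pathlib import Path
from typing import List

logger = logging.getLogger(__name__)


def remove_materialized_files(data_dir: Path, source_type: str, name: str) -> List[Path]:
    """Delete <name>.parquet and any stale <name>.parquet.tmp; never raise."""
    base = data_dir / "extracts" / source_type / "data"
    removed: List[Path] = []
    for candidate in (base / f"{name}.parquet", base / f"{name}.parquet.tmp"):
        try:
            candidate.unlink()
            removed.append(candidate)
        except FileNotFoundError:
            continue  # e.g. a remote row that never produced a parquet
        except OSError as exc:
            # The registry row is already gone, so log and carry on.
            logger.warning("could not remove %s: %s", candidate, exc)
    return removed
```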
ZdenekSrotyr	f0979f997a	fix(admin-api): reject backtick BQ-native source_query at register; surface materialize errors per-row E2E testing showed admin POSTs of materialized BQ rows whose source_query uses BigQuery-native backtick identifiers (`prj.ds.t`) silently no-op'd at the next sync tick — the materialize path runs the SQL through the DuckDB BQ extension's COPY which uses DuckDB's parser; backticks aren't recognized and the query either parse-errors or matches zero rows. No parquet lands at the canonical path and no error reaches an operator-visible surface. Two-part fix: 1. RegisterTableRequest's _check_mode_query_coherence model_validator now rejects any source_query containing a backtick with a 422 + actionable message pointing at the DuckDB equivalent (bq."dataset"."table"). Same check is applied in update_table on the merged record so PATCHes that flip a stored source_query to backtick form are also caught. Covers BQ AND Keboola materialized rows since both connectors funnel source_query through DuckDB's COPY. 2. _run_materialized_pass now persists per-row failures via the new SyncStateRepository.set_error / clear_error methods (existing sync_state.error / status columns — no schema migration). GET /api/admin/registry enriches each row with `last_sync_error` from a single batched SELECT against sync_state, so the admin UI / da admin status can show "this table failed last sync because: X" instead of operators having to trawl scheduler logs. Recovered rows have the error cleared automatically — update_sync's success path resets status='ok' / error=NULL on the upsert. The materialized-path test fixture's _materialized_payload helper is updated to use DuckDB-flavor SQL (the prior backtick example pre-dated the fix). 6 new tests cover register/update rejection on BQ + Keboola, the sync_state error persistence, and the registry response surface.	2026-05-01 22:51:02 +02:00
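The backtick rejection in part 1 of f0979f997a reduces to a single character check. The sketch below is a plain function rather than the Pydantic `model_validator` the commit describes, and the function name and error wording are assumptions.

```python
def reject_backticks(source_query: str) -> str:
    """Raise ValueError for BigQuery-native backtick identifiers; otherwise pass through."""
    if "`" in source_query:
        raise ValueError(
            "source_query uses backtick identifiers, which DuckDB's parser does not "
            'recognize; use the DuckDB form instead, e.g. bq."dataset"."table"'
        )
    return source_query
```

In a request model the `ValueError` would surface as the 422 mentioned in the commit. Running the same function on the merged record in the update path covers PATCHes that introduce a backtick later.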
ZdenekSrotyr	a4339ce679	fix(admin+diagnose): address 2 additional Devin Review findings on PR #152 Devin's second review pass on commit `16938ae7` surfaced 2 more issues: BUG_pr-review-job-58ae3148_0001 — non-BQ materialized via PUT bypasses source_query check app/api/admin.py update_table only enforces 'query_mode=materialized requires source_query' for source_type='bigquery' rows (via the synthetic RegisterTableRequest at line 2129+). Non-BQ source types (Keboola) skip the check — admin could PUT {query_mode: materialized} on a Keboola local row without source_query, persist successfully, then crash at the next sync tick when kb_materialize_query received sql=None and DuckDB rejected COPY (None) TO '...'. Fix: generic coherence guard before the BQ-specific block — for ALL source types, query_mode='materialized' requires non-empty source_query in the merged record. Returns 422 with a hint about reverting via query_mode='local'/'remote'. ANALYSIS_pr-review-job-642ff90f_0007 — diagnose returns 'ok' on BQ resolution failure app/api/health.py:_check_bq_billing_project caught get_bq_access() exceptions and returned status='ok' with a 'could not resolve' detail. Automated alerting keyed on status != 'ok' would silently miss missing google-cloud-bigquery, auth failures, or malformed config. Fix: return status='unknown' on resolution failure — surfaces it on operator dashboards without promoting the overall health to 'degraded' (which 'warning' does, intentionally for the billing==project case). 
Tests: - test_update_keboola_to_materialized_without_source_query_rejected: PUT {query_mode: materialized} on a Keboola local row returns 422 with 'source_query' in the detail - test_diagnose_returns_unknown_status_when_bq_resolution_fails: when get_bq_access raises, the bq_config service entry surfaces status='unknown' (not 'ok') Full sweep: 2507 passed, 25 skipped, 0 failed (+2 from previous sweep because of the 2 new regression tests; 8 pre-existing internal_roles schema-migration failures still ignored per task brief).	2026-05-01 21:21:23 +02:00
ZdenekSrotyr	16938ae7cb	fix(materialized): address 4 Devin Review findings on PR #152 Devin Review on commit `7052a235` flagged 4 real bugs in the Keboola materialized path. All four are fixed; 3 new regression tests pin the behavior so future refactors can't quietly regress. BUG_pr-review-job-3fbd31c9_0001 — _run_materialized_pass gated behind 'if bq_project:' app/api/sync.py:444-466 wrapped the entire materialized pass (which dispatches BOTH BigQuery AND Keboola rows by source_type) in a check for data_source.bigquery.project being non-empty. On Keboola-only instances this short-circuited and Keboola materialized rows sat in table_registry forever without their SQL being evaluated — the feature CHANGELOG advertised was dead code on the most common deployment shape. Fix: always run the materialized pass; the BQ branch's per-row try/except catches the typed BqAccessError(not_configured) the sentinel raises when no BQ project is set, so non-BQ instances incur a per-row error for any (hypothetical) BQ-tagged row but the Keboola path runs cleanly. Log line renamed 'Materialized BQ' → 'Materialized SQL' to match. BUG_pr-review-job-3fbd31c9_0004 — wrong config key 'url' instead of 'stack_url' app/api/sync.py:149 read get_value('data_source', 'keboola', 'url'), but the canonical config key documented in instance.yaml.example:111 and used by app/api/admin.py:1503 + 2359 is 'stack_url'. Production Keboola instances would always see an empty URL and fail with the 'not configured' error. The pre-existing test patched the wrong key too, so it passed without catching the mismatch. Fix: use stack_url in both sync.py and the test fixture. BUG_pr-review-job-3fbd31c9_0003 — no atomic write in Keboola materialize_query connectors/keboola/extractor.py wrote COPY directly to the final '<id>.parquet' path. A mid-COPY failure (network, disk full, extension crash) left a partial parquet that the orchestrator rebuild would later pick up and serve to analysts. 
BQ's materialize_query already uses a '<id>.parquet.tmp' staging path + os.replace() atomic swap (connectors/bigquery/extractor.py:370-445); Keboola now mirrors that pattern with the same try/except cleanup on COPY failure. BUG_pr-review-job-3fbd31c9_0002 — full file read into memory for MD5 Same file:60-62 used parquet_path.read_bytes() for the MD5 hash. Multi-GB Keboola materialized results would OOM on memory-constrained containers. BQ's version uses streaming 8 KiB-chunk hashing (connectors/bigquery/extractor.py:438-442); Keboola now mirrors it. Tests: - test_run_sync_runs_materialized_pass_on_keboola_only_instance — pins BUG_0001's fix; setting bigquery.project='' must NOT skip Keboola materialized dispatch - test_keboola_materialize_atomic_write_on_failure — pins BUG_0003; a mid-COPY RuntimeError leaves no .parquet AND no .parquet.tmp at the canonical path - test_keboola_materialize_uses_tmp_path_during_copy — documents the atomic-write contract: COPY targets .parquet.tmp, final swap to .parquet (no .tmp suffix on the result['path']) - existing test_run_materialized_pass_dispatches_keboola_to_keboola_extractor fixture updated: stack_url instead of url Full sweep: 2505 passed, 25 skipped, 0 failed (modulo 8 pre-existing internal_roles schema-migration failures called out in the task brief).	2026-05-01 20:58:17 +02:00
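The two patterns this commit borrows from the BigQuery extractor, staging to `<id>.parquet.tmp` followed by an `os.replace()` swap and chunked MD5 hashing, can be sketched with the standard library alone. The `produce` callback stands in for the DuckDB COPY call, and both function names are hypothetical.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> Path:
    """Let `produce` write to a .tmp sibling, then swap it into place."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)
        os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    except Exception:
        tmp_path.unlink(missing_ok=True)  # leave no partial file behind
        raise
    return final_path


def md5_of_file(path: Path, chunk_size: int = 8192) -> str:
    """Hash the file in 8 KiB chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If `produce` raises midway, neither the final path nor the `.tmp` file remains, which is the contract that `test_keboola_materialize_atomic_write_on_failure` pins.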
ZdenekSrotyr	b627de8344	feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs Diagnostic + operator-facing documentation that closes the loop on the work in this PR. `da diagnose` (via /api/health/detailed): - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it. - Non-BQ instances (Keboola-only, etc.) skip the check. - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes. config/instance.yaml.example documentation: - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before) - ai.base_url: provider list + UI hint - openmetadata + desktop: 'configurable via /admin/server-config UI' headers - corporate_memory: leading note that the schema is editable via UI Other docs: - CHANGELOG.md: comprehensive Unreleased section - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention - README.md: mode-first source table summary - docs/architecture.md: per-connector tab UI mention - cli/skills/connectors.md: bootstrap rails (parallel to #154) - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines) - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone) Tests: - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)	2026-05-01 20:27:24 
+02:00
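A rough sketch of the billing-project check. It follows the status values named in this commit, plus the `unknown` status that the later fix a4339ce679 introduced for resolution failures. The function signature and the `resolve_projects` callback are assumptions; the real helper is `_check_bq_billing_project()` behind `/api/health/detailed`.

```python
from typing import Callable, Dict, Optional, Tuple


def check_bq_billing_project(
    data_source_type: str,
    resolve_projects: Callable[[], Tuple[str, str]],
) -> Optional[Dict[str, str]]:
    """Return a health entry for BigQuery instances, or None when the check is skipped."""
    if data_source_type != "bigquery":
        return None  # Keboola-only and other non-BQ instances skip the check
    try:
        billing, data = resolve_projects()
    except Exception as exc:
        # Report 'unknown' rather than 'ok' so alerting on status != 'ok' still fires.
        return {"status": "unknown", "detail": f"could not resolve BigQuery config: {exc}"}
    if billing == data:
        return {
            "status": "warning",
            "detail": "BigQuery billing project equals data project",
        }
    return {"status": "ok", "detail": ""}
```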
ZdenekSrotyr	df7f5b1d9a	feat(admin-ui): /admin/server-config known-fields registry + structured nested editor Today /admin/server-config renders fields by iterating Object.keys(payload) on the YAML value — if a key isn't in instance.yaml, the operator can't see it. They have to know to type it via the JSON-patch textarea (which only renders for empty sections) or SSH and edit YAML. Adds a known-fields registry (`_KNOWN_FIELDS` in app/api/admin.py) the UI consumes alongside the YAML payload. Renderer shows BOTH: - existing fields (from YAML) with current value - known-but-unset fields with dashed-border placeholder + hint, ready to fill in Renderer (`renderField`, `renderSection`, `collectSection`): - kind="string"\|"secret"\|"bool"\|"int"\|"select"\|"object"\|"array"\|"map" — picks input type - kind="object" with `fields` — recursive structured form, arbitrary depth (corporate_memory needs 3-4 levels) - kind="array" with `item_kind` — vertical stack of typed inputs + add/remove buttons - kind="map" with `key_kind` + `value_kind` — key:value rows + add/remove (used for confidence.base, domain_owners, entity_resolution.entities) - data-path encoded as JSON segment array so map keys with embedded dots (e.g. 
'user_verification.correction') survive collect → patch round-trip - .cfg-field.is-unset CSS — dashed border, muted label, italic hint Sections newly exposed (added to _EDITABLE_SECTIONS): - openmetadata: url, token (secret), cache_ttl_seconds, verify_ssl - desktop: jwt_issuer, jwt_secret (secret), url_scheme Known fields populated for existing sections: - data_source.bigquery: billing_project (the cause of the 403 USER_PROJECT_DENIED footgun when SA can read but not bill the data project), legacy_wrap_views (bigquery_query() wrap for VIEWs — issue #101 default off, ON for view-heavy deployments), max_bytes_per_materialize (cost guardrail) - data_source.keboola: stack_url, project_id (hints; values already populated) - ai: base_url (required for openai_compat), structured_output (select) - corporate_memory: full schema from instance.yaml.example — distribution_mode, approval_mode, review_period_months, notify_on_new_items, sources.{claude_local_md,session_transcripts}, extraction.{model,sensitivity_check,contradiction_check}, confidence.{base,modifiers,decay.{mode,half_life_months,decay_rate_monthly,floor}}, contradiction_detection.{enabled,max_candidates}, entity_resolution.{enabled,entities}, domain_owners, domains - Known partial: confidence.modifiers is map<string, map<string, float>> — falls through to JSON-textarea with TODO; structured editor for that one shape needs more renderer work Tests: - test_admin_server_config_known_fields — registry envelope shape, smoke fixture - test_admin_server_config_renderer_depth — 4-level nested objects, arrays of strings, maps of floats, dotted-key safety - test_admin_server_config_corp_memory — full corporate_memory schema, 12 fields incl. nested - test_admin_server_config — existing tests adjusted for new shape	2026-05-01 20:27:01 +02:00
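The dotted-key problem that the JSON-encoded `data-path` solves is easy to show in isolation. The renderer in the commit is browser-side JavaScript; the Python below only illustrates the encoding idea, and both helper names are hypothetical.

```python
import json
from typing import Any, Dict, List


def encode_path(segments: List[str]) -> str:
    """Serialize a path as a JSON array so dots inside a key stay part of that key."""
    return json.dumps(segments)


def set_by_path(doc: Dict[str, Any], encoded_path: str, value: Any) -> Dict[str, Any]:
    """Write `value` into the nested dict `doc` at the location given by an encoded path."""
    segments = json.loads(encoded_path)
    node = doc
    for segment in segments[:-1]:
        node = node.setdefault(segment, {})
    node[segments[-1]] = value
    return doc
```

With a plain dot-joined path, `confidence.base.user_verification.correction` would split into four segments instead of three, and the patch would land under a spurious `user_verification` object.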
ZdenekSrotyr	c63f54d643	feat(admin-ui): /admin/tables per-connector tabs + Keboola materialized parity + form cleanup + Manage access deep link Replaces the single mixed Jinja-branched form at /admin/tables with a per-connector tab interface and brings Keboola to capability parity with BigQuery. Tab structure: - BigQuery tab: Register modal with two-question radio model (Q1 Live \| Synced × Q2 Whole \| Custom SQL), Discover datasets / List tables / Use-table-as-base autocomplete buttons, table-vs-view auto-detection hint, per-tab listing filter - Keboola tab: same two-question radio (Q2 only — no Live mode for Keboola), Custom SQL textarea against kbc."bucket"."table" for materialized rows - Jira tab: read-only listing (Jira is webhook-driven; no Register form) - Active tab persists in window.location.hash so refresh keeps the operator in place Form cleanup (within tabs): - Drops the misleading 'Sync Strategy' dropdown — runtime never read it (only profiler.is_partitioned() consumes the value for parquet-layout detection); kept in DB for back-compat (Pydantic deprecated) - Adds Sync Schedule input to Keboola Register/Edit (was missing — scheduler honored per-table cron via is_table_due() for every source but the Keboola UI had no surface) - Hides Primary Key under <details>Advanced with clarifying hint that it's catalog-metadata only (Agnes does not perform upsert/dedup; every sync is a full overwrite) - Drops the Strategy column from the registry listing (every Keboola row defaulted to full_refresh after Strategy was hidden — column was noise) - Removes the legacy out-of-tab #registerModal + the legacy global Discovery panel; each tab now owns its own header + Register button + listing div Edit modal: - BigQuery Edit modal physically relocated into <section id="tab-content-bigquery"> (mirrors Phase E Register placement) - Keboola Edit modal mirrors Register (same Q2 radio, Discover/List buttons via parameterized helpers) - openEditModal(table) dispatches by 
source_type to the right modal — fixes a quiet bug where Phase F's openEditKeboolaModal was never wired up and Keboola edits silently used the legacy modal Per-row Manage access deep link: - Each row in the per-tab listing has a lock-icon button between Edit and Delete that navigates to /admin/access#table:<table_id> - admin_access.html bootstrap reads window.location.hash and pre-fills the resource filter, mirroring the existing ?group=<id> deep-link pattern Tests: - test_admin_tables_tab_ui.py — tab nav, hash persistence, register-button-per-tab, listing partition by source_type, Manage access deep link - test_admin_tables_ui_materialized.py — two-question radio (BQ + Keboola), Discover/List/Use-as-base buttons, Edit modal parity, Jira read-only	2026-05-01 20:26:29 +02:00
ZdenekSrotyr	85d3810535	feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors. Backend (BigQuery + Keboola, schema v20): - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19) - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, , bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload Operator + CLI surface: - da admin register-table --query / --query-mode materialized - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators Tests (incl. 
spike + integration + regression): - test_db_migration_v20, test_table_registry_source_query - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_ creds) - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized - test_api_admin_materialized, test_cli_admin_materialized - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.	2026-05-01 20:25:56 +02:00
minasarustamyan	d4ac84dd46	feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150 ) * feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu. Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass) de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje explicitní resource_grants(group, "table", id) řádek. Schema v18 → v19 (src/db.py:_v18_to_v19_finalize): - DROP TABLE dataset_permissions, access_requests - DROP COLUMN users.role (NULL artifact since v13) - DROP COLUMN table_registry.is_public - Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT → drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách s historic FK constraints. INSERT picks intersection sloupců, takže test fixtures s minimal pre-v19 schemou migrate cleanly. 
Runtime: - src/rbac.py:can_access_table → deleguje na app.auth.access.can_access - DatasetPermissionRepository, AccessRequestRepository smazány - AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn (TABLE je unconditionally enabled) API drop: - app/api/permissions.py, app/api/access_requests.py celé soubory - /admin/permissions web route + admin_permissions.html - "Request Access" modal v catalog.html + locked-row UI - ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut je uvnitř can_access_table) - /api/settings: drop permissions field z GET; PUT /api/settings/dataset gate přepnut na can_access(user_id, "table", dataset, conn) Auth: - app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored) - app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest (admin promotion = explicit add to Admin group via memberships API) - src/repositories/users.py: drop role z create() / update() CLI: - da admin set-role smazán → hard-fail s replacement command - da admin add-user --role flag pryč - da auth import-token --role flag pryč - da auth whoami: drop "Role:" výpis - cli/config.py:save_token: role parametr now optional, no longer written (back-compat se starými token.json soubory zachována — pole se ignoruje) Tests: - DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py - REWRITE: test_access_control.py (resource_grants flow), test_rbac.py (can_access_table over resource_grants), test_journey_rbac.py (drop access-request flow), test_resource_types.py (drop env-gate tests, drop is_public from helpers), test_v2_.py (drop role-based user dicts in favor of id-based + Admin group membership), test_settings_api.py (no permissions field, can_access gate) - TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create a 3rd positional role z create_access_token - NEW: test_v18_to_v19 migration test (test_db.py), 
test_can_access_table_no_implicit_public (test_rbac.py), test_admin_set_role_returns_hardfail (test_cli_admin.py) - OpenAPI snapshot regenerated Docs: - CHANGELOG: BREAKING entry under [Unreleased] - CLAUDE.md: schema v18 → v19 - docs/architecture.md: schema table + RBAC section rewritten - docs/auth-google-oauth.md: admin promotion via da admin break-glass - cli/skills/security.md: completely rewritten for the group-based model - docs/TODO-rbac-data-enforcement.md: deleted (TODO done) Test results: 2363 passed, 19 failed. The remaining failures are pre-existing Windows-specific issues (fcntl, charset) unrelated to this PR, verified with git stash pop. Plan: ~/.claude/plans/floofy-coalescing-parnas.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(release): cut 0.27.0 --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 22:02:16 +02:00
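Note on d4ac84dd46: the table-rebuild idiom named in the commit above (rename, create new, INSERT … SELECT over the shared columns, drop old) can be sketched roughly as follows. This is an illustration only, not the code in src/db.py. It uses Python's stdlib sqlite3 as a stand-in for DuckDB so that it runs without extra dependencies, and the helper name rebuild_without_columns and the __old suffix are hypothetical.

```python
import sqlite3


def rebuild_without_columns(conn, table, new_ddl):
    """Rebuild `table` with the schema in `new_ddl`, keeping shared columns.

    Hypothetical helper; sqlite3 stands in for DuckDB here.
    """
    old = f"{table}__old"  # assumed temporary name, not from the commit
    # 1. rename the existing table out of the way
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    # 2. create the table again with the new (narrower) schema
    conn.execute(new_ddl)
    # 3. copy only the columns present in BOTH schemas, so a minimal
    #    pre-migration fixture that lacks some columns still migrates
    old_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({old})")]
    new_cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    shared = ", ".join(c for c in new_cols if c in old_cols)
    conn.execute(f"INSERT INTO {table} ({shared}) SELECT {shared} FROM {old}")
    # 4. drop the old copy
    conn.execute(f"DROP TABLE {old}")
```

Copying the column intersection rather than a fixed column list is what lets fixtures with a reduced schema pass through the same migration path.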
minasarustamyan	fb1573766a	feat(admin): users/groups UI polish + SSO lock + v18 migration (#142 ) Cuts release 0.24.0. ## Highlights - SSO-managed accounts read-only for password / delete operations (UI + API). New `is_sso_user` flag derived from group memberships. - Admin/Everyone system rows show `google_sync` chip + Workspace email subtitle when env-mapped. - Origin pill vocabulary unified across `/admin/groups`, `/admin/access`, `/admin/users`, `/admin/users/{id}`, `/profile` (Admin yellow, Everyone gray, google_sync green, custom purple). - Effective-access readout no longer short-circuits for admin users — always renders per-resource breakdown. - Schema migration v18 drops stranded non-google memberships in env-mapped Admin/Everyone groups (cleans up v13's blanket Everyone backfill). ## Devin findings addressed - _is_sso_user requires source='google_sync' on system-group branches (so v13 system_seed memberships in env-mapped Everyone don't lock out the admin). - POST add-to-group returns correct origin via _derive_origin (matching GET). - 8 customer-specific token instances (groupon.com / foundryai) replaced with vendor-neutral placeholders across templates, tests, and CHANGELOG. - deriveDisplayName name-skip for canonical "Admin"/"Everyone" so an overlapping AGNES_GOOGLE_GROUP_PREFIX doesn't mangle the chip text. See CHANGELOG [0.24.0] for full notes.	2026-04-30 15:16:04 +02:00
ZdenekSrotyr	70672204fe	feat(memory): admin Edit + MEMORY_DOMAIN RBAC + ai-section UI (#141 ) Cuts release 0.23.0. ## Highlights - Single-item Edit button on every memory item card (modal hits PATCH /api/memory/admin/{id}). - MEMORY_DOMAIN RBAC resource type — admins grant user_groups access to specific domains via /admin/access. Composes with existing audience filter (OR semantics, no-op when no grants). - ai: section editable in /admin/server-config — admins can set ANTHROPIC_API_KEY / model / provider / base_url for the corporate-memory extractor without editing instance.yaml directly. api_key auto-masked. ## Devin findings addressed - Modal NULL→empty fix (audience visibility wouldn't break). - Stats endpoint granted_domains parity with list endpoint. - Documented intentional MEMORY_DOMAIN→audience bypass. - Documented conscious ai.base_url SSRF exclusion (legit internal LiteLLM/vLLM proxies). See CHANGELOG [0.23.0] for full notes.	2026-04-30 11:04:41 +02:00
ZdenekSrotyr	83adf01bde	fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138 ) * docs(spec): #134 unify BigQuery access behind BqAccess facade Brainstorm output for issue #134. Captures: - root cause (incl. correction of the issue's hypothesis about commit 33a9964) - BqAccess facade API + project resolution rules - error contract — typed BqAccessError mapped to HTTP 502 for upstream BQ failures, 500 for deployment/config bugs - migration plan for v2_scan, v2_sample, RemoteQueryEngine - test rewrite eliminating _bq_client_factory injection point - E2E verification protocol on agnes-development as success criterion * docs(spec): #134 revise after first review Incorporates code-reviewer findings: Must-fix: - Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope. - Reframe v2_scan headline: missing try/except around BQ calls is the actual cause of bare 500s, not project resolution (which 33a9964 fixed). - List two more deferred call sites (extractor.py, register_bq_table) with explicit rationale. Important: - Drop billing != data clause from cross_project_forbidden heuristic; rely only on 'serviceusage' substring. billing != data is normal for cross-project setup, was over-classifying. - Split bq_bad_request into _user (400) and _server (502) variants; add sql_origin parameter to translate_bq_error so call sites declare whether SQL contains user input. - Add @functools.cache to BqAccess.from_config; document tests bypass via dependency_overrides. - Replace monkey-patched-classmethod test pattern with BqAccess(client_factory=...) injection at construction time. Cleaner than today's _bq_client_factory and 1:1 migration shape. - Keep BqProjects.data (reviewer assumed registry has source_project; it doesn't). Multi-project explicitly listed as non-goal with note. Nice-to-have: - Add 'Implementation strategy' section: 2 staged commits (bug fix alone is revertable; refactor follows). 
- Extend E2E protocol to cover all three endpoints, not just /sample. - Note removal of stale docstring at src/remote_query.py:204. * docs(spec): #134 revision 3 — incorporates second-round review Must-fix from second review: - v2_schema split into two migration cases: _fetch_bq_schema translates errors via translate_bq_error; _fetch_bq_table_options preserves its swallow-all 'except Exception → return {}' so /schema doesn't 502 on partition-info failures. - RemoteQueryEngine.__init__ now resolves BqAccess lazily (in _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests in test_remote_query.py would suddenly fail with not_configured. - translate_bq_error pass-through for BqAccessError is now load-bearing (clause 1, before any Google-API branch). bq.client() raises BqAccessError for bq_lib_missing/auth_failed; without explicit pass-through those fall to 'unknown' and re-raise as bare 500. - Commit 1 now emits the SAME structured response shape as commit 2 to avoid contract churn between commits. - BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments — flagged in CHANGELOG ### Changed. Editorial: - sql_origin renamed to bad_request_status with values 'client_error' / 'upstream_error' (clearer about what the parameter actually decides). bq_bad_request_user/_server kinds collapsed to bq_bad_request (400) and bq_upstream_error (502). - CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller; unaffected because new bq_access kwarg has default None. - Added unit/integration tests for the new contracts: test_translate_passes_through_BqAccessError, test_v2_scan_returns_500_on_bq_lib_missing, test_v2_schema_returns_200_with_empty_partition_on_bq_failure, test_resolve_succeeds_after_config_set. - E2E protocol now covers /schema as the fourth endpoint. - Documented functools.cache-doesn't-cache-exceptions semantics and fixture nullcontext-doesn't-close caveat for nested sessions. 
* docs(spec): #134 revision 4 — incorporates third-round review Third reviewer verdict: 'implementation-ready with two trivial edits'; explicitly noted prior rounds did the heavy lifting. Edits: 1. get_bq_access() module-level function instead of @classmethod @functools.cache from_config. Removes the classmethod-cache stacking footgun (different Python versions wrap differently) and gives FastAPI's dependency introspection a clean function signature. Drops the 'Do not subclass BqAccess' caveat that no longer applies. 2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample), _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all in commit 1 — preserved as-is, then migrated (still preserved) in commit 2. All three endpoints emit the same structured body shape so client parsers see one consistent contract throughout the staged rollout. No more half-rolled-out window where /sample is bare 500 while /scan is structured 502. * docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification) Bite-sized TDD tasks. 3 phases, 16 tasks total: Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints: Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan, _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options preserved untouched. CHANGELOG Fixed entries. Phase 2 (Commit 2) — BqAccess facade extraction + migration: Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up (BqProjects, BqAccessError, translate_bq_error, default factories, BqAccess class, get_bq_access module-level cached). Task 2.6 adds conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG Changed BREAKING + Internal. Phase 3 — Verification: 3.1 full pytest. 
3.2 squash into two PR-shape commits. 3.3 manual E2E on agnes-development per spec protocol → close #134. Self-review table maps spec sections to implementing tasks; no gaps. * fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict block only) with try/except for google.api_core exceptions, translating to HTTPException with a structured body shape: {error, message, details}. Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500 with no body when the SA on agnes-development hit cross-project Forbidden on serviceusage.services.use. Also fixes /sample's missing billing_project fallback (the bug 33a9964 fixed for /scan never landed here). Status code split: - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is user-derived from req.select/where/order_by. - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is server-constructed from validated identifiers. - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in error message (with hint pointing at data_source.bigquery.billing_project), else bq_forbidden. Body shape matches what the upcoming BqAccess refactor (next commit) will produce, so client-side parsers see one consistent contract throughout the staged rollout. _fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty contract is intentional and survives into the refactor; /schema continues to return 200 with empty partition info when partition queries fail. Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema endpoints exist only to make the test pattern (monkeypatching whole _fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal once BqAccess centralizes translation. 
* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine Extracts the duplicated BigQuery-access pattern (project resolution + client construction + DuckDB-extension session + Google-API error translation) into connectors/bigquery/access.py. Migrates four call sites to use it: - app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan - app/api/v2_sample.py — _fetch_bq_sample - app/api/v2_schema.py — _fetch_bq_schema (strict translation), _fetch_bq_table_options (preserves swallow-all best-effort contract) - src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg The new module exposes: - BqProjects (frozen dataclass: billing + data project IDs) - BqAccessError (typed exception with HTTP_STATUS class mapping) - BqAccess (facade with injectable client_factory/duckdb_session_factory for tests; defaults call the real google-cloud-bigquery + DuckDB extension) - get_bq_access (module-level @functools.cache; FastAPI Depends target) - translate_bq_error (Google API exception → BqAccessError mapper, with BqAccessError pass-through, 'serviceusage'-substring heuristic for cross_project_forbidden, and bad_request_status param distinguishing user-derived (400) from server-constructed (502) SQL) - _default_client_factory, _default_duckdb_session_factory RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only RemoteQueryEngine tests need no changes — bq_access defaults to None and get_bq_access() is only invoked on first BQ call (lazy resolution). BqAccessError raised internally is translated to RemoteQueryError( error_type="bq_error") in _get_bq_client to preserve the engine's existing public contract — CLI and /api/query/hybrid callers see no change. 
Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample, test_v2_schema) migrate from monkey-patching whole _fetch_* functions to using the new bq_access fixture in tests/conftest.py — which exercises the REAL translation path through BqAccess + translate_bq_error, closing the test gap flagged in Task 1.1's review. Side-effect behavior change: v2_sample's FROM clause now uses the data project (instance.yaml data_source.bigquery.project), not the conflated billing_project from Phase 1. Documented in CHANGELOG ### Internal. BREAKING for deployments combining BIGQUERY_PROJECT env var with data_source.bigquery.project in instance.yaml — env var now overrides data project too. See CHANGELOG ### Changed. Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py, scripts/duckdb_manager.register_bq_table) explicitly out of scope; tracked as follow-up. Removed stale docstring at the previous src/remote_query.py:204 that referenced scripts.duckdb_manager._create_bq_client as the default BQ client factory (RemoteQueryEngine never actually used that function). Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_.py + tests/test_remote_query.py migrated to bq_access fixture (counts unchanged or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures (DB migration tests with unrelated internal_roles DependencyException — not introduced by this PR). fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed) CI on PR #138 caught: bigquery.Client(...) resolves Application Default Credentials at construction time; without ADC (CI without SA key, dev laptop without 'gcloud auth application-default login') it raises google.auth.exceptions.DefaultCredentialsError synchronously. Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError propagated as raw exception — and from production endpoints would surface as bare 500 (the exact failure mode #134 sets out to fix). 
Now translates to BqAccessError(kind='auth_failed', details.hint='Run gcloud auth application-default login...'). Endpoint catch chain returns HTTP 502 with structured body. Adds unit test test_raises_auth_failed_on_default_credentials_error. Third-round spec review flagged this case in passing; the fix didn't land. CI's auth-less environment surfaced it. * fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends breaks all v2 endpoints for non-BigQuery instances'. Pre-fix: get_bq_access() raised BqAccessError(not_configured) when neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set. Because FastAPI resolves Depends() BEFORE the endpoint body runs, this exception fires during dep-injection — the endpoint's try/except BqAccessError clause never gets a chance to catch it. Result: every v2 request on Keboola-only or CSV-only instances returned bare HTTP 500, even for local-source tables that never touch BigQuery. Fix: get_bq_access() now returns a sentinel BqAccess with empty BqProjects and factories that raise BqAccessError(not_configured) on actual use. Construction succeeds, FastAPI's dep-injection cleanly yields the sentinel, the endpoint runs. The local-source code path in build_sample / build_schema / etc. never calls bq.client() or bq.duckdb_session() (it reads parquet directly), so non-BQ tables return 200 as before. Only when an endpoint actually tries to query BQ (source_type == 'bigquery') does the sentinel raise — and the endpoint's existing except BqAccessError catches it normally, returning structured 502 with hint. Test get_bq_access::test_raises_not_configured_when_neither_set renamed and rewritten to test_returns_sentinel_when_neither_set: asserts BqAccess is returned, then asserts client() and duckdb_session() each raise BqAccessError(not_configured) on call. 
Test test_does_not_cache_exceptions removed (no longer applicable) and replaced with test_sentinel_is_cached_per_process documenting the operator-restart-on-config-change contract. * docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule) Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained customer-specific deployment hostnames, deployment names, and a GCP project ID that violated CLAUDE.md's vendor-agnostic OSS rule ('Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies'). Replacements: agnes-development.groupondev.com -> <your-agnes-host> agnes-development -> <your-dev-instance> prj-grp-dataview-prod-1ff9 -> <your-data-project> s1_session_landings -> <bq_table_id> E2E verification semantics unchanged — operators still run the same four curls + config flip + retry, just substituting their own host / deployment name / project / table. * fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at process level, so it captures BigQuery project IDs at first call and ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints re-read get_value() on every request, so admin /api/admin/server-config saves (which call instance_config.reset_cache()) hot-reloaded the BQ project. Without this fix, my refactor silently regresses that contract — operators editing instance.yaml via the admin UI would see no effect on v2 endpoints until container restart. instance_config.reset_cache() now also calls connectors.bigquery.access.get_bq_access.cache_clear() (lazy import, swallowed if connectors module isn't loaded — keeps instance_config usable in isolated unit tests). Adds test_instance_config_reset_cache_invalidates_get_bq_access as regression guard. 
Updates CHANGELOG Internal entry to mention the hot-reload contract + the not-configured sentinel behavior (round-3 fix from Devin BUG_0001 was previously only in commit message). * fix(bq_access): surface not_configured before identifier validation + plan path genericize Devin BUG_0001 + BUG_0002 round 5 on PR #138. BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md vendor-agnostic rule. Replaced with '<worktree-root>' placeholder. BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel BqAccess (BQ not configured), the empty bq.projects.data was reaching validate_quoted_identifier first and raising ValueError -> endpoint mapped to HTTP 400 'unsafe_identifier' instead of structured 500 'not_configured' with hint. Each fetch helper now checks 'if not bq.projects.data: bq.client()' as the first step, which triggers the sentinel's BqAccessError(not_configured). Endpoint catches the typed error and returns HTTP 500 with hint pointing at data_source.bigquery.project. Best-effort _fetch_bq_table_options returns {} silently in this case (preserves the swallow-all contract). * fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery extension is a C++ plugin making its own HTTP calls — when BQ returns 403, it throws duckdb.IOException with the BQ error embedded as text, not gax.Forbidden. translate_bq_error's isinstance checks would miss these, falling to case 7 → bare 500 in production for v2_scan, v2_sample, and v2_schema (the bigquery_query() paths). Fix: last-resort string-match heuristic before the re-raise. 'Forbidden' / '403' / 'Bad Request' / '400' in the lowercased message classifies via the same kind hierarchy. The 'serviceusage' substring still distinguishes cross_project_forbidden from bq_forbidden. Specific enough that random exceptions without HTTP-error keywords still re-raise. 
Adds 4 unit tests covering the new heuristic + the 'don't swallow random exceptions' invariant. * chore(release): cut 0.22.0 PR #138 contains issue #134 user-visible behavior changes: - BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml data_source.bigquery.project for v2 endpoints (previously RemoteQueryEngine billing only). - Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate, /schema when BigQuery raises Forbidden/BadRequest (was bare 500). - Internal: BqAccess facade refactor unifying four duplicate BQ-access call sites; instance_config.reset_cache() now invalidates BqAccess cache too so admin server-config saves hot-reload BQ project IDs. Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.	2026-04-30 10:11:20 +02:00
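Note on 83adf01bde: the last-resort string-match classification described in the commit above might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the translate_bq_error implementation. SketchBqError and classify_by_message are hypothetical names; only the kind strings, the status codes, and the 'client_error' / 'upstream_error' values come from the commit text.

```python
class SketchBqError(Exception):
    """Hypothetical stand-in for the typed BqAccessError in the commit."""

    def __init__(self, kind, http_status, message):
        super().__init__(message)
        self.kind = kind
        self.http_status = http_status


def classify_by_message(exc, bad_request_status="upstream_error"):
    """Map an arbitrary exception to a typed error via its message text."""
    # Already-typed errors pass through unchanged.
    if isinstance(exc, SketchBqError):
        return exc
    text = str(exc).lower()
    if "forbidden" in text or "403" in text:
        # 'serviceusage' marks the cross-project billing case.
        kind = "cross_project_forbidden" if "serviceusage" in text else "bq_forbidden"
        return SketchBqError(kind, 502, str(exc))
    if "bad request" in text or "400" in text:
        if bad_request_status == "client_error":
            # SQL was user-derived, so the caller is at fault.
            return SketchBqError("bq_bad_request", 400, str(exc))
        # SQL was server-constructed, so treat it as an upstream failure.
        return SketchBqError("bq_upstream_error", 502, str(exc))
    # No HTTP-error keyword: do not swallow unrelated exceptions.
    raise exc
```

A call site that builds SQL from user input would pass bad_request_status="client_error"; one that builds SQL from validated identifiers would keep the default.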
minasarustamyan	4ec5ff44dd	feat(setup): cross-platform TLS bootstrap + marketplace plugin install (#137 ) Bootstraps the Agnes Claude Code marketplace + RBAC-allowed plugins from the dashboard CTA, and inlines the server's TLS cert when the chain isn't publicly trusted (self-signed / private CA). Cross-platform setup prompt covers Windows Git Bash, macOS, Linux. Includes Bun-compiled `claude` fix (macOS goes via git-clone fallback, same as Windows), PAT stripping after clone, explicit error handling, and four rounds of Devin Review fixes (phantom step references, $PLATFORM re-detection, heredoc/awk line-count sync). Cuts 0.21.0. See CHANGELOG.md [0.21.0] section for details.	2026-04-30 08:56:45 +02:00
Vojtech	38f6b639d2	feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) Cuts release 0.20.0. ## Highlights - X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation) - Error pages (HTML + JSON 500) surface request_id for support tickets - Dev debug toolbar gated by DEBUG=1: fastapi-debug-toolbar with custom DuckDBPanel - Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls - Telegram bot drops bot.log file, stdout only (BREAKING) ## Devin findings addressed - BUG_0001: .env.template no longer claims FastAPI debug=True - BUG_0002: subprocess extractor logs INFO to stderr again - ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before) - BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production - BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor; follow-up to update CLAUDE.md if needed See CHANGELOG [0.20.0] for full notes.	2026-04-29 22:54:21 +02:00
ZdenekSrotyr	b7a1795834	feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135 ) Bundles 4 issues: - #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators) - #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation) - #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS) - #89 — OpenMetadataClient verify=True default (BREAKING for self-signed) Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.	2026-04-29 22:06:30 +02:00
minasarustamyan	953bd9d250	fix(marketplace): use plugin.json name in synth marketplace.json (#133 ) Closes the /plugin UI 'Plugin <X> not found in marketplace' bug. Synth marketplace.json catalog 'name' now reads from <plugin_dir>/.claude-plugin/plugin.json (with fallback to upstream marketplace.json name). On-disk plugins/<slug>-<plugin>/ layout preserved so cross-marketplace files don't collide. /marketplace/info exposes both name and prefixed_name (BREAKING — downstream tooling parsing 'name' for the slug-prefixed form must switch to prefixed_name).	2026-04-29 19:25:57 +02:00
minasarustamyan	c940593a90	feat(auth): Google Workspace group prefix filter + system mapping (#131 ) Three new env vars wire the Google OAuth callback to a configurable Workspace prefix and route admin/everyone Workspace groups onto the seeded system rows: AGNES_GOOGLE_GROUP_PREFIX, AGNES_GROUP_ADMIN_EMAIL, AGNES_GROUP_EVERYONE_EMAIL. Login gate redirects users with no prefix-matching group to /login?error=not_in_allowed_group. BREAKING: auto-Everyone membership for new users removed. Admin UI/API are read-only on Google-managed groups. See docs/auth-groups.md.	2026-04-29 14:08:04 +02:00
ZdenekSrotyr	82c5d71d63	feat(memory): #62 — duplicate hints + tree-view + bulk-edit (#126 ) Issue #62. Tree view with cross-axis filtering, duplicate-candidate hints (Jaccard score on entity overlap), bulk-edit endpoints (PATCH /api/memory/admin/{id} + POST /api/memory/admin/bulk-update), schema v17 (knowledge_item_relations), full CLI parity (da admin memory tree/edit/bulk-edit/duplicates list/resolve).	2026-04-29 13:55:15 +02:00
ZdenekSrotyr	1824b9dd9c	feat(admin): #108 M1 — BigQuery table registration in UI + CLI (#119 ) Issue #108 Milestone 1. Adds BigQuery table registration via /admin/tables UI and `da admin register-table` CLI without hand-editing table_registry. POST /api/admin/register-table/precheck for round-trip validation. --dry-run flag on CLI. Audit-log entries on register/update/unregister. PUT /api/admin/registry/{id} now preserves registered_at (closes #130).	2026-04-29 13:18:31 +02:00
ZdenekSrotyr	995e4cd366	fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127 ) * fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip: 1. The marketplaces job called src.marketplace.sync_marketplaces() in-process from the scheduler container, racing the app's long-lived system.duckdb handle. DuckDB rejects cross-process writers — every cron tick 500-ed on "Could not set lock on file ... PID 0". 2. The data-refresh + new marketplaces jobs both 401-ed on the API because SCHEDULER_API_TOKEN was never propagated by the Terraform startup script. The scheduler had no credential to authenticate with. Fix: - New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh through the app process so it inherits the existing DB connection. - Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and the scheduler is reduced to a cron clock. - New app/auth/scheduler_token.py adds a shared-secret auth path. The startup script generates a 256-bit secret on first boot, persists it across reboots, and writes it to /opt/agnes/.env. Both containers source the same .env. The app validates incoming Bearer tokens against the env var (constant-time, length-floored) and resolves matches to a synthetic scheduler@system.local user that's a member of the Admin system group. Audit-log entries from the scheduler are attributed to this user. - app/main.py seeds the synthetic user at startup so the first cron tick has a valid actor; lazy seed in get_scheduler_user covers token rotation before the next app restart. Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short secret rejection, exact-match comparison, idempotent user seeding, and lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests remain green. 
Existing VMs with .env from before this change need a one-time re-provisioning (re-run startup-script or rotate via openssl rand); documented in CHANGELOG. * fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127 Avoids the literal string 'marketplace:None' in the audit_log resource column when the bulk sync endpoint writes its summary row. * fix(scheduler): unblock event loop + per-job timeouts — Devin review #127 Two findings from Devin re-review on commit 5fbad15: 1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio event loop. sync_marketplaces() does blocking I/O (subprocess git clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes) and would freeze every concurrent request for the duration of a bulk sync. Switched to plain def so FastAPI auto-routes to the thread pool. 2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST. Bulk marketplace sync iterates the registry under a single lock with up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The scheduler then sees a timeout, doesn't update last_run, and re-fires on the next 30s tick, queueing redundant work. Per-job timeout override added to the JOBS tuple; marketplaces gets 900s (15 min), data-refresh keeps 120s, health-check 30s. * fix(auth): require_session_token rejects scheduler shared secret — Devin review #127 require_session_token gates /auth/tokens (PAT minting). Pre-fix it only rejected JWTs with typ=pat — but the scheduler shared secret is an opaque string, so verify_token() returns None, payload becomes {}, and the PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN could mint persistent PATs that survive a secret rotation. Added explicit is_scheduler_token() check before the PAT-claim check; new regression test in tests/test_auth_scheduler_token.py. 
Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392 also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR and I agree; tracking separately. * release(0.17.0): cut + clean up CHANGELOG duplicates Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint plus the deploy-shape fixes that landed since the last release tag). Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120 (v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had been under-reporting the running release). Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0 above [0.16.0] — the canonical short summaries (with GitHub-release links) already exist below 0.16.0, the long forms were leftover state from before those versions were cut and have been silently shadowed ever since.	2026-04-29 11:44:00 +02:00
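Note on 995e4cd366: the "constant-time, length-floored" shared-secret check described in the commit above could be sketched as below. This is a minimal illustration, not the code in app/auth/scheduler_token.py. The function name matches_scheduler_secret and the 32-character floor are assumptions; the commit does not state the actual minimum length.

```python
import hmac

# Assumed floor for illustration; the real threshold is not given in the commit.
MIN_SECRET_LENGTH = 32


def matches_scheduler_secret(presented: str, configured: str) -> bool:
    """Return True only if `presented` equals a sufficiently long secret."""
    # Reject an unset or too-short configured secret outright, so an empty
    # env var can never match an empty Bearer token.
    if not configured or len(configured) < MIN_SECRET_LENGTH:
        return False
    # hmac.compare_digest compares in time independent of where the
    # inputs differ, which avoids leaking the secret through timing.
    return hmac.compare_digest(presented.encode(), configured.encode())
```

Checking the floor before comparing means a misconfigured deployment fails closed instead of accepting a trivially guessable token.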
ZdenekSrotyr	61f6b8d2d5	feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120 ) Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code. ### CI/CD Pipeline - ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error) - Smoke test added to keboola-deploy.yml (was missing) - Automatic rollback on smoke test failure in release.yml - Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics - Required status checks via .github/settings.yml - Dependabot + CODEOWNERS + pre-commit hooks + ruff config ### Source Code - DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy) - Config versioning (config_version: 1 in instance.yaml, non-blocking validation) - BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH) - Post-deploy smoke test script for prod VM validation ### Test Coverage (~50 new tests) - v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git, Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes, Orchestrator failure modes Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-29 09:18:55 +02:00
ZdenekSrotyr	6752c4a53e	fix(web): restore admin nav menu items (#122 ) v13 RBAC migration nulled users.role and moved admin authority onto user_group_members. Header still gated on session.user.role == 'admin', so admin menu was hidden for everyone. Inject user['is_admin'] via is_user_admin in get_current_user; header reads session.user.is_admin.	2026-04-29 09:09:23 +02:00
ZdenekSrotyr	1baadd172e	fix(ui): render shared header full-width on corporate memory pages (#117) Move {% include '_app_header.html' %} out of .container-memory (max-width: 1000px) in corporate_memory.html and corporate_memory_admin.html so the header spans the viewport, matching dashboard.html. Page content stays constrained by the container.	2026-04-29 07:45:56 +02:00
PavelDo	e1108b6112	feat(memory): corporate memory v1+v1.5 + 0.15.0 (#72 ) Adds corporate memory v1 (verification flywheel + contradiction detection + confidence scoring) and v1.5 (audience-based distribution + per-item privacy + admin curation). Server: GET /api/memory/bundle returns mandatory + ranked-approved items within a token budget; POST /api/memory/admin/mandate accepts an audience field gated against user_group_members; /api/memory/stats uses SQL aggregation. CLI: da sync writes received items to .claude/rules/km_*.md. Verification detector extracts knowledge candidates from session JSONL files. Auto-tagging via Haiku when ai: is configured. Adapted from the v9-era branch onto v13/v14 RBAC: _is_privileged_viewer + _effective_groups now query user_group_members JOIN user_groups; require_role(Role.KM_ADMIN) replaced with require_admin (km_admin collapsed into admin). Schema v15: knowledge_items context-engineering columns + knowledge_contradictions + session_extraction_state. Schema v16: verification_evidence. Cuts release v0.15.0 (also bundles #116 /me/debug page).	2026-04-29 07:16:22 +02:00
minasarustamyan	7a06f1a585	feat(auth): /me/debug self-only auth diagnostic page (#116 ) Adds /me/debug HTML page rendering the logged-in user's own session state — decoded JWT claims (no raw token, sha256[:12] fingerprint for log correlation), group memberships with sources and bound external_id when present, resource grants effective via those memberships, and a Refetch from Google (dry-run) button that diffs a fresh fetch_user_groups call against the cached user_group_members snapshot. Gated by AGNES_DEBUG_AUTH env var (default off → 404, route existence undetectable in production). Self-only by construction: user_id is read from the validated session, never echoes raw JWT / password hash / full PAT. Tolerates v13 + v14 schemas via information_schema check on users.external_id.	2026-04-29 06:36:28 +02:00
ZdenekSrotyr	2e1dfb7553	feat(v2): claude-driven fetch primitives + 0.14.0 (#102 ) Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.	2026-04-29 01:07:19 +02:00
ZdenekSrotyr	a222f92e70	feat(admin): server configuration editor + 0.13.0 (#107 ) Adds /admin/server-config UI for editing instance.yaml from the web. Hardening: SSRF gate on data_source URLs, narrow-overlay write strategy, atomic writes, audit log with secret masking on shape changes, threading lock on read-modify-write, corrupt-overlay refusal on write side + louder log on read side, modal Promise resolution on backdrop dismiss, sentinel scrub on save (defense-in-depth client+server). Bundles Windows PowerShell wrapper from #80. Cuts release v0.13.0.	2026-04-29 00:47:23 +02:00
ZdenekSrotyr	5f6bb7a4b2	fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104 ) * fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture Security and operational hardening across three issue groups: - M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun) - C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile) - M21: Docker resource limits (mem_limit, cpus) on app + scheduler - M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server) - M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING) - M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs - C2: table_id traversal validation on /api/data/{table_id}/download - M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename - C5: reset_token removed from POST /api/users/{id}/reset-password response - C8: Startup WARNING when no user has password_hash (bootstrap window visible) - M9: Audit log on failed web form login (mirrors /auth/token endpoint) - M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch) Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90) Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety Review fixes: - Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was missing, allowing requests to AWS/GCP/Azure metadata endpoints - Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex - M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with warning log — prevents unhandled exception if DB write fails after successful token consumption - Add test for 169.254.169.254 SSRF rejection Generated with 
[Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak Address Devin Review findings on PR #104: 1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution + ipaddress module checks. The old regex patterns like `fe80:` only matched up to the first colon, missing real IPv6 addresses like `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the hostname via getaddrinfo and checks each resulting IP against ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast. 2. CLI commands broken: `da setup test-connection`, `da setup verify`, `da diagnose`, `da status` all called /api/health expecting the old format (status=="healthy", services dict). Now they call /api/health/detailed for service-level checks (with graceful fallback to the minimal endpoint when auth is not configured). 3. Temp file handle leak: _stream_to_temp returns an open NamedTemporaryFile; callers now close it before shutil.move() to prevent FD leaks until GC. Also adds IPv6 SSRF test cases (loopback, link-local, unique-local, multicast) with mocked DNS resolution for test environment independence. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): download regex blocks hyphenated IDs; document health split Address Devin Review round-3 findings on PR #104: 1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download endpoint used the strict SQL-identifier regex which does not allow dots or hyphens, but Keboola table IDs like in.c-crm.orders contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots and hyphens while still blocking path-traversal chars (/, .., \) and quote/control characters. Added test for hyphenated/dotted IDs. 2. 
Documented health endpoint split in DEPLOYMENT.md: Added Health checks & external monitoring section explaining both endpoints (minimal unauth /api/health vs authenticated /api/health/detailed) and how to wire external monitoring tools to the detailed endpoint with a PAT. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening * fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up) Devin review on the rebased PR flagged the asymmetry: magic-link verify got the atomic compare-and-swap pattern in the original M10 fix, but password reset confirm at /auth/password/reset/confirm was still using read-validate-clear. Two concurrent POSTs with the same valid reset token could both succeed in setting different new passwords (last-write-wins). Lower severity than the magic-link race because the attacker would need the reset token AND to race the legitimate user, but the asymmetry was a polish gap. Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then SELECT to verify our marker won, then proceed. Only the winner clears the marker and applies the password change. New regression test test_concurrent_reset_only_one_wins in tests/test_password_flows.py::TestResetConfirm pins the contract: two ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same token; exactly one gets 302 (password applied), the other gets 200 with 'Invalid or expired'. Sanity-checked against the pre-CAS code — both POSTs got 302 (race confirmed). --------- Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-28 19:57:30 +02:00
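The compare-and-swap consume described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses the stdlib `sqlite3` module as a stand-in for DuckDB, and the `users` table, the `reset_token` column, and the `consume_token` helper are hypothetical names chosen for the example.

```python
import secrets
import sqlite3


def consume_token(conn: sqlite3.Connection, token: str) -> bool:
    """Return True only for the single caller that wins the swap.

    Hypothetical helper; table and column names are assumptions.
    """
    marker = f"CONSUMED:{secrets.token_hex(16)}"
    # Step 1: swap the live token for our unique marker. The WHERE clause
    # matches only while the original token is still stored.
    conn.execute(
        "UPDATE users SET reset_token = ? WHERE reset_token = ?",
        (marker, token),
    )
    conn.commit()
    # Step 2: read back to see whether our marker is the one that landed.
    row = conn.execute(
        "SELECT id FROM users WHERE reset_token = ?", (marker,)
    ).fetchone()
    if row is None:
        return False  # token was unknown, expired, or another caller won
    # Step 3: only the winner clears the marker and proceeds.
    conn.execute("UPDATE users SET reset_token = NULL WHERE id = ?", (row[0],))
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, reset_token TEXT)")
conn.execute("INSERT INTO users (id, reset_token) VALUES (1, 'tok-abc')")
conn.commit()

first = consume_token(conn, "tok-abc")
second = consume_token(conn, "tok-abc")
```

Here the second call finds no row carrying its own marker, so it is rejected; a read-validate-clear sequence would have let both callers pass the validation step.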
ZdenekSrotyr	0c963b55ef	fix(rbac+auth): system-group PATCH accepts description; Google sync preserves memberships on empty Two Devin-flagged regressions on the squashed PR #106 head: 1) PATCH /api/admin/groups/{id} blanket-rejected on system groups. The repository guard at src/repositories/user_groups.py was already narrowed to "rename only" by `7147bac` (PR #110 follow-up), but the endpoint at app/api/access.py:331-343 still short-circuited with 409 "System groups are immutable" for any mutation. A description-only payload like {"description": "..."} returned 409 instead of 200 even though the repo would have accepted it. CHANGELOG entry promised the fix but the code didn't match. Endpoint now mirrors the repo contract: 409 only when payload.name is set AND differs from existing name. Same-name no-op renames are dropped before the repo call. Description-only updates flow through. 2) Google OAuth callback wiped google_sync memberships on transient API failure. fetch_user_groups is fail-soft and returns [] for both "user has no groups" and "Cloud Identity API error". The callback fed that empty list into replace_google_sync_groups, which DELETEs all rows with source='google_sync' for the user then INSERTs zero — silently wiping every Workspace-synced membership on a hiccup. Callback now skips replace_google_sync_groups when group_names is empty and logs "preserving existing memberships". Trade-off: a user whose Workspace groups were genuinely cleared keeps stale memberships until the next non-empty sync. Admin-added rows (source='admin') were already protected by source-scope and are unaffected. The previous guard against this exact regression was test_callback_empty_groups_does_not_overwrite_existing in tests/test_auth_providers.py — that test class has been skipped since v12 (asserts users.groups JSON, needs rewrite for user_group_members).	2026-04-28 14:40:27 +02:00
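The empty-result guard from point 2 of the commit above might look something like the sketch below. The names `sync_google_groups` and `fake_replace` are hypothetical, and the `(user_id, group_names)` signature assumed for `replace_google_sync_groups` is a guess for illustration; only the skip-on-empty behaviour and the log wording come from the commit message.

```python
import logging
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger(__name__)


def sync_google_groups(
    user_id: str,
    group_names: Sequence[str],
    replace_google_sync_groups: Callable[[str, List[str]], None],
) -> bool:
    """Apply a fetched group list; return True if memberships were replaced."""
    if not group_names:
        # An empty list is ambiguous: "no groups" and "API error" look the
        # same to a fail-soft fetcher, so keep what is already stored.
        logger.warning(
            "empty group list for %s: preserving existing memberships", user_id
        )
        return False
    replace_google_sync_groups(user_id, list(group_names))
    return True


# Tiny in-memory stand-in for the membership store (assumption).
store: Dict[str, List[str]] = {"u1": ["analysts@example.com"]}


def fake_replace(user_id: str, names: List[str]) -> None:
    store[user_id] = names


kept = sync_google_groups("u1", [], fake_replace)
replaced = sync_google_groups("u1", ["finance@example.com"], fake_replace)
```

The trade-off named in the commit still applies to this sketch: a genuinely emptied Workspace membership stays stale until a later non-empty fetch.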
Vojtech Rysanek	7147bac079	feat(rbac+marketplace): schema v14 FK + AGNES_ENABLE_TABLE_GRANTS + break-glass CLI Follow-up to the RBAC v13 + marketplace work in the parent commit. Addresses deferred Devin findings, gemini-flagged blockers, and adds three guard rails. == Schema v14 — FK constraints on user_group_members + resource_grants == Adds DuckDB foreign-key constraints so cascade deletes can no longer leave orphaned member / grant rows pointing at a deleted group_id (which were relying on application-level cascades up to v13). Migration is RENAME → CREATE-with-FK → INSERT → DROP, wrapped in BEGIN TRANSACTION so a partial failure rolls back without leaving the DB at a half-applied schema. == AGNES_ENABLE_TABLE_GRANTS feature flag (default off) == ResourceType.TABLE was shipped in the parent commit as listing-only — admins can record grants but runtime enforcement still flows through legacy dataset_permissions. To avoid the misleading-UX surface area, the chip is hidden from /admin/access and POST /api/admin/grants returns 422 with the env-var name in detail until the operator opts in. Existing TABLE rows in resource_grants stay listable + deletable so cleanup is never blocked. Helpers: is_resource_type_enabled(rt), enabled_resource_types(). == Break-glass admin CLI == `da admin break-glass <user>` adds the user to the Admin user_group with source='system_seed' regardless of RBAC state. Bypasses authentication — relies on filesystem access to ${DATA_DIR}/state/system.duckdb implying host-level trust. Recovery path when the operator has locked themselves out of /admin/access. == Devin round-2 fixes (deferred on b4ec4c4) == - src/repositories/user_groups.py — narrow update() guard from blocking any mutation on system groups to blocking name change only. Description edits now pass through. Endpoint pre-check stays as defense-in-depth. Prior behavior surfaced as a misleading 409 'Cannot rename a system group' on description-only PATCH. 
- app/api/access.py:delete_group — wrap cascade DELETEs + repo.delete in BEGIN TRANSACTION / COMMIT / ROLLBACK. Prevents orphan rows if any DELETE fails after the user_groups row is gone. - app/marketplace_server/{packager,router}.py — split compute_etag_for_user() from build_zip(); router resolves etag first and 304-shorts before any file read or ZIP_DEFLATED. In-process cachetools.TTLCache (default 120s, env-tunable via AGNES_MARKETPLACE_ETAG_TTL, set 0 to disable). invalidate_etag_cache() called by sync to force re-hash on content drift. == Tests == - TestTableGrantsFeatureFlag (4 cases) — endpoint exclude/include, grant rejection/acceptance under the flag. - test_v12_to_v13_finalize_rollback_on_failure — destructive: monkeypatches _seed_system_groups to raise mid-transaction, asserts schema_version stays at 12, legacy tables intact, new tables empty (rollback fired). Then restores the real function and asserts the retry succeeds. - test_update_system_group_description_allowed, test_update_system_group_same_name_no_op — repo-level coverage of the narrowed guard.	2026-04-28 14:25:13 +02:00
ZdenekSrotyr	e9d7af3cce	feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening This squashes 13 commits from ma/staging plus a small docstring translation into a single coherent unit. Three workstreams. == RBAC v13 redesign == - Drops core.viewer/analyst/km_admin/admin hierarchy and the internal_roles / group_mappings / user_role_grants / plugin_access tables. - Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry. - Two authorization primitives in app.auth.access: require_admin — Admin-group god-mode require_resource_access(rt, "{path}") — entity-scoped grants Single DB lookup per request; no session cache; no implies BFS. - /admin/access UI (single page) replaces /admin/role-mapping + /admin/plugin-access. CLI `da admin group/grant ` replaces `da admin role/mapping/grant-role/revoke-role/effective-roles`. - ResourceType.TABLE listing-only — admins can record table grants, runtime enforcement still flows through legacy dataset_permissions (migration plan in docs/TODO-rbac-data-enforcement.md). == Claude Code marketplace == - Aggregated /marketplace.zip + /marketplace.git/ (PAT-gated, RBAC-filtered, content-addressed cache via dulwich). - Admin god-mode dropped on the marketplace surface — admins curate their own view via grants like everyone else. - Bare-repo cache materializes per RBAC-filtered ETag; stale entries not pruned in this iteration (disclaimed in git_backend.py docstring). == #81 #83 #44 security/ops hardening == - #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias). - #81 Group B — Keboola extractor 3-state exit codes: 0 success / 1 total fail / 2 PARTIAL fail Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary alerting must teach it the new partial signal. - #81 Group C — schema v10 view_ownership; rejects silent overwrite of a prior connector's view name on collision. 
- #81 Group D — extractor-side identifier validation. - #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset + path-traversal fix. - #44 — entire /api/scripts/* surface is admin-only (planted-script + sandbox-bypass risk closed). == Web UI polish + deploy fix == - /admin/access: live grant-count badges (no stale snapshot revert), shared-header CSS link added to /catalog and /admin/{tables,permissions}, per-resource-type colored stripes. - docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't silently shadow sub-mounts and write state to the wrong disk. == OSS vendor-neutralization (waves 1+2) == - scripts/grpn/ → scripts/ops/. Customer-specific identifiers (project IDs, internal hostnames, dev/prod VM IPs, brand names) replaced with placeholders across code, docs, Terraform, Caddyfile, OAuth probe, and planning docs. Downstream infra repos that copied scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must update the path. == Translation == - src/repositories/user_groups.py::ensure_system docstring translated from Czech to English for codebase consistency. Co-authored-by: Mina Rustamyan <mina@keboola.com>	2026-04-28 14:25:04 +02:00
ZdenekSrotyr	ef74ec010c	fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99) Closes M14 from issue #81. Keboola extractor exits 0/1/2 (success/full-fail/partial). sync.py interprets exit 2 as PARTIAL FAILURE (data-quality alert, distinct from exit 1). Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including runtime mock subprocess (rc=0/1/2/124).	Refs #81 Group B.	2026-04-27 21:52:46 +02:00
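A caller-side reading of the three-state exit code could be sketched as below. `SyncOutcome` and `classify_exit_code` are hypothetical names, and treating any other return code (such as 124) as a full failure is an assumption of this sketch; the commit only states the meaning of 0, 1, and 2.

```python
from enum import Enum


class SyncOutcome(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    PARTIAL = "partial"


def classify_exit_code(returncode: int) -> SyncOutcome:
    """Map an extractor exit code onto the 0/1/2 contract."""
    if returncode == 0:
        return SyncOutcome.SUCCESS
    if returncode == 2:
        # Some tables synced and some did not: a data-quality alert,
        # distinct from a total failure.
        return SyncOutcome.PARTIAL
    # Exit 1 per the contract; unknown codes are assumed to be failures.
    return SyncOutcome.FAILED


outcomes = {rc: classify_exit_code(rc) for rc in (0, 1, 2, 124)}
```

Alerting that only distinguishes zero from non-zero would fold `PARTIAL` into `FAILED`, which is the situation the commit asks operators to revisit.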
ZdenekSrotyr	23be8ad46f	fix(security): #81 Group A — orchestrator attach hardening (squashed) (#95) Closes the C1 findings from issue #81 plus the round-3/4 follow-ups on the read-only query path. Both _attach_remote_extensions (rebuild path) and _reattach_remote_extensions (query path) now apply the same hard allowlists for extensions and token-env names, single-quote-escape the URL, and split built-in vs community install. The CHANGELOG bullet documents the full scope including the table_schema → table_catalog fix that made the rebuild path a silent no-op for every connector. New module src/orchestrator_security.py centralises the policy. Tests in tests/test_orchestrator_remote_attach_security.py — 28/28 pass. Refs #81.	2026-04-27 21:34:04 +02:00
ZdenekSrotyr	24e81fb671	fix(security): gate Script-API /run on admin role (#44 ) (#92 ) * fix(security): gate Script-API /run on admin role (#44) The AST + string-blocklist sandbox in `_execute_script` is defense-in-depth, not a primary trust boundary. It does not block `vars()`, `type()`, or `__class__.__bases__` introspection chains, and the string blocklist is trivially evadable via concatenation/dunder encoding. Treat the role gate as the actual barrier: only admin can run scripts. - `POST /api/scripts/run` and `POST /api/scripts/{id}/run` now require admin. - `POST /api/scripts/deploy` stays analyst-accessible (storing != executing). - Existing /run tests retargeted to admin_token; added regression tests asserting analyst → 403 on both endpoints. - CHANGELOG: BREAKING (security) bullet under Unreleased/Changed. Closes #44. * fix(security): admin-gate /deploy + harden sandbox blocklist (review #92) Reviewer of PR #92 flagged three MUST-FIXes that #44 wasn't fully closed: 1. /api/scripts/deploy still accepted analyst → planted-script attack path (analyst plants malicious source, waits for admin to /run). Now: /deploy also requires admin; the entire Script API is admin-only. 2. The "Minimum (same-day)" blocklist mitigations from issue #44 weren't applied. Added the introspection-chain dunders that the issue PoC pivots through: __subclasses__, __globals__, __class__, __base__, __bases__, __mro__, __dict__, __code__, __builtins__. Plus `vars` in BLOCKED_FUNCTIONS. Deliberately NOT adding __init__ / __getattribute__ (substring match would flag every legit `def __init__`) nor `type`/`dir` (frequent in legitimate admin scripts). Documented the trade-off inline. 3. Tests didn't cover the actual PoC payload nor non-analyst non-admin roles. Added test_run_pwn_payload_blocked parametrized over the issue's own PoC + two equivalent variants (lambda+__globals__, __mro__ traversal); these stay green only as long as the dunder list does. 
test__requires_admin tests now parametrize over (analyst, viewer, km_admin) so all three non-admin core roles are pinned at 403. Conftest extension: seeded_app now exposes viewer_token and km_admin_token as siblings to admin_token / analyst_token. CHANGELOG bullet updated to reflect /deploy gate change and new internal regression tests. 35/35 scripts tests pass locally. Refs review of #92. fix(tests): test_security TestScriptSandbox needs admin token after #44 hardening CI failure on PR #92 caught a missed test file. tests/test_security.py seeded only an analyst user and used the analyst token to drive sandbox tests. After the #44 admin-gate (deploy + run both admin-only), every sandbox test got 403 from the role gate before the AST/string check could run, so 'blocks os.system' / 'blocks eval' / etc. all failed. Fix: extend the fixture to also seed an admin user and return the admin token. Sandbox tests now reach the sandbox layer; access-control tests further down in the module continue to use the analyst that was kept around. 41/41 test_security.py tests pass locally. * fix(security): #92 round-3 — gate GET /api/scripts on admin role Devin Review caught: GET /api/scripts (app/api/scripts.py:44-51) was left on Depends(get_current_user) when the rest of the API moved to admin-only. ScriptRepository.list_all() does SELECT * FROM script_registry which returns ALL columns including 'source' (the full script body). So any authenticated user (viewer / analyst / km_admin) could read admin-deployed scripts — leak of code that may contain credentials, business logic, or admin-only operational details. CHANGELOG already says 'The entire Script API is now admin-only', which was true for /deploy, /run, /{id}/run, DELETE — just not for GET. Now consistent: every Script endpoint requires admin. Tests: - New parametrized test_list_scripts_requires_admin over (analyst, viewer, km_admin) tokens — all assert 403. 
- Updated test_list_scripts_empty in both test_scripts_api.py and test_api_scripts.py to use admin_token. 79 tests pass. Refs Devin Review of #92. * fix: cleanup unused imports, stale docstrings, and incomplete CHANGELOG - Remove unused imports: Path, List, get_current_user (ruff F401) - Trim docstrings to describe current behavior, not change history - CHANGELOG now lists GET /api/scripts among admin-gated endpoints - Remove diff-commenting inline comments from tests Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com> * fix: merge duplicate Changed sections into one per CLAUDE.md convention Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-27 21:13:56 +02:00
ZdenekSrotyr	2f783c5c0a	fix(security): close Jira webhook fail-open + path traversal (#83 ) (#93 ) * fix(security): close Jira webhook fail-open + path traversal (#83) Two related vulnerabilities: 1. Fail-open signature check: when JIRA_WEBHOOK_SECRET was unset, _verify_signature returned True and any unauthenticated POST to /webhooks/jira would run the full ingest pipeline. Now fail-closed — the handler short-circuits with 503 (operator-misconfiguration signal, distinct from 401 wrong-signature) when the secret is missing. 2. Path traversal via attacker-controlled issue_key: webhook payloads carry issue.key, which flowed unsanitized into save_issue (issues_dir / "{issue_key}.json"), download_attachment (attachments_dir / issue_key), and incremental_transform (raw_dir / "issues" / "{issue_key}.json"). A crafted webhook with issue.key="../../etc/passwd" could write outside the Jira data dir. Defense-in-depth: new connectors/jira/validation.py exposes is_valid_issue_key (whitelist regex ^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$) and safe_join_under (Path.resolve() containment check). Both are enforced at the webhook entry point AND at every filesystem boundary in the connector. Tests: - New tests/test_jira_validation.py — unit tests for both helpers (parametrized invalid keys, traversal/symlink/absolute-path cases). - Webhook tests: test_unconfigured_secret_returns_503, test_path_traversal_in_issue_key_rejected (parametrized over 10 bad keys), test_valid_issue_key_accepted. CHANGELOG: two CRITICAL Fixed bullets under Unreleased. Closes #83. * fix(security): close remaining #83 review findings — webhookEvent traversal, _handle_deletion guard, regex tightening Reviewer of PR #93 flagged four MUST-FIXes: 1. _log_webhook_event used the attacker-controlled `webhookEvent` field as a filename component without sanitization. Payload with `webhookEvent: "../../tmp/pwn"` could escape WEBHOOK_LOG_DIR. 
Now: - non-`[A-Za-z0-9_-]` runs are replaced with `_` (dot excluded so `..` cannot survive sanitization as a directory component) - length capped at 64 chars - final path routed through safe_join_under New regression test `test_webhook_event_path_traversal_sanitized`. 2. _handle_deletion (connectors/jira/service.py:530) and process_webhook_event (line 487) still used raw issue_key in path builds. Even though the webhook handler validates upstream, the "defense-in-depth at every filesystem boundary" claim required these too. Both now run is_valid_issue_key and safe_join_under guards. 3. Regex `^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$` permitted underscores in project keys. Atlassian's project-key validator does not — `A_B-1` is rejected by Jira itself. Tightened to `[A-Z0-9]` and updated tests: `ABC_DEF-1` is now invalid, added Cyrillic А-1 (lookalike), CRLF, and oversize cases to the bad-key parametrization. 4. Existing test test_deletion_of_nonexistent_issue_returns_true used `PROJ-NOEXIST` which is not a real Jira key shape. Updated to `PROJ-99999`. The test still exercises the same intent (deletion of issue with no local file is idempotent). 73/73 jira tests pass locally (test_jira_webhooks + test_jira_validation + test_jira_service + test_jira_service_full + test_jira_incremental). CHANGELOG updated to document the regex tightening and the new webhookEvent sanitization. Refs review of #93. * fix(tests): test_journey_jira tests assumed fail-open before #83 fix CI failure on PR #93 caught two journey tests that pinned the OLD fail-open contract: - test_webhook_with_no_secret_configured_accepted asserted 200 when JIRA_WEBHOOK_SECRET was unset. After the #83 fix that's a 503 (operator misconfig). Renamed to _refused and flipped the assertion. - test_webhook_empty_payload_rejected didn't set the secret, so the 503 short-circuit fired before the empty-payload 400 could. Set JIRA_WEBHOOK_SECRET in the patched Config so the test exercises the intended path. 
56/56 jira journey + webhook + validation tests now pass. * fix(security): #93 round-3 — webhook fallback format + save_issue early validation Devin Review caught two real findings: 1. Webhook handler regression: the round-2 fix extracted issue_key only from event_data['issue']['key'], but process_webhook_event has long supported a fallback 'issue_key' top-level field for certain Jira event formats (e.g. delete events historically). The handler now blocks those events with 400 before they reach the service layer. Fix: mirror process_webhook_event's fallback in the handler — try issue.key first, fall through to event_data.get('issue_key') when empty. is_valid_issue_key still validates whichever source provided the key. 2. save_issue defense-in-depth was incomplete: is_valid_issue_key ran AFTER fetch_remote_links and fetch_sla_fields had already used the unvalidated issue_key in HTTP URL construction ({base_url}/issue/{issue_key}/remotelink etc.). A future internal caller invoking save_issue directly with attacker-controlled input could trigger outbound requests with a malicious path component (limited SSRF / URL-path manipulation against the Jira API server). Fix: move the is_valid_issue_key check to immediately after the null guard, before any HTTP request or filesystem op. Webhook layer still validates upstream, this is the second layer. 66 jira tests pass. Refs Devin Review of #93. * fix(changelog): #93 round-4 — add BREAKING marker to fail-closed bullet Devin Review caught: the JIRA_WEBHOOK_SECRET fail-closed change is a behavior change for operators (response code 503 vs old 200) that existing alerting may treat differently. Per CLAUDE.md changelog discipline rule, operators grep for BREAKING before bumping the pin. Added the marker + a short note on what action operators need to take (set the env var if they haven't). Refs Devin Review of #93. * fix: #93 round-5 — null-issue crash + comment drift Devin Review caught two findings on the round-4 commit: 1. 
Pre-existing crash on null issue field: a webhook payload with {"issue": null} (rather than omitting the key) caused event_data.get("issue", {}) to return None, then issue.get("key") raised AttributeError → unhandled 500. Pre-existing but reachable. Fix: 'event_data.get("issue") or {}' normalises None to {}, then the existing fallback / validation path returns 400 cleanly. New regression test test_null_issue_field_does_not_crash. 2. Inline comment drift: the comment at line 77 documented the allowed character class as [A-Za-z0-9._-] (with dot) but the regex at line 27 excludes dot deliberately (so '..' cannot survive sanitization). Fixed the comment to match. 52 jira tests pass. Refs Devin Review of #93 round 5. * fix: #93 round-6 — process_webhook_event also normalises null issue field Devin Review caught: the webhook handler at app/api/jira_webhooks.py correctly handles {"issue": null} via 'event_data.get("issue") or {}', but process_webhook_event at connectors/jira/service.py:509 still used the bare 'event_data.get("issue", {})' which returns None on explicit null. Internal callers (anything that invokes process_webhook_event without going through the HTTP handler) would hit the same AttributeError the round-5 fix closed at the handler layer. Same one-line fix. 32 jira tests pass. Refs Devin Review of #93 round 5. * fix: #93 round-7 — issue-key regex uses [0-9] not \d Devin Review caught: Python 3's \d matches any Unicode decimal digit (Arabic-Indic ٣, Bengali ৩, Devanagari ३, …). A key like TEST-٣ would pass the regex even though it's not a valid Jira input. Tightened to [0-9] (ASCII only). Added three Unicode-digit cases to the bad-key parametrization in test_jira_validation.py to lock in the contract. Refs Devin Review of #93 round 6. * fix: #93 round-8 — use \\Z anchor not $ in issue-key regex Devin Review caught: Python's $ anchor matches before a trailing \\n, so re.match('…$', 'TEST-1\\n') returns a match. 
is_valid_issue_key returned True for CRLF-injected keys. \\Z is hard end-of-string and closes that bypass. Manual verification: is_valid_issue_key('TEST-1\\n') → False (was True before fix) is_valid_issue_key('TEST-1\\r\\n') → False is_valid_issue_key('TEST-1') → True Refs Devin Review of #93 round 7. * docs: #93 round-9 — CHANGELOG regex matches implementation	2026-04-27 19:53:55 +02:00
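Putting the review rounds above together, the two validation helpers might end up looking roughly like this. The function names come from the commit message, but the signatures, the exact final pattern (inferred from the described tightenings: no underscore, `[0-9]` instead of `\d`, `\Z` instead of `$`), and the error type are assumptions of this sketch.

```python
import re
from pathlib import Path

# Project key: uppercase ASCII letter, then up to 31 uppercase letters or
# digits; issue number: 1-12 ASCII digits. \Z is a hard end-of-string
# anchor, so a trailing newline cannot slip past as it can with $.
_ISSUE_KEY_RE = re.compile(r"^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}\Z")


def is_valid_issue_key(key: object) -> bool:
    """Allowlist check for a Jira-style issue key."""
    return isinstance(key, str) and _ISSUE_KEY_RE.match(key) is not None


def safe_join_under(base: Path, *parts: str) -> Path:
    """Join parts onto base and refuse any result that escapes base."""
    base_resolved = base.resolve()
    candidate = base_resolved.joinpath(*parts).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(base_resolved):
        raise ValueError(f"path escapes base directory: {candidate}")
    return candidate


checks = {
    key: is_valid_issue_key(key)
    for key in ("TEST-1", "TEST-1\n", "ABC_DEF-1", "TEST-\u0663", "../../etc/passwd")
}
```

Running both checks (allowlist first, containment at the filesystem boundary second) means a key that somehow passes the pattern still cannot resolve to a path outside the data directory.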
Petr Simecek	2dfb246996	release(0.11.5): post-merge follow-up — Devin review fixes + authlib warning silenced (#74) Cuts 0.11.5 with all the [Unreleased] bullets that landed on top of PR #73 between commit a899877 (the original "v0.11.4" tag in the chain) and the final merge commit on main. No new public-API surface; the user-visible payoff is that v8→v9-migrated installations work end-to-end (login flows, GET /api/users, admin nav, the new role-management REST API and its last-admin protection) and `make local-dev` startup is finally quiet. Bullets covered (full text in CHANGELOG.md [0.11.5]): - _hydrate_legacy_role re-resolves from grants on every request — fixes privilege-retention after grant revoke via the role-management API. - Dev-bypass + OAuth callback now pass user_id to resolve_internal_roles so direct grants land in the session cache (not the DB-fallback path). - GET /api/users hydrates user dicts before Pydantic validation (HTTP 500 on every migrated install) + same fix for update/delete paths so last-admin protection triggers on migrated admins. - Scheduler stopped spamming POST /auth/token 401 — the auto-fetch fallback was always broken; SCHEDULER_API_TOKEN is now the only path. - POST /auth/token / Google OAuth / password / email-magic-link all hydrate user["role"] before issuing the JWT (Pydantic 500 + wrong token payload). New TestAuthLoginFlowsPostMigration regression class. - docs/RBAC.md no longer documents the non-existent implies= keyword on register_internal_role. - _seed_core_roles now actually runs on every connect (the docstring was lying — only ran during fresh install + v8→v9). New TestSeedCoreRolesSafetyNet regression class. This commit also adds: - AuthlibDeprecationWarning suppression at app/main.py top — upstream-internal forward-compat note from authlib._joserfc_helpers, not actionable on our side. Filter is targeted by class (with a message-based fallback) so other DeprecationWarnings remain visible. 
- pyproject.toml version: 0.11.4 → 0.11.5. - CHANGELOG.md: [Unreleased] → [0.11.5] — 2026-04-27, new empty [Unreleased] skeleton appended for the next PR to land on. Tag v0.11.5 follows; keboola-deploy-v0.11.5 tag triggers the keboola-deploy.yml workflow for agnes-dev.keboola.com.	2026-04-27 02:32:18 +02:00
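The class-targeted warning filter mentioned in this release can be sketched as follows. `UpstreamDeprecationWarning` is a hypothetical stand-in for the library-specific warning class, and the message-based fallback is omitted; this is an illustration of the technique, not the code in app/main.py.

```python
import warnings


class UpstreamDeprecationWarning(DeprecationWarning):
    """Hypothetical stand-in for a library-specific warning class."""


def emit_warnings():
    # One warning from the "upstream" class, one generic deprecation.
    warnings.warn("upstream forward-compat note", UpstreamDeprecationWarning)
    warnings.warn("something we should still see", DeprecationWarning)


def collect_visible():
    """Return the messages that survive a filter targeted by class."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # Inserted ahead of the "always" rule, so it wins for the subclass only.
        warnings.filterwarnings("ignore", category=UpstreamDeprecationWarning)
        emit_warnings()
    return [str(w.message) for w in caught]
```

Filtering on the subclass rather than on `DeprecationWarning` itself is what keeps other deprecation notices visible.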
Petr Simecek	83ced81966	feat(auth): unified role management — UI + REST API + CLI + schema v9 (v0.11.4) (#73) * feat(auth): v9 schema — unified role management foundation (WIP) Tasks 1-5, 10 of the role-management-complete plan. Foundation only, follow-up commits add REST API, CLI, UI, and tests. Schema v9: - user_role_grants table: direct user → internal_role mapping (complementary to group_mappings). Drives PAT/headless auth and persists across sessions. Source field tracks 'direct' vs auto-seed. - internal_roles.implies (JSON): transitive role hierarchy. core.admin implies core.km_admin → core.analyst → core.viewer. Resolver does BFS expand at lookup time. - internal_roles.is_core (BOOL): distinguishes seeded core.* hierarchy from module-registered roles. UI renders them differently. - v8→v9 migration: ADD COLUMN, CREATE TABLE, _seed_core_roles + _backfill_users_role_to_grants, then NULL legacy users.role values. DuckDB FK constraint blocks DROP COLUMN, so the column stays as a deprecated artifact (UserRepository ignores it); the physical drop is deferred. Resolver: - Regex extended to allow dotted namespace (core.admin, context_engineering.admin), max 64 chars total. - expand_implies(role_keys, conn): BFS over implies JSON column. - resolve_internal_roles signature gains optional user_id parameter; unions group-mapping resolution with user_role_grants direct grants before implies expansion. require_internal_role: - Two-path resolution: session cache (OAuth) → DB grants (PAT/headless fallback). PAT clients now legitimately satisfy gates without the OAuth round-trip, fixing the v8 limitation where every PAT-callable admin endpoint needed require_role(Role.ADMIN) instead of require_internal_role(...). Backward-compat: - require_role(Role.X) and require_admin become thin wrappers over require_internal_role(f"core.{role}"). Implies hierarchy preserves the legacy "at least this level" semantics automatically — no per-level comparison code needed. 
- src/rbac.py helpers (is_admin, has_role, get_user_role, set_user_role, can_access_table, get_accessible_tables) all read from the resolver via _get_internal_role_keys. - UserRepository.create() and update() now mirror role changes into user_role_grants via _grant_core_role helper. Preserves API while making the new table the source of truth. - UserRepository.delete() pre-deletes user_role_grants rows (FK cascade — DuckDB doesn't auto-cascade). - count_admins() reads user_role_grants ⨝ internal_roles instead of the now-NULL users.role column. First consumer: - app/api/admin.py module-level docstring documents the v9 pattern for future module authors. Existing require_role(Role.ADMIN) callsites flow through the wrapper; no behavior change for OAuth callers, and PAT callers gain access via direct grants. Tests: full suite green (1396 passed, 6 skipped). Existing tests exercise the new pathway transparently because UserRepository.create auto-grants. New test_pat_caller_with_direct_grant_passes pins the PAT-aware contract. Schema: v9 (was v8). pyproject.toml + CHANGELOG bump deferred to the final PR-prep commit. * feat(auth): role management complete — REST API + CLI + UI + docs (v0.11.4) Unifies the legacy users.role enum with the v8 internal-roles foundation under a single model with an implies hierarchy, ships an admin UI + REST API + CLI for managing both group mappings and direct user grants, and makes require_internal_role PAT-aware so that admin endpoints work uniformly across OAuth and headless callers. REST API (app/api/role_management.py, +496 LOC): - 8 endpoints under /api/admin: internal-roles list, group-mappings CRUD, users/{id}/role-grants CRUD, users/{id}/effective-roles debug. - All gated by require_internal_role("core.admin"). Audit log on every mutation (role_mapping.created/deleted, role_grant.created/deleted). - Last-admin protection: refuse to delete the final core.admin grant (mirrors users.py:count_admins protection). 
- New UserRoleGrantsRepository in src/repositories/user_role_grants.py. CLI (cli/commands/admin.py extension, +258 LOC): - da admin role list / show <key> - da admin mapping list / create <group-id> <role-key> / delete <id> - da admin grant-role <email> <role-key> - da admin revoke-role <email> <role-key> - da admin effective-roles <email> - Everything goes through typer + PAT auth, with a --json flag and tolerance for response-shape drift. UI (admin_role_mapping.html + admin_user_detail.html + nav + user list): - New page /admin/role-mapping: internal_roles read-only table + group_mappings table with create/delete forms. - New page /admin/users/{id}: core role single-select + capabilities multi-checkbox + effective-roles debug (direct + group + expanded). - The existing user list gets a "Detail" link to the new page. - Nav link to /admin/role-mapping. Tests: +85 new tests across 4 new files: - test_schema_v9_migration.py (8) — fresh install + v8→v9 backfill + legacy column NULL semantics + unknown-role fallback + invariants. - test_api_role_management.py (33) — all 8 endpoints, happy + error paths, audit-log assertions, last-admin protection. - test_cli_admin_role.py (25 + 1 conditional) — typer subcommands, text + json output, PAT integration smoke. - test_admin_role_mapping_ui.py (9) + test_admin_user_capabilities_ui.py (10) — page rendering, auth gating, form contracts, JS hooks. Full suite: 1482 passed, 6 skipped (was 1396 → +86, no regressions). Docs: - docs/internal-roles.md complete rewrite: removed "no UI yet", added hierarchy diagram, dual-path resolution, dotted-namespace convention, admin workflow via UI/CLI/REST, refresh semantics for group mappings vs direct grants, migration notes. - CLAUDE.md schema v8 → v9. - CHANGELOG.md [0.11.4] with a BREAKING marker for users.role NULL semantics + complete Added/Changed/Removed/Internal sections. - pyproject.toml: 0.11.3 → 0.11.4. 
Sequencing: after this PR merges, Pabu rebases pabu/local-dev (PR #72) onto main; its schema migrations shift from v9/v10/v11 to v10/v11/v12. Implementation breakdown: - Sequential (me): foundation tasks — schema v9, resolver, PAT-aware require_internal_role, backward-compat wrappers, rbac refactor, UserRepository auto-grant. - Parallel sub-agents (3 worktrees, ~10 min): REST API, CLI, UI. - Sequential (me): integration, docs/CHANGELOG/version, schema tests, full-suite verification. * fix(auth): address Devin review on PR #73 — three regressions Three concrete bugs caught in Devin's PR review, all fixed in this commit. 1. users.role hydration on read (the big one): v8→v9 migration NULLs users.role for every existing user, but a long tail of read sites still inspect user["role"] directly: - app/web/templates/_app_header.html:15 — admin nav gate - app/web/templates/_app_header.html:36-37 — role badge in dropdown - app/web/router.py:319-321 — UserInfo.is_admin/is_analyst/is_privileged - app/web/router.py:489 — corporate memory is_km_admin - app/api/catalog.py:54 — admin "see all tables" bypass - app/api/sync.py:215 — admin "see all sync states" bypass Without a fix, every existing admin loses the entire admin nav (and API admin bypasses) immediately after upgrade — a serious regression. Fix: new helper _hydrate_legacy_role() in app/auth/dependencies.py maps the highest-level core.* grant back into user["role"] as the legacy enum string. Called from get_current_user() on both auth paths (LOCAL_DEV_MODE + JWT/PAT). Idempotent — skips when role is already populated. Net effect: every pre-v9 callsite keeps working transparently for both OAuth and PAT callers, with one extra DB round-trip per authenticated request (same cost as the existing PAT-aware require_internal_role fallback). 
3 regression tests in tests/test_schema_v9_migration.py: - test_hydration_recovers_role_from_user_role_grants - test_hydration_returns_highest_grant (multi-grant → highest wins) - test_hydration_falls_back_to_viewer_when_no_grants (safe fallback) 2. CLI effective-roles TypeError: API returns direct/group as List[Dict] (RoleGrantResponse-shaped), but the CLI did ', '.join(direct) which raises TypeError on dicts. Tests masked it because mocks used bare string lists. Replaced raw .join() with a _names() helper that extracts role_key from each item, falling back to str() for legacy mock shapes. 3. UI template field-name mismatch: admin_user_detail.html JS reads data.groups but the API serializes the field as group (singular, per EffectiveRolesResponse pydantic). Currently benign because the API always returns group:[], but the field would silently disappear once the group-derived view is wired up. Added data.group as the primary lookup, kept the legacy aliases for shape-drift tolerance. Full suite: 1485 passed (was 1482, +3 hydration tests), 6 skipped, no regressions. * fix(auth): Devin review #2 + UX self-service + RBAC docs rename Three threads landed in one commit because they share the same auth/role surface and CHANGELOG entry. Devin review #73 second round (2 actionable findings): - _hydrate_legacy_role no longer short-circuits on truthy users.role. The role-management endpoints (POST/DELETE /api/admin/users/{id}/ role-grants + the changeCoreRole UI flow) only mutate user_role_grants — they don't update the legacy column. The early return trusted that stale value, so a user downgraded via the new REST/UI kept role="admin" in their dict on subsequent requests, which fooled _is_admin_user_dict (src/rbac.py) and the catalog/sync admin-bypass short-circuits into retaining elevated table access even though require_internal_role correctly denied the API gates. Always re-resolves now, making user_role_grants the single source of truth on every authenticated request. 
Cost: one DB round-trip per request — same as the existing PAT-aware fallback. Pinned by test_hydration_ignores_stale_legacy_role_after_grant_revoke. - Dev-bypass (app/auth/dependencies.py) and OAuth callback (app/auth/providers/google.py) now pass user_id to resolve_internal_roles so direct grants land in session["internal_roles"] alongside group-mapped roles. Pre-fix, every admin-gated request fell through to the per-request DB fallback inside require_internal_role and the dev-bypass log line read "resolved 0 internal role(s)" for an obviously-admin user. test_session_internal_roles_populated updated to assert union. User-visible UX (also addresses local-test feedback): - HTTP 500 on /admin/users post-v8→v9 migration — UserResponse.role is required str, but legacy users.role was NULL-ed by the migration. _to_response in app/api/users.py now routes every dict through _hydrate_legacy_role; same fix lifts the silent no-op of last-admin protection in update_user/delete_user (the role-equality short-circuits would skip the count_admins guard for migrated admins). Three regression tests under TestAPIUsersPostMigration. - /profile is now a real self-service detail page for every signed-in user (not just admins). Three new server-side sections: Effective roles (resolver output as chip cloud), Direct grants (rows in user_role_grants with source label), Roles via groups (which Cloud Identity / dev group grants which role for the current user). Non-admins finally see why a feature is or isn't accessible. Admins additionally see a deep-link to /admin/users/{id} for editing their own grants. - /admin/role-mapping group-id picker. New "Known groups" panel above the create form: clickable chips for the calling admin's own session.google_groups (tagged "your group") merged with external_group_ids already used in existing mappings (tagged "already mapped"). Click a chip → fills the form. 
Empty-state copy points operators at LOCAL_DEV_GROUPS / Google sign-in instead of leaving them to guess Cloud Identity opaque IDs from memory. Operational fixes: - Scheduler log-noise: every cron tick produced a POST /auth/token 401 because the auto-fetch fallback called the endpoint with just an email (no password) and silently fell through. Removed the broken path entirely. Operators set SCHEDULER_API_TOKEN (long-lived PAT) in production; in LOCAL_DEV_MODE the dev-bypass auto-authenticates the un-tokenized request, so jobs continue to work. Docs: - docs/internal-roles.md → docs/RBAC.md (git mv preserves history). Standard industry term, more discoverable for engineers grepping for RBAC in a new repo. Restructured: Quickstart-by-role (operator / end-user / module author), step-by-step Module-author workflow with code examples (register key, gate endpoint, declare implies, write contract test), naming pitfalls, refresh semantics. CLAUDE.md gets a new "Extensibility → RBAC" section pointing contributors at the doc before they add gated endpoints. Cross-refs in app/api/admin.py + tests/test_role_resolver.py updated. Tests: 293 in the auth/role/scheduler/UI test set passed, 0 regressions. * fix(auth): Devin review #3 — login flows + RBAC docs Two new findings on commit 7d1c048, both real and addressed. Finding 1 (BUG, HTTP 500): every auth login flow loaded users via UserRepository.get_by_email and passed user["role"] straight to create_access_token, Pydantic response models, and _set_login_cookie without going through _hydrate_legacy_role. Post-v9 the legacy column is NULL for migrated users, and TokenResponse.role is a required str — so POST /auth/token raised ValidationError → HTTP 500 for any v8-admin trying to log in via password. Same root cause produced non-crashing but semantically wrong JWTs (role: null) from Google OAuth, password web flows, and email magic-link verification. 
Fix: hydrate inline in every login flow before reading user["role"]: - app/auth/router.py — POST /auth/token (the crash site) - app/auth/providers/google.py — OAuth callback (was just stale JWT) - app/auth/providers/password.py — 5 flows: JSON login, web login, JSON setup, web reset confirm, web setup confirm - app/auth/providers/email.py — centralized in _consume_token, covers both /verify endpoints New regression class TestAuthLoginFlowsPostMigration pins both the no-crash and the correct-role contracts for all four legacy levels (viewer/analyst/km_admin/admin) on POST /auth/token. Finding 2 (DOCS): docs/RBAC.md showed register_internal_role() being called with implies=[...], but the function signature is (key, *, display_name, description, owner_module). A module author copying the example would TypeError at import time. The implies field on internal_roles IS honored at runtime by expand_implies, but the registry-side write path (register_internal_role + InternalRoleSpec + sync_registered_roles_to_db) doesn't exist yet — implies is currently seeded only for the core.* hierarchy via _seed_core_roles in src/db.py. Rewrote the Implies hierarchy and Module-author workflow sections to document what's actually supported in 0.11.4 and what a future change would need to add. The "for cross-module hierarchies, register each level + grant both" pattern works today. Tests: 322 in the auth/role/scheduler/UI/password test set passed, 0 regressions. * fix(db): _seed_core_roles actually runs on every connect (Devin review #4) Devin flagged that the docstring on `_seed_core_roles` promised per-connect execution as a safety net for accidental DELETEs and in-code seed changes, but the only call sites lived inside `if current < SCHEMA_VERSION:` — so once a DB was on v9 the function never ran again, and the docstring lied. 
Picked option (b) from the review (actually call it on every startup) over option (a) (fix the docstring) because the safety net is genuinely useful: - recovery from accidental admin DELETE on internal_roles, - in-code _CORE_ROLES_SEED tweaks (display_name/description/implies) ship without a manual SQL deploy, - fresh installs and migrations stop needing their own seed call sites. Tail call gated by `get_schema_version(conn) <= SCHEMA_VERSION` so the future-version-is-noop rollback contract still holds — a v9 binary won't touch a DB that's been upgraded past v9. Test coverage: new TestSeedCoreRolesSafetyNet class (3 tests) pins the three contracts — deleted row re-seeds, mutated display_name re-syncs from in-code seed, applied_at on schema_version doesn't churn on already-current DBs. Existing TestMigrationSafety::test_future_version_is_noop still passes (verified against the gating logic).	2026-04-27 02:23:01 +02:00
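The implies expansion and the "highest core grant wins" hydration described in this commit series can be sketched roughly as below. This is a simplified illustration under stated assumptions: the `IMPLIES` dict is a hypothetical in-memory stand-in for the `internal_roles.implies` column, and the real `expand_implies` and `_hydrate_legacy_role` take a DB connection and a user dict rather than plain lists.

```python
from collections import deque

# Hypothetical in-memory stand-in for the internal_roles.implies column.
IMPLIES = {
    "core.admin": ["core.km_admin"],
    "core.km_admin": ["core.analyst"],
    "core.analyst": ["core.viewer"],
    "core.viewer": [],
}
# Legacy enum levels, ordered from lowest to highest.
LEGACY_ORDER = ["viewer", "analyst", "km_admin", "admin"]


def expand_implies(role_keys, implies=IMPLIES):
    """Breadth-first expansion of role keys through the implies graph."""
    seen = set(role_keys)
    queue = deque(role_keys)
    while queue:
        for implied in implies.get(queue.popleft(), []):
            if implied not in seen:
                seen.add(implied)
                queue.append(implied)
    return sorted(seen)


def hydrate_legacy_role(direct_grants):
    """Map the highest core.* grant to a legacy role string; default to viewer."""
    expanded = expand_implies(direct_grants)
    levels = [key.split(".", 1)[1] for key in expanded if key.startswith("core.")]
    ranked = [level for level in levels if level in LEGACY_ORDER]
    if not ranked:
        return "viewer"  # safe fallback when the user has no core grants
    return max(ranked, key=LEGACY_ORDER.index)
```

Because the role is recomputed from the grants on every call, a revoked grant cannot leave a stale elevated value behind, which is the property the second Devin review round asked for.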
Petr Simecek	6c36b26979	release(0.11.3): internal roles + external→internal group mapping (foundation) (#71 ) * feat(auth): internal roles + external→internal group mapping (foundation) Two-layer authorization model: external Cloud Identity groups (org-managed) get mapped onto internal Agnes-defined capabilities (app-managed) via an admin-curated many-to-many table. Per-request permission checks read off the session — no DB hit. Refresh requires re-login. Schema v8 — new tables: - internal_roles (id, key UNIQUE, display_name, description, owner_module, …) — app-defined capabilities like 'context_admin'. Modules self-register at import; the startup hook syncs the registry into this table (idempotent). - group_mappings (id, external_group_id, internal_role_id FK, …) — admin-managed bindings, UNIQUE(external_group_id, internal_role_id). app/auth/role_resolver.py — new module: - register_internal_role(key, display_name, description, owner_module) Module-author entry point. lower_snake_case key, immutable, validated. Same key + same fields = no-op (re-import safe); same key + different fields = ValueError so two modules can't silently overwrite each other. - sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new keys, updates drifted metadata, never deletes (preserves mappings). - resolve_internal_roles(external_groups, conn) — joins group_mappings. Sorted, deduplicated role-key list. Plugged into google_callback + dev-bypass branch in get_current_user. - require_internal_role('key') — FastAPI dependency factory; reads session.internal_roles; 403 with explicit message when missing. Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change in dev-bypass) — same semantics as session.google_groups. No admin UI yet; mappings created via repository directly until follow-up PR ships UI. 
21 new tests in tests/test_role_resolver.py: register/list, idempotency, collision detection, key-format validation; sync insert/update/no-delete; resolve empty/single/many-to-many/malformed-input; e2e via LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie inspection. Full sweep: 178/178 passed across auth + db + repo tests. (Two pre-existing test_catalog_export.py failures verified unrelated.) * fix(auth): polish review feedback — first-request dev populate + PAT doc Two follow-ups from a code-reviewer pass on the foundation commit before opening the PR: - Dev-bypass populates session["internal_roles"] on the first request after sign-in, not just when external groups change. The previous guard only resolved when groups_changed=True, which left a hole for the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[], current=None, neither write branch fires, internal_roles stays unset, and require_internal_role then 403s with no roles to check against. The OAuth callback writes session["internal_roles"] unconditionally on sign-in (even []); dev-bypass now matches that semantics. Adds a single-pass populate gated on the key being absent from the session, so subsequent same-state requests still no-op (cheap session lookup, no resolver call). - Document that internal roles are session-scoped and PAT/headless clients will get 403 from any require_internal_role(...) endpoint. Same constraint already applies to session.google_groups (PAT JWTs deliberately don't snapshot group memberships — they could change after issuance with no way to re-sign), but the doc didn't surface this — an operator pointing a CLI at a role-gated endpoint would see 403 with no clue why. New "PAT and headless requests" section spells out the constraint, the rationale, and the three escape valves (use users.role for the gate; route through OAuth; wait for the planned `da admin grant-role` CLI helper). 
54 auth tests still pass locally (21 role-resolver + 33 existing auth-provider). * release(0.11.3): cut release for the internal-roles foundation Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's [Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh empty [Unreleased] skeleton appended). Adds the matching [0.11.3] link reference at the bottom of CHANGELOG so the section heading renders as a hyperlink to the GitHub release page once the tag lands. The bullet itself is unchanged content; the rephrasing of "dev-bypass when external groups change" → "dev-bypass — populates on first request and whenever external groups change, mirroring the OAuth callback's always-write semantics" reflects the polish committed in d590579, plus the appended PAT/headless caveat pointing at the doc section that landed in the same polish pass. * fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening Round-2 polish over the internal-roles foundation, addressing Pavel's review on PR #71. No behavior change for the happy path; tightens the safety rails and makes the failure modes self-explanatory. User-visible: - require_internal_role now distinguishes "no session" (Bearer/PAT caller) from "signed in but missing role" and surfaces a PAT-specific 403 detail in the first case ("This endpoint needs an interactive (OAuth) session — Bearer/PAT tokens do not carry session-resolved roles by design"). - docs/internal-roles.md documents deactivate+reactivate as the supported "force re-resolve now" lever for users that can't be made to log out. Internal hardening: - INFO-level audit log on every successful resolve (OAuth callback + dev-bypass) so a wrong-role complaint is debuggable from the log alone. - Startup warning when SESSION_SECRET is shorter than 32 chars, matching the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden state (session.internal_roles, session.google_groups, JWTs). 
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a stray import path in production can't drop the registered capabilities. Tests: - 4 new tests in tests/test_role_resolver.py covering: stale-session contract after a mid-session mapping revoke (pin the documented limitation), PAT 403 detail wording, OAuth pipeline data flow from external groups to internal_roles, and the dev-bypass empty-list fallback when the resolver raises. CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal). CLAUDE.md schema doc bumped from v7 to v8. --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-04-26 23:49:10 +02:00
Petr Simecek	1c18cdf15f	release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70 ) * feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups empty, so group-aware UI/code paths can't be exercised on localhost without a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array matching the production {id, name} shape) populates the session on every dev-bypass request — same structure the OAuth callback writes, so mock and prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise on PAT/CLI requests; malformed input falls back to [] with a WARNING so the dev mock never breaks the dev flow. * refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate Three small follow-ups on the same dev-mock vector before merge: - Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot instead of silently logging on the first authenticated request, where it's easy to miss. - Cache the parsed result single-slot, keyed by the raw env-string. Avoids re-parsing JSON on every authenticated request without test-isolation surprises — when the env value changes, the key changes and the cache transparently rebuilds. - Stop mutating the parsed-input dicts (item.setdefault → spread-merge) so the cached list stays a fresh value on every rebuild. - Replace the try/except guard around request.session with hasattr — SessionMiddleware is always registered, the silent except was paranoid. Tests grow by a direct session-cookie inspection (decoupled from the profile template) and three startup-banner log assertions. * fix(auth): drop fragile session-decoder test + actually skip empty-target write Two follow-ups on the LOCAL_DEV_GROUPS feature before merge: - Drop test_session_holds_mocked_groups_directly. 
It manually decoded the signed session cookie via TimestampSigner + base64, hardcoding both the Starlette session-cookie format and the 14-day max_age. Starlette has changed its session encoding before (URLSafeTimedSerializer pre-0.20) and would do so again silently — the test would fail with a cryptic BadSignature, not a clear "mock is broken" signal. The remaining test_dev_user_sees_mocked_groups_on_profile already covers the same observable signal (mocked groups in /profile body) without coupling to Starlette internals. - Actually skip the session write when target_groups is empty. The previous comment claimed compare-then-write avoided spurious Set-Cookie noise on PAT/CLI requests, but on those requests session.get("google_groups") is None and target is [], so None != [] always evaluates True and the write fired anyway, marking the session dirty and re-issuing Set-Cookie on every request. Adding `target_groups and ...` to the guard makes the comment honest: empty mock now genuinely no-ops, stable browser sessions still skip via value-equality, and the only remaining write is the one that actually changes state. 33 auth tests still pass locally. * fix(auth): match production's always-write semantics for stale dev groups Devin code-review finding on PR #70: my earlier `target_groups and ...` short-circuit silently diverged from the production OAuth callback. In app/auth/providers/google.py:189-194 the callback always writes session.google_groups on each login — including [] on failure or empty token — so the session always reflects authoritative current state. The mock should match. Failure mode the previous guard left open: a developer sets LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed cookie, then the developer unsets the env var and reloads. target → [], session.get → [{...}], `if target_groups and ...` is False, no write, stale groups stay in the browser session indefinitely. Mock now lies about state until logout. 
Fix splits the guard: - target_groups truthy + value-changed → write the new mock (existing path) - target_groups falsy + non-empty stored → write [] to clear stale state - otherwise no-op (target [] + stored None/[]: no transition to record) PAT/CLI requests with no prior session still take the no-op path (target=[], session.get → None which is falsy), so the original goal of suppressing spurious Set-Cookie noise on token traffic is preserved. Tests already cover the populated and unset paths; the new clear-stale branch is correct by construction (production has the same shape) and the rare manual reset workflow. * release(0.11.2): default mocked groups in make local-dev + docs/local-development.md Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience follow-up: every `make local-dev` now boots with two sensible default mocked groups (Local Dev Engineers + Local Dev Admins on example.com), so /profile and group-aware code paths render something realistic without the operator having to discover and set LOCAL_DEV_GROUPS. Layered so the default lives in the workflow, not the contract: - scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":=" syntax — only sets the var when the operator hasn't already. Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable: LOCAL_DEV_GROUPS= make local-dev. - docker-compose.local-dev.yml swaps the commented JSON example for a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the shell, the compose file just propagates it. Operators running `docker compose up` directly without the wrapper script get an empty mock (correct: they didn't opt into the make-driven defaults). - Makefile help line mentions the mocked groups so the behavior is visible without grepping. 
New docs/local-development.md consolidates dev-onboarding instructions that were previously scattered across docker-compose.local-dev.yml inline comments, docs/auth-groups.md "Local-dev mock" section, the Makefile help text, and CLAUDE.md "First-Time Setup". Single page now covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking controls + verification, what is not mocked (Cloud Identity, real OAuth, admin Workspace permissions), and the safety rails that keep the dev shortcuts off production. Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts [Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased] skeleton. * fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion Reported by an operator running `make local-dev` against the freshly released 0.11.2 — the LOCAL_DEV_MODE banner showed: LOCAL_DEV_GROUPS is not valid JSON, ignoring: Expecting ',' delimiter: line 1 column 70 (char 69) LOCAL_DEV_GROUPS is set but produced no valid groups — check the WARNING above for the parse error. Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter expansion. Bash matches `}` to close the expansion at the first `}` encountered in the body, regardless of context — even one inside a nested JSON object literal. The two-element JSON array was therefore truncated to the first group's closing brace, leaving an unparseable fragment: [{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers" There is no escaping syntax for `}` inside parameter expansion (the backslash escapes I had only escaped the quotes — `}` reaches bash literally). Fix: hold the default in a single-quoted variable and reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`. The variable's value is opaque to the expansion — no `}` matching inside it — so the JSON survives intact. 
Verified with `python -m json`: parsed OK: 2 groups: ['local-dev-engineers@example.com', 'local-dev-admins@example.com'] Operators on a running 0.11.2 stack: `make local-dev-down && make local-dev` to pick up the corrected default. * fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link Two follow-ups from a Devin code-review pass on PR #70: - run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to ${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form substitutes the default when the variable is unset OR set-but-empty, silently overwriting the documented disable knob. Three places promise this works — docs/local-development.md, the CHANGELOG entry, and the script's own comment — so the bug was an operator-facing lie, not just an implementation detail. The bare - form only substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now reaches the Python parser as "" and short-circuits to []. Verified with both empty and unset shells. - CHANGELOG.md: add the [0.11.2] link reference at the bottom. Keep-a-Changelog convention is to mirror every version heading with a release-tag link in the footer; the 0.11.2 heading was missing its counterpart, breaking the Markdown link rendering on GitHub. --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-04-26 16:48:55 +02:00
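The three-branch session guard that the LOCAL_DEV_GROUPS fixes converge on can be sketched as a pure function. The function name and the `(should_write, value)` return shape are assumptions for illustration; the actual code mutates `request.session` inside the dev-bypass branch.

```python
def next_session_groups(target_groups, stored_groups):
    """Decide whether the dev mock should write session groups.

    Returns (should_write, value_to_store).
    """
    if target_groups and target_groups != stored_groups:
        # New or changed mock: write it.
        return True, target_groups
    if not target_groups and stored_groups:
        # Mock was unset but stale groups remain in the session: clear them.
        return True, []
    # Nothing to record (for example a PAT/CLI request with no prior session).
    return False, stored_groups
```

The second branch is the one added after the review: without it, groups written earlier would stay in the signed cookie after the env var is unset.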
Petr Simecek	c25fd41bf7	feat(auth): Google Workspace groups on /profile + tag-triggered Keboola deploy workflow (#56 ) * feat(auth): display Google Workspace groups on /profile - Request cloud-identity.groups.readonly scope in Google OAuth - Fetch groups via Cloud Identity API after callback; tolerate 4xx (non-Workspace tenants) and network errors — never break login - Store result in Starlette session as google_groups - Replace /profile redirect with a real profile page rendering account details (email, name, role) and the group list; show a friendly empty state when no groups are available - Tests: helper parsing + 403 + exception paths; profile page smoke test; updated the old redirect test * test: remove stale /profile redirect tests Cherry-pick of Zdeněk's 4f7e4cd ("display Google Workspace groups on /profile") replaces the /profile redirect with a real profile page — but only updated one of three tests that expected the old behaviour. These two tests in test_admin_tokens_ui.py and test_pat.py were left asserting `/profile → 302 /tokens`, which now returns `/profile → 302 /login?next=%2Fprofile` for unauth users (the standard auth guard) or `/profile → 200 HTML` for authenticated users. Removed both rather than patched — coverage for the new behaviour already exists in tests/test_auth_providers.py (added by the same commit). The /tokens render assertions in the deleted test_pat.py case are redundant with test_admin_tokens_ui.py's own /tokens UI tests. * fix(auth): Google groups search query needs parent + labels predicates Cloud Identity Groups Search API returns 400 INVALID_ARGUMENT when the CEL query lacks the required `parent == 'customers/<id>'` predicate AND a `'<label>' in labels` membership predicate. Zdeněk's original 4f7e4cd query had only `member_key_id == '<email>'` — every fetch silently returned [] and the /profile groups list was always empty. 
Fix: build the query with all three required pieces: parent == 'customers/my_customer' (alias = caller's own Workspace org; no need to look up customer ID) member_key_id == '<email>' (filter to this user's memberships) 'cloudidentity.googleapis.com/groups.discussion_forum' in labels (Workspace mailing-list groups — the common case; security-group coverage is a follow-up) Also: log the full error body (not truncated to 200 chars) and the query string so the next time Google rejects something we can diagnose in one log line instead of a re-deploy. Caught when first agnes-dev login completed normally (HTTP 302) but app log showed `Google groups fetch returned 400 for petr@keboola.com: {"error":{"code":400,"message":"Request contains an invalid argument."}}` on the same VM (kids-ai-data-analysis / agnes-dev.keboola.com). Reference: https://cloud.google.com/identity/docs/reference/rest/v1/groups/search * feat(web): add Profile link to user dropdown menu The /profile page (Zdeněk's 4f7e4cd cherry-pick) renders a real profile view including Google Workspace groups, but had no entry point in the UI — users could only reach it by typing the URL manually. Add a "Profile" menu item between the user header (email + role) and "My tokens" so the page is discoverable. Side effect: cleaned up the leftover `or _path.startswith('/profile')` condition on the "My tokens" active class, which dated from the old /profile → /tokens redirect (removed in c789617). Now each menu item owns its own active state. * fix: profile-link tests + .env quoting for CADDY_TLS Two issues caught by Keboola's first agnes-dev deploy + agnes-auto-upgrade cron run: 1. tests/test_web_ui.py — two negative assertions ("href=/profile" NOT in body) date from when /profile was a redirect-only stub. Now /profile is a real page (groups display) AND has a dropdown menu link, so the negative assertions flip to positive. Same for ">Profile<" text in the non-admin nav test. 2. 
startup-script.sh.tpl — CADDY_TLS line must be QUOTED in .env, because agnes-auto-upgrade.sh sources .env via `set -a; . .env; set +a` and bash treats `KEY=value with spaces` as `KEY=value` followed by `with` and `spaces` exec attempts. Symptom: cron log spam `/opt/agnes/.env: line 14: petr@keboola.com: command not found`, the cron exits non-zero, and no auto-upgrade ever happens. Caddy itself reads the value fine because docker-compose env_file=.env parses key=value properly without shell-evaluating the rest. Fix: emit `CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`. Both the cron source and docker-compose env_file accept the quoted form; cron stops failing. * fix(auth): use searchTransitiveGroups + security label for non-admin user Three bugs in the original cherry-pick + my prior fix attempt, all caught by a stdlib probe script (scripts/debug/probe_google_groups.py) run locally with a Playground-issued OAuth token: 1. Wrong endpoint. `groups:search` is the admin "find groups in org" endpoint and 400s for non-admin users regardless of query. Switched to `groups/-/memberships:searchTransitiveGroups` which is the user-perspective "what groups am I in" endpoint. 2. Wrong label. Querying with `cloudidentity.googleapis.com/groups.discussion_forum` returns 403 "Insufficient permissions to retrieve memberships" even on the new endpoint — Workspace policy denies non-admin reads of discussion-forum groups. Switching to `groups.security` returns 200 with the actual membership list. Empirically every Workspace group at Keboola carries BOTH labels, so the security filter sees the full set anyway. Confirmed with the probe script. 3. Wrong response shape. `searchTransitiveGroups` returns {"memberships": [...]}, not {"groups": [...]}. Parser updated accordingly. Also adds scripts/debug/probe_google_groups.py — stdlib-only standalone probe that hits 6 candidate endpoints with a user OAuth token. 
Saved a deploy cycle (~10 min) per query iteration; future API-syntax debugging should start there. Verified end-to-end: petr@keboola.com login on agnes-dev returns 5 groups (LIC-1PASSWORD, ROLE_ATLASSIAN_, etc.) via the probe; once deployed, the same will populate session["google_groups"] and render on /profile. test(auth): update Google groups parser fixture to match searchTransitiveGroups shape Mock payload was `{"groups": [...]}` (the shape `groups:search` returns). After switching to `groups/-/memberships:searchTransitiveGroups` in the prior commit, the actual response is `{"memberships": [...]}` and the parser iterates that key. Test now mirrors the real shape. The per-item structure (groupKey.id + displayName) is unchanged, so the expected output dict stays the same: [{"id": "...", "name": "..."}]. * docs(auth): add docs/auth-groups.md — Google Workspace groups runbook Captures the non-obvious bits: the GCP-side setup checklist (Cloud Identity API + scope on consent screen + Internal user type), the `security` vs `discussion_forum` label trap (the latter 403s for non-admins, the former 200s — one of those is a 4-iteration debug session and shouldn't have to be repeated), where groups are stored (session, not DB) and how to refresh (re-login), plus how to use the probe script for future API-syntax issues. Deliberately stops short of explaining "what is Cloud Identity" or "what is OAuth scope" — those belong in Google's own docs, not ours. 
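The response-shape change described above (`{"memberships": [...]}` instead of `{"groups": [...]}`, with `groupKey.id` and `displayName` per item) can be sketched as follows. The function name `parse_transitive_groups` is hypothetical, and both skipping items without an id and falling back to the id when `displayName` is missing are assumptions, not behaviour confirmed by the commit.

```python
def parse_transitive_groups(payload):
    """Turn a searchTransitiveGroups-style response into [{"id", "name"}]."""
    groups = []
    # The endpoint wraps results in "memberships", not "groups".
    for item in payload.get("memberships", []):
        group_id = (item.get("groupKey") or {}).get("id")
        if not group_id:
            # Assumption: items without a group key are skipped.
            continue
        # Assumption: fall back to the id when displayName is absent.
        name = item.get("displayName") or group_id
        groups.append({"id": group_id, "name": name})
    return groups


# Sample payload with made-up values, shaped like the test fixture.
sample = {
    "memberships": [
        {"groupKey": {"id": "eng@example.com"}, "displayName": "Engineering"},
        {"groupKey": {"id": "ops@example.com"}},
    ]
}
print(parse_transitive_groups(sample))
```

A payload in the old `{"groups": [...]}` shape yields an empty list here, which matches the silent-empty symptom the fixture update guards against.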
* docs(claude): document release workflows + module versioning + recreate trick New "Release & deploy workflows" section in CLAUDE.md covers what didn't exist anywhere in the repo before: - Distinction between release.yml (auto-build per push) vs the new keboola-deploy.yml (tag-triggered, explicit deploy only) — plus when to use which (per-developer convenience vs shared dev VM safety) - Module versioning (infra-vX.Y.Z) and the bump-after-merge dance - The lifecycle.ignore_changes [metadata_startup_script] gotcha and how to force a recreate via workflow_dispatch's recreate_targets input All generic — no customer hostnames, project IDs, IPs. Customer-specific deploy steps belong in the consuming infra repo's README. Also: cross-reference docs/auth-groups.md from the Authentication section so future Claude sessions find the Workspace-groups runbook without grepping. --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-26 00:56:44 +02:00
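The `.env` quoting fix from this commit (`CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`) can be illustrated with a small sketch. The helper `env_line` is hypothetical, and `shlex.split` is used only as a stand-in for the shell's word splitting when the file is sourced; rejecting `$`, backticks and newlines is a simplifying assumption of the sketch.

```python
import shlex


def env_line(key, value):
    """Render KEY="value" so the line survives being sourced by a shell."""
    # Assumption: values needing shell expansion rules are out of scope.
    if any(ch in value for ch in ("$", "`", "\n")):
        raise ValueError("value contains characters this sketch does not handle")
    # Escape backslashes first, then double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{key}="{escaped}"'


quoted = env_line("CADDY_TLS", "tls admin@example.com")
unquoted = "CADDY_TLS=tls admin@example.com"

# Quoted: one word, so the whole string is a single assignment.
print(shlex.split(quoted))
# Unquoted: two words, so a shell treats the second as a command name.
print(shlex.split(unquoted))
```

The second split shows why the cron job logged `command not found`: the e-mail address becomes a separate word after the assignment.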


331 commits