agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	226ec9e189	Merge pull request #197 from keboola/fix/bigquery-extension-timeout fix(bigquery): apply bq_query_timeout_ms on every BQ attach + surface silent failures	2026-05-06 10:00:52 +02:00
ZdenekSrotyr	d68c3c5fa2	release: 0.38.2 — bq_query_timeout_ms applied on every BQ attach + surfaced silent failures	2026-05-06 09:48:12 +02:00
ZdenekSrotyr	cd90d9dfa3	Merge remote-tracking branch 'origin/main' into pr197-review	2026-05-06 09:47:39 +02:00
ZdenekSrotyr	f33e78a85a	Merge pull request #196 from keboola/docs/marketplace-setup-fallback docs(marketplace): document two-step fallback for marketplace registration	2026-05-06 09:42:34 +02:00
ZdenekSrotyr	a7d19206d7	release: 0.38.1 — docs(marketplace) two-step fallback	2026-05-06 09:27:42 +02:00
Vojtech Rysanek	32c8ea601a	fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures The DuckDB BigQuery extension defaults bq_query_timeout_ms to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. Agnes already has apply_bq_session_settings() that bumps it to 600 s (configurable via data_source.bigquery.query_timeout_ms), but two regressions let the 90 s default leak through to live queries: 1. apply_bq_session_settings() swallowed every Exception silently. If the BigQuery extension wasn't loaded on the connection yet, or the installed extension version didn't recognise the setting, the SET would fail and the function would return without surfacing the problem. Operators saw 90 s timeouts on 'agnes query --remote' with no log line explaining why. 2. The call sites in src/db.py:_reattach_remote_extensions and src/orchestrator.py:_remote_attach only invoked apply_bq_session_settings on the metadata-token branch (token_env empty, the BqAccess contract). The token-based and no-auth branches ran ATTACH against the BigQuery extension without ever applying the timeout setting — so any BQ source registered with an explicit token_env, or with no auth env at all, fell back to the 90 s default. Fix: - apply_bq_session_settings now logs WARNING on each failure path (instance_config import error, non-numeric value, SET execution failure, readback error). It also verifies the setting actually landed via SELECT current_setting('bq_query_timeout_ms') and logs WARNING when the readback disagrees with the requested value, which catches the silent-ignore case some extension versions exhibit. - Both _reattach_remote_extensions (src/db.py) and _remote_attach (src/orchestrator.py) now call apply_bq_session_settings on every branch that ATTACHes a BigQuery alias, not only the metadata-token branch. Idempotent: calling it twice on the metadata-token path is a no-op SET. 
Tests: - Extended the _RecordingConn fixture to support .fetchone() so the readback assertion path works. Updated existing call-shape assertions to expect the SELECT current_setting readback alongside the SET. Added two new tests covering the WARNING surfaces for SET failure and readback mismatch — regression guards for the silent-fallback bug this PR addresses. - Full BQ-touching suite (398 tests) passes.	2026-05-06 11:24:14 +04:00
Vojtech Rysanek	abc2335ea2	docs(marketplace): document two-step fallback for marketplace registration The 'Git channel' block previously showed only the direct '/plugin marketplace add https://x:$AGNES_PAT@…' path. That path fails on macOS/Windows against a private-CA Agnes instance because Bun-compiled 'claude' ignores the OS trust store and CA env vars on the marketplace HTTPS path (see the existing rationale in app/web/setup_instructions.py). Document the two-step fallback explicitly: git clone https://x:$AGNES_PAT@agnes.example.com/marketplace.git/ \ ~/agnes-marketplace claude plugin marketplace add ~/agnes-marketplace System 'git' honors GIT_SSL_CAINFO + the OS trust store, so the clone succeeds where direct add fails; pointing Claude Code at the local clone then sidesteps the Bun TLS path entirely. The dashboard-served setup payload already branches between the two automatically based on platform; the docs now match that behavior for manual flows. Also note the optional 'remote set-url' hardening to strip the PAT from the cloned repo's origin (mirrors what the dashboard payload does).	2026-05-06 11:00:59 +04:00
ZdenekSrotyr	f598b7e2f6	Merge pull request #180 from keboola/ma/my-ai-stack feat(store): /store + /my-ai-stack — per-user marketplace composition	2026-05-06 07:41:05 +02:00
ZdenekSrotyr	6c94d2cbce	Merge remote-tracking branch 'origin/main' into pr180-review # Conflicts: # CHANGELOG.md # pyproject.toml	2026-05-06 07:27:25 +02:00
ZdenekSrotyr	f2aae1427c	Merge pull request #194 from keboola/zs/host-mount-state-dir-combined release: 0.37.0 — STATE_DIR + flat-mount overlay; host-mount direct-bind fix Folds in #191 (host-mount.yml direct binds, eliminating Docker named-volume immutability footgun) + #192 (STATE_DIR env var + flat-mount.yml overlay for parallel-disk topology) with attribution preserved (cherry-picks of @cvrysanek's commits). Three rounds of Devin Review caught: - caddy missing from host-mount.yml override (file_server bypass dead) - caddy !override in flat-mount.yml dropped data:/srv:ro + caddy_config - .env_overlay write paths hadn't been migrated to STATE_DIR - instance.yaml overlay path had same asymmetry across 3 sites - v24 migration error message hardcoded old snapshot path - customer-specific tokens in OSS docs/comments All BUG-class findings resolved. ANALYSIS-class deferred (auto-upgrade + tls-rotate hardcode host-mount.yml — operator-side limitation documented in docs/state-dir.md).	2026-05-06 07:03:25 +02:00
ZdenekSrotyr	fdc6cd7fb4	release: 0.37.0 — STATE_DIR + flat-mount overlay; host-mount direct-bind fix	2026-05-06 06:53:48 +02:00
ZdenekSrotyr	4a1916a4b0	fix: v24 migration error message points to actual snapshot path The pre-migration snapshot was correctly migrated to STATE_DIR-aware path in src/db.py:1832 (`_get_state_dir() / 'system.duckdb.pre-migrate'`), but the error message in _migrate_v24_bq_source_queries still hardcoded the old `{DATA_DIR}/state/...` shape. Under flat-mount layout (STATE_DIR=/data-state), an operator hitting the v24 migration error would look in /data/state/ for a rollback snapshot that lives in /data-state/. Devin Review on PR #194 round 3.	2026-05-05 20:13:08 +02:00
ZdenekSrotyr	df2c33147c	fix: Devin Review on #194 round 2 — 3 BUG-class findings 1. instance.yaml overlay path now matches read site under STATE_DIR. Three sites updated: - app/api/admin.py:1005 (server-config endpoint writer) - app/api/admin.py:2610 (configure endpoint writer) - app/instance_config.py:106 (overlay reader) All three now go through _state_dir() so under flat-mount layout (STATE_DIR=/data-state) the irreplaceable instance.yaml overlay lands on the state disk (sdc) instead of the regenerable data disk (sdb). Without this fix, .env_overlay correctly went to the state disk while instance.yaml went to the data disk — config would be lost if an operator wiped sdb. 2. Strip customer-specific tokens from OSS repo per CLAUDE.md vendor-agnostic rule: - docker-compose.host-mount.yml: 'a deployer (Groupon FoundryAI)' → 'a deployer in production' - docker-compose.flat-mount.yml: 'caused 2026-05-05 in the Groupon FoundryAI deployment' → generic 'production failure mode' - docs/state-dir.md: rewrote the incident reference to describe the failure mode abstractly without naming the deployment; updated the recommendation table to say 'shadow-mount class' instead of dating the specific incident. 3. Updated docs/state-dir.md 'What reads STATE_DIR' to list all read/write sites including the three migrated in this round (admin.py, instance_config.py, marketplaces.py). ANALYSIS finding (tls-rotate.sh hardcoded host-mount.yml) deferred — same operator-side class as auto-upgrade.sh hardcoded host-mount, documented limitation per the PR body.	2026-05-05 20:02:50 +02:00
ZdenekSrotyr	b6543c9c55	fix: Devin Review on #194 — 2 BUG-class findings 1. .env_overlay write paths now match read path under STATE_DIR. app/main.py:343 reads via _state_dir() (post-PR #194), but two write sites still hardcoded ${DATA_DIR}/state/.env_overlay: - app/api/admin.py:2687 — configure endpoint secrets persistence - app/api/marketplaces.py:152 — marketplace PAT persistence Under flat-mount layout (STATE_DIR=/data-state) the admin UI wrote secrets to /data/state/.env_overlay while the app read from /data-state/.env_overlay, silently dropping the value on next restart. Both write sites now go through _state_dir(). 2. host-mount.yml: caddy inherits data:/srv:ro from base, but with no service populating the data: named volume (other services switched to direct /data binds), the inherited mount points at an empty Docker volume — try_files finds nothing, every parquet download falls through to uvicorn, defeating the v0.36.0 file_server bypass under the host-mount layout. Added a caddy override that restates all mounts including a direct /data:/srv:ro bind. Mirrors the comment + treatment already in flat-mount.yml.	2026-05-05 19:47:12 +02:00
ZdenekSrotyr	a9ae5f9c35	fix(flat-mount): preserve data:/srv:ro and caddy_config:/config in caddy override; CHANGELOG The flat-mount overlay's caddy `volumes: !override` block listed only three mounts, but the base docker-compose.yml caddy service has five. `!override` (compose-spec semantics) replaces the entire list, so two mounts were silently dropped under the flat layout: - `data:/srv:ro` — Caddy's read-only view of the agnes data dir, used by the `@download` file_server handler in Caddyfile (added in v0.36.0 as the perf bypass for multi-GB parquet downloads). Without this mount, `try_files /bigquery/data/<id>.parquet …` finds no file and every parquet download falls through to the app's uvicorn worker — defeating the bypass entirely. - `caddy_config:/config` — Caddy's autosave/ACME state. Less critical (we feed certs in via /certs) but loses the autosaved adapter config across container recreates. Restated both mounts with a comment block explaining the !override caveat for any future overlay author. Plus: CHANGELOG entries for the host-mount.yml direct-bind fix and the STATE_DIR + flat-mount overlay under [Unreleased].	2026-05-05 19:29:38 +02:00
Vojtech Rysanek	a303de0372	feat: STATE_DIR env var + flat-mount overlay (parallel disks) Introduces STATE_DIR as the single source of truth for the writable state directory path, with backward-compatible default of ${DATA_DIR}/state. Pairs with a new docker-compose.flat-mount.yml overlay that mounts the state disk in PARALLEL to the data disk (rather than nested under it). Why --- The default deployment topology nests state under data: sdb at /data, sdc at /data/state. That layout has known fragility documented in docs/state-dir.md — bind-propagation gotchas, two-writer collisions on the same prefix, mount-order coupling. The 2026-05-05 incident in the Groupon FoundryAI deployment was a manifestation of the propagation gotcha. The flat layout (sdb at /data, sdc at /data-state — parallel, not nested) eliminates the nested-mount class entirely. Each disk is its own bind mount, recursive by default in modern Docker. No volume options to forget. No two-writer collision (host scripts and container app share /data-state at the same path, single namespace). What changes ------------ App code (Python): - src/db.py: new _get_state_dir() helper. get_system_db() and schema migration snapshot use it. - app/secrets.py: new _state_dir() helper. _load_or_generate() uses it for .session_secret and .jwt_secret. - app/main.py: .env_overlay loaded from _state_dir(). Host scripts: - scripts/ops/agnes-auto-upgrade.sh: STATE_DIR drives mount-sanity check and cert detection. Defaults preserve existing behavior. - scripts/ops/agnes-tls-rotate.sh: STATE_DIR drives CERT_DIR. New compose overlay: - docker-compose.flat-mount.yml: parallel /data and /data-state binds per service. Mutually exclusive with docker-compose.host-mount.yml; pick one based on disk topology. Documentation: - docs/state-dir.md: layout choice (A nested vs B flat), pros/cons, migration steps, and which code paths read STATE_DIR. 
Backward compatibility ---------------------- STATE_DIR defaults to ${DATA_DIR}/state — current behavior. Existing deployers that don't set the var see no behavior change. Migration to flat layout is opt-in per the runbook in docs/state-dir.md. Validation ---------- - bash -n on both host scripts: pass - docker compose config -f docker-compose.flat-mount.yml: resolves cleanly with all 6 services binding /data and /data-state directly - python3 import + helper exercise: STATE_DIR override works, default falls back to ${DATA_DIR}/state Companion to PR #191 (drop named-volume driver_opts in host-mount.yml). That PR fixes the immutability footgun for Layout A; this PR offers Layout B as the architectural alternative.	2026-05-05 19:28:07 +02:00
Vojtech Rysanek	655822b953	host-mount: replace named-volume driver_opts with direct service binds The previous version of docker-compose.host-mount.yml modified the 'data' named volume's driver_opts to point at /data with 'o: bind,rbind'. Docker named volumes have an immutability footgun: once a volume is created, its driver options are fixed for the life of the volume. Editing this file and re-running 'docker compose up -d' does NOT propagate the new options to existing volumes — they keep whatever options were in effect at create time. This bit a deployer (Groupon FoundryAI) on 2026-05-05: the volume was created before this overlay had bind,rbind, kept the old bind (non-recursive) propagation, and containers wrote to a shadowed subdirectory of the parent disk instead of the nested child mount. DuckDB went FATAL on a root-owned WAL during a routine container recreate; sign-in broke. Recovery required docker volume rm + manual data migration on every affected VM. Direct service-level bind mounts ('/host/path:/container/path') don't go through Docker's volume layer at all. They re-evaluate mount options every container start, and modern Docker Engine (20.10+) defaults to recursive bind for these. No options to forget, no immutable state to migrate, no shadow-mount class. Validated via 'docker compose config' merge — overlay correctly replaces 'data:/data' with bind type:none on app, extract, scheduler, telegram-bot, ws-gateway. Compose-spec version note: !override merge tag is part of the Compose Specification supported by Docker Compose v2.20+. Tested against Compose v5.1.3 used by Groupon's deployment.	2026-05-05 19:27:14 +02:00
ZdenekSrotyr	1315f9f93c	Merge pull request #188 from keboola/zs/combined-perf-and-clarity release: 0.36.0 — perf + analyst-clarity bundle BQ query timeout knob, Caddy file_server parquet bypass, parallel parquet pulls, auto-upgrade self-update, Tier 1 event-loop unblocking, clean CLI errors + init progress + skip-materialize, workspace prompt decision tree + size hint.	2026-05-05 19:22:53 +02:00
ZdenekSrotyr	e2f740d7ab	fix(changelog): consolidate duplicate Added/Changed sections in 0.36.0 Devin Review on PR #188 (15:53Z): the renamed [0.36.0] section had two separate ### Added blocks and two separate ### Changed blocks, which violates Keep-a-Changelog grouping (and CLAUDE.md's explicit 'group by section' rule). Merged each set into a single ordered block: Added, Changed, Fixed. No content removed; only reflowed.	2026-05-05 19:04:51 +02:00
ZdenekSrotyr	f33475cec3	release: 0.36.0 — perf + analyst-clarity bundle Renames the [Unreleased] section to [0.36.0] in CHANGELOG, adds the top-level summary, drops a fresh empty [Unreleased] above, and bumps pyproject from 0.35.1. Also fixes the third Devin Review finding on this PR: the CLI ReadTimeout message hardcoded QUERY_TIMEOUT_S (300s) so a 30s-default call (agnes catalog, agnes auth, …) reported a wait window that didn't match reality. _translate_transport_error now takes the actual httpx timeout from the calling helper; the BQ-job advisory only appears for calls where the timeout was set ≥ 60s.	2026-05-05 18:57:04 +02:00
ZdenekSrotyr	f2ce915458	fix: Devin Review on #188 commit `28423907` — 2 bugs 🚩 /api/v2/catalog still async def while now calling sync stat() `/api/v2/catalog` was left as `async def` when the rest of Tier 1 was converted, on the assumption it was lightweight. The new `_materialized_size_hint` populator added in this PR calls `Path.stat()` / `Path.exists()` for every visible row to bucket the parquet size — on a local FS that's microseconds, but on a network-mounted DATA_DIR (NFS / CIFS / GCS-FUSE) those syscalls can block the event loop. Convert to plain `def` so FastAPI auto-offloads to the thread pool, mirroring /api/query etc. 🔴 stream_download translates HTTPStatusError as generic transport error `response.raise_for_status()` inside the retry loop raises `httpx.HTTPStatusError` on 4xx/5xx. After retries exhaust, the new `isinstance(last_exc, httpx.HTTPError)` check at line 219 was eating the status code: HTTPStatusError is a subclass of HTTPError, so the generic transport translation produced "Unexpected error: HTTPStatusError" instead of the informative "Client error '401 Unauthorized' for url …" that callers expect. Fix: short-circuit HTTPStatusError before the HTTPError branch — it re-raises verbatim so the caller's status-code handling + the rich server error body (e.g. 401 expired token, 403 cross_project_forbidden) reach the analyst. api_get / api_post / api_delete / api_patch don't have the same bug: httpx Client.get/etc. don't raise HTTPStatusError unless the caller explicitly calls .raise_for_status(), and our wrappers don't. Only stream_download does, hence the targeted fix there.	2026-05-05 18:29:44 +02:00
ZdenekSrotyr	28423907fd	feat: clean CLI errors + init progress + skip-materialize + claude.md catalog pointer Three first-try-failure-surface fixes from Pavel's #185 trace + the template guidance question, all under PR #188's umbrella so they land together with the file_server / parallel pull / Tier 1 work. 1. CLI clean-error wrapper — new AgnesTransportError raised by the api_*/stream_download helpers when httpx times out / drops / refuses, plus a top-level Typer wrapper (cli/main.py) that prints one-line "Error: …" + actionable hint and exits non-zero. Full traceback goes to ~/.config/agnes/last-error.log for support forwarding. Unhandled Exceptions are caught at the same boundary so no Python traceback ever leaks to the analyst's terminal. Pavel's #185 Phase 3B: a 30-frame httpx traceback from a slow BQ --remote query made it look like a CLI bug. Now: clean message + hint pointing at `agnes snapshot create` / partition-column guidance. Entry point in pyproject.toml flipped from `cli.main:app` → `cli.main:_run_with_clean_errors` so the wrapper actually runs under the installed `agnes` binary. 2. agnes init / agnes pull --skip-materialize + progress bar. --skip-materialize omits query_mode='materialized' rows from the download set so a first init doesn't spend 44 minutes silently pulling a single 6 GB parquet (Pavel's #185 Phase 1). Rich-driven per-file progress bar with label/bytes/rate/ETA renders to stderr when not --quiet and not --json. Aggregates across the parallel ThreadPoolExecutor workers added earlier in this PR. 3. config/claude_md_template.txt: explicit one-line snippet pointing at `agnes catalog --json | jq '.tables[] | select(.id=="<id>")'` for per-table descriptions + restated invariant: "the description field on each catalog row is the authoritative business-rules text — re-read live, never copy into this file." 
Resolves the regression-or-feature debate between Pavel (wants annotations) and the user feedback that landed in the prior commit (don't embed table-specific content; tables change). Catalog command stays the source of truth.	2026-05-05 18:11:59 +02:00
ZdenekSrotyr	e5fb913cec	perf: Tier 1 event-loop unblocking — async def → def on BQ-bound handlers Five hottest BQ-touching endpoints were `async def` but invoked synchronous DuckDB / BQ-extension calls inside the body. Under uvicorn's single event loop that meant a single heavy `agnes query --remote` (waiting up to ~200 s for BQ's jobs.query) froze EVERY other request — /api/health, dashboard, auth, even another query — for the full BQ wait. Operators saw "VM idle, app frozen" during PR #188's testing. Convert to plain `def` so FastAPI auto-offloads the body to the anyio thread pool. Event loop stays free for non-BQ requests. - app/api/query.py:execute_query - app/api/v2_scan.py:scan_estimate_endpoint, scan_endpoint - app/api/v2_sample.py:sample - app/api/v2_schema.py:schema Audit: 0 `await` statements in any converted handler (verified file-by-file), so the rename is safe. Tests in tests/test_v2_*.py called the handlers via `asyncio.run(...)` which now fails on a non-coroutine return; swapped for direct calls (asyncio.run( -> ( ) — keeps paren balance). Plus AGNES_THREADPOOL_SIZE env var (default 200, was anyio's stock 40) in app/main.py:lifespan. Set via anyio.to_thread.current_default_thread_limiter().total_tokens. 200 is comfortable headroom for <50 concurrent analysts; bump for more. 480/480 impacted tests pass (the 2 remaining errors are a pre-existing fixture setup issue in test_reader_smoke_matrix.py unrelated to this change).	2026-05-05 17:44:08 +02:00
ZdenekSrotyr	7a72ea9c37	fix: Devin Review on #188 — try_files fallback + auto-upgrade ordering Two bugs Devin caught: 1. Caddy `try_files A B C` rewrites the URI to its LAST entry when no file matches (per Caddy docs). Without an explicit "back to original URI" fallback, a parquet missing from all three known static paths would get rewritten to `/jira/data/<id>.parquet`, and the reverse_proxy below would forward THAT rewritten URI to app:8000 → 404. The PR's documented "missed → falls through to app handler" promise didn't actually hold for legacy / future connectors. Append `/api/data/<id>/download` as the final try_files entry so the reverse_proxy receives the analyst-facing URI. 2. agnes-auto-upgrade.sh's TLS-overlay decision (which checks Caddyfile existence) ran BEFORE the config re-fetch loop. If a tick's fetch added a previously-missing Caddyfile, this tick's docker compose would still omit `--profile tls` until the next 5-min tick — a window where the recreate uses the wrong overlay set. Move the COMPOSE_FILES tls extension AFTER the fetch. Also strip the workspace prompt of table-list / metric-count enumerations (per user feedback): those are dynamic snapshots that go stale; replace with explicit "use `agnes catalog` / `agnes schema` / `agnes describe` to discover" guidance plus a note about `rough_size_hint` semantics. The Available Datasets `{% for t in tables %}` loop is gone — analysts use the live CLI instead.	2026-05-05 17:24:42 +02:00
ZdenekSrotyr	30e81a15b9	feat(workspace-prompt): decision tree + size-hint so analyst Claude gets it right first try Three concrete changes addressing the "analyst Claude misuses the CLI" class of bugs (image.png table — issues #3, #5, plus the recurrent "how big is this table" guesswork): 1. config/claude_md_template.txt — the template agnes init writes to <workspace>/CLAUDE.md. Surfaces every catalog-row field with a why, adds a query_mode-based decision tree, explicit --estimate scoping (snapshot create ONLY — was the #1 first-try error), an agnes fetch → agnes snapshot create rename note, and a 6-row failure-mode table that maps each common error wording to its right next step. 2. app/api/v2_catalog.py — populate rough_size_hint for local + materialized rows from the on-disk parquet size, bucketed small/medium/large/very_large. Was hardcoded null with a TODO; AI couldn't tell "is this 6.8 GB" without a failed --remote round-trip. 3. cli/update_check.py — the [update] banner survived the da→agnes rename and printed "[update] da X is out of date" on every command, training analysts to associate the binary with the old name. Verified by rendering the template against representative contexts (33/33 tests pass) and running every use case from the original screenshot through the real CLI against a dev VM.	2026-05-05 16:44:24 +02:00
ZdenekSrotyr	2ae486bc5d	feat(pull): parallel parquet downloads (AGNES_PULL_PARALLELISM=4 default) The download loop in cli/lib/pull.py was strictly serial — N tables took Σ stream_download(t_i). With the Caddy file_server change in this PR, the server can now sustain many parallel sendfile transfers without blocking app workers, so the client-side serialization became the new bottleneck. Switch to ThreadPoolExecutor capped by AGNES_PULL_PARALLELISM (default 4, set 1 to restore pre-PR serial). 4 matches typical home-broadband saturation without over-subscribing the analyst's NIC. Drops to serial when len(to_download) <= 1 to avoid executor overhead in the common single-table case. Per-table error semantics preserved via (tid, entry, err) tuple — a failure on one parquet doesn't abort the rest of the batch. Verified end-to-end against a dev VM with the new Caddy file_server deployed: 2-table pull through agnes CLI works under the new concurrency.	2026-05-05 16:42:55 +02:00
ZdenekSrotyr	ab61e30c91	chore(auto-upgrade): re-fetch compose + Caddyfile, self-update Sibling change to the Caddy file_server PR (#182). Without this, existing long-uptime VMs would pull the new agnes image on auto-upgrade but keep their stale Caddyfile + docker-compose.yml — leaving the file_server route + the data:/srv:ro mount inert. Confirmed live 2026-05-05 when the file_server change merged in main but stayed unreachable on a running dev VM until /opt/agnes/* was scp'd by hand. agnes-auto-upgrade.sh now hashes the bind-mounted config files (Caddyfile + every docker-compose overlay) on every 5 min tick and triggers a `docker compose up -d` recreation when the hash drifts — same trigger path as an image-digest change. Fail-soft via the .new-then-mv pattern: a curl 404 / network blip leaves the existing file untouched. Self-update at the bottom of the script: re-fetch /usr/local/bin/agnes-auto-upgrade.sh itself so the very fix that watches config files lands on running VMs without a manual ssh-and-curl cycle. Otherwise we'd have a self-perpetuating "old script problem" — the watch-config logic never propagating to the VMs that need it. Operators no longer need to ssh + scp Caddyfile/compose changes.	2026-05-05 16:42:13 +02:00
ZdenekSrotyr	1be997f6d4	feat(caddy): file_server for parquet downloads — bypass uvicorn A single analyst's multi-GB `agnes pull` held the only uvicorn worker for the duration of the stream, starving UI / /api/health / every other API endpoint. Container flipped to `unhealthy`. Triggered while a 6.8 GB `order_economics` pull was in-flight on prod 2026-05-05. Caddy now intercepts `GET /api/data/{table_id}/download` and serves the parquet directly via sendfile from the data volume (mounted r-o at /srv inside the caddy container). RBAC enforced by `forward_auth` to a new lightweight `GET /api/data/{table_id}/check-access` endpoint (returns 204 / 403) — the bulk transfer never reaches uvicorn. Path discovery via `try_files` over the known extract.duckdb v2 source subdirs. Anything not at a static path falls through to the existing app handler so legacy `src_data/parquet` and future connectors still work without a Caddyfile change. Non-Caddy deployments are unchanged. Stage 1 (multi-worker uvicorn) was considered but blocked by the single-writer DuckDB lock on system.duckdb — workers > 1 would crash at startup on "Could not set lock on file", the same race that pushed the scheduler from in-process writes to HTTP-via-app. Multi-reader workers + single-writer coordination is out of scope for this PR.	2026-05-05 16:41:33 +02:00
ZdenekSrotyr	025a2b5c0e	fix(db): apply bq_query_timeout_ms to read-only reattach path Devin Review on PR #181: caught that the original PR plumbed the new SET into the orchestrator's _remote_attach (rebuild path), the BqAccess factory (materialize path), and the standalone extractor — but missed the actual primary `agnes query --remote` request path: every read-only analytics-DB connection runs `_reattach_remote_extensions` in `src/db.py` on open, and that LOAD bigquery + ATTACH cycle was unconfigured. Without this commit, the very flow the PR was meant to fix — analyst queries hitting BQ views > 90s — would still 400 with the same Binder Error / Job ID wording, because the runtime LOAD bigquery happens here not in the orchestrator's rebuild path. Apply apply_bq_session_settings(conn) right after the BQ secret is created and before ATTACH, mirroring what every other PR site does.	2026-05-05 16:40:40 +02:00
ZdenekSrotyr	4f04235502	feat(bigquery): bq_query_timeout_ms knob; default 600s (was 90s) DuckDB BigQuery extension defaults `bq_query_timeout_ms` to 90 s, which is too tight for analyst-scale queries against view-backed BQ datasets. `agnes query --remote` HTTP 400'd with `Binder Error: Query execution exceeded the timeout. Job ID: ...` whenever the underlying BQ job ran longer than 90 s, even though the job itself was healthy. Add `data_source.bigquery.query_timeout_ms` (default 600 000 ms = 10 min, sentinel 0 falls through to the extension default). Applied via `SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching DuckDB session: orchestrator's `_remote_attach` ATTACH path, BqAccess session factory, and the standalone extractor. Configurable via `/admin/server-config` UI. Fail-soft: extension versions that don't recognise the setting silently keep the default rather than poisoning the session.	2026-05-05 16:40:40 +02:00
ZdenekSrotyr	4751094e1c	fix(keboola): per-table fallback to legacy Storage-API client (#183) * fix(keboola): per-table fallback to legacy Storage-API client The DuckDB Keboola extension's per-table COPY fails with `Schema '..."in.c-..."' does not exist or not authorized` on projects whose Snowflake backend doesn't expose bucket schemas to the storage-token-derived QueryService role (keboola/duckdb-extension#17). ATTACH itself succeeds, so the existing extension-level fallback in `_try_attach_extension` never triggers — the table is just marked failed. - Promote `kbcstorage>=0.9.0` from optional to core dep so the legacy client import in `_extract_via_legacy` doesn't crash default installs with `ModuleNotFoundError`. - Wrap `_extract_via_extension` in a per-table try/except so a scan failure retries via `_extract_via_legacy` instead of recording `tables_failed` and moving on. Slower than the extension path, but produces correct parquets on affected projects while the upstream extension fix lands. * test(keboola): cover per-table extension→legacy fallback Two existing tests mocked _extract_via_extension to throw and asserted the original message survived in result["errors"]. With per-table fallback, the new flow retries via _extract_via_legacy — which on the mock URLs would throw a different (404 / DNS-fail) error, replacing the asserted message. - Mock _extract_via_legacy alongside _extract_via_extension in test_network_timeout_during_extraction + test_partial_failure_continues + test_all_tables_fail_returns_full_failure_stats so the assertion observes the final propagated error from the fallback chain. - Add test_extension_per_table_failure_falls_back_to_legacy that exercises the new behavior directly: extension scan fails with the QueryService schema-not-authorized message (keboola/duckdb-extension#17), legacy succeeds, parquet ends up queryable.	2026-05-05 15:47:44 +02:00
ZdenekSrotyr	4908a0d7a2	Merge remote-tracking branch 'origin/main' into pr180-review # Conflicts: # CHANGELOG.md # pyproject.toml	2026-05-05 15:22:10 +02:00
ZdenekSrotyr	91f2605865	Merge pull request #184 from keboola/vr/remote-query-timeout fix(cli): bump --remote query timeout to 300s, add AGNES_QUERY_TIMEOUT	2026-05-05 15:15:02 +02:00
ZdenekSrotyr	a220955640	release: 0.35.1 — CLI --remote query timeout fix Patch release bundling the only Unreleased change: bump httpx client timeout for agnes query --remote from 30s to 300s (configurable via AGNES_QUERY_TIMEOUT). Renames CHANGELOG [Unreleased] section to [0.35.1] and bumps pyproject version to match.	2026-05-05 15:01:37 +02:00
Vojtech Rysanek	0843c2bd1b	fix(cli): bump --remote query timeout to 300s, add AGNES_QUERY_TIMEOUT The httpx client behind 'agnes query --remote' used the default 30s timeout, killing every BigQuery SELECT that took longer than half a minute — i.e. most non-trivial remote queries. cli/client.py now exposes QUERY_TIMEOUT_S (default 300s, override via AGNES_QUERY_TIMEOUT) and propagates a kw-only 'timeout' through api_get/post/delete/patch. _query_remote passes QUERY_TIMEOUT_S so only the long-running /api/query path gets the bump; every other CLI call keeps the 30s default. Server-side has no read deadline on /api/query, so the client cap was the sole bottleneck.	2026-05-05 16:40:54 +04:00
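A rough illustration of the timeout selection in commit 0843c2bd1b, using only the standard library. The defaults (30 s and 300 s) and the `AGNES_QUERY_TIMEOUT` variable come from the commit message; the function name `resolve_query_timeout` and the fallback on invalid values are assumptions, and the real `cli/client.py` may parse the variable differently.

```python
import os
from typing import Mapping, Optional

DEFAULT_TIMEOUT_S = 30.0         # ordinary CLI calls keep this default
DEFAULT_QUERY_TIMEOUT_S = 300.0  # long-running /api/query path


def resolve_query_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Return the query timeout in seconds, honouring AGNES_QUERY_TIMEOUT.

    Assumption of this sketch: a missing, non-numeric, or non-positive
    value falls back to the 300 s default instead of raising.
    """
    env = os.environ if env is None else env
    raw = env.get("AGNES_QUERY_TIMEOUT")
    if raw is None:
        return DEFAULT_QUERY_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_QUERY_TIMEOUT_S
    return value if value > 0 else DEFAULT_QUERY_TIMEOUT_S
```

The resolved value would then be passed as the keyword-only `timeout` on the query call only, so every other request keeps the shorter default.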
ZdenekSrotyr	8d8d2c219e	refactor(cli-store): pull/info → agnes admin store; add agnes store mine Backup-orchestration commands were split across two namespaces (pull in agnes store, push in agnes admin store), which broke the operator mental model — pull/push are a paired operation and should sit together. Move pull + info into agnes admin store so all bulk operations share one help screen. Add agnes store mine as the user-facing equivalent — calls the same /api/store/bundle.zip endpoint with ?owner=me, which the server resolves to the caller's user_id. Authors can archive their own uploads without admin role; whole-Store bulk reads stay admin-flavored as a discoverability hint. Server: 3-line addition to export_bundle handles owner='me' as a magic alias for the caller. No new endpoint. Tests updated: pull/info expectations move from agnes store to agnes admin store; new tests cover agnes store mine and the ?owner=me server resolution. 69/69 store tests green locally.	2026-05-05 13:49:18 +02:00
ZdenekSrotyr	3d63965a67	Merge remote-tracking branch 'origin/main' into pr180-review # Conflicts: # CHANGELOG.md # app/web/templates/_app_header.html	2026-05-05 12:05:50 +02:00
ZdenekSrotyr	5686a170fb	Merge pull request #179 from keboola/zs/fix-176-session-pipeline fix(#176): wire LLM pipeline into scheduler + close 4 silent-failure defects	2026-05-05 12:01:06 +02:00
ZdenekSrotyr	a8f9d065c8	feat(store): bundle export/import + agnes store update + agnes admin store push Adds whole-Store backup/restore primitives so an external CI/CD job can mirror the Store to a git repo (and restore back from one). REST: - GET /api/store/bundle.zip — deterministic ZIP of all (filtered) Store entities. Layout: manifest.json + entities/<id>/{plugin,assets}/. Manifest carries owner_email for cross-instance restore. Auth: any authenticated user (Store is community-open). - POST /api/store/import-bundle — admin-only restore. Modes merge\|replace\|skip; owner resolution by email with stub-disabled-user fallback when the email is unknown on the target instance. CLI: - agnes store update <id> [--description X] [--zip PATH] ... — in-place edit (server PUT permits owner OR admin per F4). Closes the missing edit affordance for analysts who want to fix a typo or push a new ZIP without losing install_count. - agnes store pull [-o store.zip] [--unpack DIR] — download the bundle. --unpack streams + extracts so an external git-backup workflow can drop the tree straight into a repo and `git add .`. - agnes store info [--json] — counts + size summary. - agnes admin store push <zip-or-dir> [--mode ...] — admin-only restore. Auto-zips a directory client-side so a working-tree → server round-trip is one command. cli/v2_client.py gains api_get_stream helper for binary downloads. Tests: 5 new server tests (bundle shape + filters + round-trip + stub user creation + skip mode + admin-only gate) + 11 new CLI tests (update, pull/unpack, info, admin push). 66/66 store-related tests green locally.	2026-05-05 11:51:31 +02:00
ZdenekSrotyr	952dc9e74d	fix(profile-sessions): tolerate stat() failures on individual jsonl (Devin Review on #179) The previous gather used `sorted(glob, key=lambda p: p.stat().st_mtime)`. A transient OSError (race with delete, permission flicker, EBADF on a weird filesystem) on any single file raised through the lambda and 500-ed the whole page. Reworked: stat each path under try/except into a (path, stat) list, sort the already-statted entries. Bad files drop silently from the listing. Regression test test_profile_sessions_page_tolerates_stat_failures patches Path.stat to raise on one of two files, asserts the page returns 200 with the good row rendered and the bad row dropped.	2026-05-05 09:53:06 +02:00
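The "stat first, sort afterwards" rework in commit 952dc9e74d might look something like the sketch below. The function name `list_session_files` and the oldest-first ordering are assumptions; the commit message does not state the sort direction.

```python
from pathlib import Path
from typing import List


def list_session_files(directory: Path) -> List[Path]:
    """Return *.jsonl files sorted by mtime, skipping files whose stat() fails."""
    entries = []
    for path in directory.glob("*.jsonl"):
        try:
            # Stat each path exactly once, outside the sort key, so one
            # bad file cannot raise in the middle of sorting.
            entries.append((path, path.stat()))
        except OSError:
            continue  # dropped silently from the listing
    entries.sort(key=lambda entry: entry[1].st_mtime)
    return [path for path, _ in entries]
```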
ZdenekSrotyr	d878764ac1	fix(session-collector-api): mirror sibling endpoints' audit-on-exception (Devin Review on #179) Devin flagged that run_session_collector still had the same audit-skip gap I fixed in run_verification_detector and run_corporate_memory in the previous two rounds: a PermissionError walking /home, an OSError on /data/user_sessions mkdir, or any other unhandled exception from collector.run() would skip the audit_log row and only show in docker logs. Same try/except + unhandled_error pattern as the sibling endpoints. All three LLM-pipeline run-* endpoints now record their failures the same way; /admin/scheduler-runs sees them. Regression test in tests/test_admin_run_endpoints.py::TestRunSessionCollector::test_unhandled_exception_still_audits.	2026-05-05 09:31:33 +02:00
ZdenekSrotyr	9ebe991b55	feat(profile): per-session jsonl download from /profile/sessions User feedback during e2e of #179: the listing page is nice but I want to grab the raw jsonl and look at what's inside. Adds GET /profile/sessions/<filename>: - Auth via get_current_user (owner-only). - Path safety: rejects "/", "\", "..", leading ".", and any non-".jsonl" filename. The served path resolves under ${DATA_DIR}/user_sessions/<caller.id>/; if resolution escapes that base directory, returns 404 (never 403, so existence of other users' files isn't leaked). - FileResponse with Content-Disposition: attachment. UI: Download button per row in profile_sessions.html. Tests in test_web_ui.py: path-traversal / nested / dotfile / non-jsonl all 404 for owner; unauthenticated 302/401/403; authenticated owner gets 200 + correct Content-Disposition.	2026-05-05 09:15:12 +02:00
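The path-safety rules listed in commit 9ebe991b55 could be expressed roughly as below. This is a sketch under stated assumptions: `resolve_session_file` is a hypothetical helper, `base_dir` stands for the caller's `${DATA_DIR}/user_sessions/<caller.id>/` directory, and returning `None` represents the 404 case. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path
from typing import Optional


def resolve_session_file(base_dir: Path, filename: str) -> Optional[Path]:
    """Return the resolved file under base_dir, or None when the request must 404."""
    if (
        "/" in filename
        or "\\" in filename
        or ".." in filename
        or filename.startswith(".")
        or not filename.endswith(".jsonl")
    ):
        return None
    base = base_dir.resolve()
    candidate = (base / filename).resolve()
    # Defence in depth: reject anything that resolves outside the base
    # directory (for example through a symlink).
    if not candidate.is_relative_to(base):
        return None
    return candidate if candidate.is_file() else None
```

Treating every rejection identically means a caller cannot distinguish "forbidden" from "missing", which is the stated reason for answering 404 rather than 403.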
ZdenekSrotyr	e86da72997	fix(corporate-memory-api): mirror verification-detector audit-on-exception (Devin Review on #179) Devin flagged that run_corporate_memory still had the same audit-skip gap I just fixed in run_verification_detector: if collect_all() throws anything other than the already-translated ValueError (DuckDB lock, network blip, unexpected SDK error), the audit_log row was never written and /admin/scheduler-runs missed the failure. Same try/except + unhandled_error pattern as the verification_detector fix from `4c4dfee8`. Regression test in tests/test_admin_run_endpoints.py::TestRunCorporateMemory::test_unhandled_exception_still_audits.	2026-05-05 09:11:13 +02:00
ZdenekSrotyr	4c4dfee8e6	feat(profile): /profile/sessions page + audit on detector exception + correct SCHEDULER_AUDIT_ACTIONS Three changes addressing user feedback during e2e test of #179 + Devin Review on `e86dd5ed`. 1) /profile/sessions — new self-service user page in the user menu. Lists all session jsonls the caller uploaded via `agnes push` joined against session_extraction_state. Each row shows uploaded_at, file size, status badge (pending/processed/extracted), processed_at, and items_extracted. The page docstring + help text explicitly call out that items_extracted=0 means the verification detector ran fine but the LLM found no claims to track — that's the documented "no items" outcome, not a broken pipeline. Closes the gap surfaced during the e2e test of #176 where a user could see their sessions on disk and process them through the LLM but had no UI to inspect what happened. 2) run_verification_detector audits unhandled exceptions (Devin #1). If detector.run() threw anything other than the already-translated ValueError, the audit_log row was never written. The endpoint now wraps detector.run in try/except, records the exception in audit_params["unhandled_error"], then re-raises as 500 after audit. The /admin/scheduler-runs page surfaces the failure row with the error type + message. 3) SCHEDULER_AUDIT_ACTIONS list corrected (Devin #2). Previous list had "marketplaces_sync_all" (wrong — actual is "marketplace.sync_all") plus "data_refresh" and "scripts_run_due" which app/api/sync.py and app/api/scripts.py don't write to audit_log. Fixed to the four actually-logged strings; comment points at the missing audit calls as a follow-up. Tests: tests/test_web_ui.py adds TestAdminRoleGuards::test_profile_sessions_page_no_admin_required and tightens test_admin_scheduler_runs_page_admin_only to assert the correct marketplace.sync_all string.	2026-05-05 08:57:35 +02:00
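The audit-on-exception pattern from item 2 of commit 4c4dfee8e6 (later mirrored in e86da72997 and d878764ac1) can be sketched as follows. `run_with_audit` and `write_audit` are hypothetical names, the sketch re-raises the original exception instead of converting it to an HTTP 500, and the `"Type: message"` format of `unhandled_error` is an assumption.

```python
from typing import Any, Callable, Dict


def run_with_audit(
    action: str,
    run: Callable[[], Any],
    write_audit: Callable[[str, Dict[str, Any]], None],
) -> Any:
    """Run a job and always write an audit row, even when the job raises."""
    audit_params: Dict[str, Any] = {}
    try:
        result = run()
    except Exception as exc:
        # Record the failure first, then let the error propagate.
        audit_params["unhandled_error"] = f"{type(exc).__name__}: {exc}"
        write_audit(action, audit_params)
        raise
    write_audit(action, audit_params)
    return result
```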
ZdenekSrotyr	f0d091f721	fix(store): scratch dir leak on ZIP validation failure (Devin Review) create_entity + update_entity created the `scratch` temp dir inside one try/finally but cleaned it up in a separate one. Validation HTTPExceptions raised by _safe_zip_extract (zip_unsafe_path, zip_too_large_uncompressed) or the BadZipFile→422 conversion exited the first scope, and the second finally was never entered → temp dir leaked on every failed upload. Devin flagged this on the F2 commit. The leak pre-existed (zip_unsafe_path was the original vector); F2 added zip_too_large_uncompressed to the same broken cleanup path. Fixed by collapsing scratch creation + cleanup into one outer try/finally that covers both extraction AND metadata/bake; the inner try/except/finally still handles BadZipFile→422 + tmp file cleanup. Same restructure in update_entity. Regression test `test_scratch_dir_cleaned_up_after_failed_extraction` triggers a zip_unsafe_path 422 and asserts tmp/agnes_store_* contains no leaked dirs.	2026-05-05 08:52:15 +02:00
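The cleanup restructuring in commit f0d091f721 comes down to creating and removing the scratch directory within a single try/finally. A minimal sketch, assuming hypothetical `extract` and `bake` callables in place of the real extraction and metadata steps; only the `agnes_store_` prefix is taken from the regression test named in the commit:

```python
import shutil
import tempfile
from pathlib import Path
from typing import Any, Callable


def process_upload(
    extract: Callable[[Path], None],
    bake: Callable[[Path], Any],
) -> Any:
    """Run extract + bake in a scratch dir that is removed on every exit path."""
    scratch = Path(tempfile.mkdtemp(prefix="agnes_store_"))
    try:
        extract(scratch)      # may raise on an unsafe or oversized ZIP
        return bake(scratch)  # may raise on bad metadata
    finally:
        # One outer finally covers both steps, so a validation error in
        # extract() can no longer skip the cleanup.
        shutil.rmtree(scratch, ignore_errors=True)
```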
ZdenekSrotyr	78cad8b235	release: 0.35.0 — /store + /my-ai-stack + security fixes + CLI	2026-05-05 08:18:16 +02:00
ZdenekSrotyr	16373d6b0b	feat(cli): agnes store + agnes my-stack commands Adds CLI coverage for the new REST surface introduced in this PR: agnes store list / show / install / uninstall / upload / delete agnes my-stack show / toggle Covers 11 of the 15 new endpoints — listing, detail, install/uninstall, upload (multipart), delete, my-stack get + curated toggle. Photo / docs download endpoints intentionally skipped; analyst-side automation rarely needs raw bytes back, and the web UI already covers them. cli/v2_client.py: api_post_multipart + api_put_multipart helpers (httpx files= passthrough). api_delete + api_put_json fillers were already needed for non-multipart writes; added together. Tests: tests/test_cli_store.py — help-text smoke tests + happy-path mocked tests for list, install, upload, my-stack show, my-stack toggle. 12 new tests, all green.	2026-05-05 08:18:12 +02:00
ZdenekSrotyr	fd3c76d21b	fix(store): security + correctness blockers found in PR review (F1, F2, F4, F5) Three independent reviews of PR #180 surfaced four real defects in the new Store / my-ai-stack surface. CHANGELOG entries detail each; one-liners: - F1 video_url XSS: any authenticated user could upload a Store entity with `video_url=javascript:...` and pop XSS in any viewer's session via the `<a href=...>` "Watch video" link in store_detail.html. Jinja2 autoescape doesn't block URI schemes inside attribute values. Fixed by scheme-validating to http(s) only on create + update; 400 invalid_video_url. - F2 ZIP decompression bomb: _safe_zip_extract checked path-traversal but not declared file_size totals, so a 50 MB compressed upload at 1:1000 ratio decompresses to 50 GB and can DoS the host disk. Fixed by summing zinfo.file_size across infolist() and refusing > 200 MB before extractall touches disk. 413 zip_too_large_uncompressed. - F4 admin authz parity: PUT /api/store/entities/{id} was owner-only while DELETE allowed owner OR admin; the store-detail page hid Edit/Delete buttons from admin even though DELETE was permitted. Fixed by allowing admin on PUT and passing is_admin to the template; gate is now is_owner OR is_admin everywhere. - F5 cross-owner suffix collision: sanitize_username is many-to-one (alice.smith / alice_smith both → alice-smith). Two such users uploading entities with the same display name produced identical `<name>-by-<username>` suffixes, silently colliding in the served agnes-store-bundle on-disk paths AND the manifest catalog (Claude Code dedupes by plugin.json `name`). Fixed by enforcing global uniqueness on the suffixed value at create_entity; 409 conflict_global_suffix. F3 (ZIP symlink members) was investigated and confirmed to be a false positive: Python's stdlib ZipFile.extractall does not honor symlink mode bits, so no exploit exists. 9 new regression tests in tests/test_store_api.py::TestStoreSecurityFixes covering all four. Test run locally: 60/60 store-related tests pass.	2026-05-05 08:18:02 +02:00
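The F1 and F2 checks from commit fd3c76d21b can be approximated with the standard library alone. This is a sketch, not the project's code: `is_safe_video_url` and `check_declared_zip_size` are hypothetical names, requiring a non-empty host is an extra assumption, and reading "200 MB" as 200 × 1024 × 1024 bytes is also an assumption.

```python
import zipfile
from urllib.parse import urlparse

MAX_UNCOMPRESSED_BYTES = 200 * 1024 * 1024  # assumed reading of "200 MB"


def is_safe_video_url(url: str) -> bool:
    """Accept only http(s) URLs; rejects javascript:, data:, and similar schemes."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_declared_zip_size(
    zf: zipfile.ZipFile, limit: int = MAX_UNCOMPRESSED_BYTES
) -> int:
    """Sum the declared uncompressed sizes and refuse before anything is extracted."""
    total = sum(info.file_size for info in zf.infolist())
    if total > limit:
        raise ValueError("zip_too_large_uncompressed")
    return total
```

Because the size check reads only the archive's central directory, it runs before `extractall` writes a single byte to disk.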
ZdenekSrotyr	e86dd5edc5	fix(anthropic): strict json_schema (additionalProperties=false) + add /admin/scheduler-runs UI E2E test on a real BQ deploy showed every verification-extraction call fails with HTTP 400 invalid_request_error: "output_config.format.schema: For 'object' type, 'additionalProperties' must be explicitly set to false". The Anthropic structured-output API now requires the field on every object node in the json_schema. Fix: connectors/llm/anthropic_provider.py wraps the caller-supplied schema through a recursive _strict_json_schema() walker that adds the field where missing (preserving any explicit override), then passes the strict variant to the API. Six unit tests in TestStrictJsonSchema pin the recursion across nested objects, array items, and the no-mutation invariant. Adds /admin/scheduler-runs — a read-only admin page that surfaces the last 200 audit-log entries from scheduler-driven actions. New AuditRepository.query_actions(actions, limit) helper, new admin nav entry. Failed scheduler ticks (HTTP 401, network errors) don't reach the audit_log; the page calls that out with a hint to set SCHEDULER_API_TOKEN if no rows show up.	2026-05-05 08:00:57 +02:00
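The recursive walker described in commit e86dd5edc5 might be sketched as below. The public name `strict_json_schema` is a stand-in for the repository's `_strict_json_schema()`, whose exact traversal rules are not shown in the commit message; this version visits every nested dict and list.

```python
import copy
from typing import Any, Dict


def strict_json_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with additionalProperties=False on each object node lacking it."""

    def walk(node: Any) -> None:
        if isinstance(node, dict):
            if node.get("type") == "object" and "additionalProperties" not in node:
                node["additionalProperties"] = False
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    result = copy.deepcopy(schema)  # never mutate the caller's schema
    walk(result)
    return result
```

An explicit `additionalProperties` value supplied by the caller is left untouched, in line with the "preserving any explicit override" note in the commit.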
ZdenekSrotyr	9f9aabd72b	fix(corporate-memory): CLI catches fail-fast ValueError, exits 1 with clean message (Devin Review on #179) The PR's #176 fail-fast change made collect_all() raise ValueError when neither an ai: block nor ANTHROPIC_API_KEY/LLM_API_KEY was available. verification_detector's CLI was updated to handle it; corporate_memory's CLI was missed and crashed with an unhandled traceback. services/corporate_memory/collector.py:main() now wraps the collect_all call in try/except ValueError, prints a one-line actionable message to stderr, and returns rc=1. Regression test: test_llm_connector.py::TestCorporateMemoryCollector::test_main_returns_1_on_no_ai_config_instead_of_traceback.	2026-05-05 06:45:10 +02:00


917 commits