agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	9f9aabd72b	fix(corporate-memory): CLI catches fail-fast ValueError, exits 1 with clean message (Devin Review on #179 ) The PR's #176 fail-fast change made collect_all() raise ValueError when neither an ai: block nor ANTHROPIC_API_KEY/LLM_API_KEY was available. verification_detector's CLI was updated to handle it; corporate_memory's CLI was missed and crashed with an unhandled traceback. services/corporate_memory/collector.py:main() now wraps the collect_all call in try/except ValueError, prints a one-line actionable message to stderr, and returns rc=1. Regression test: test_llm_connector.py::TestCorporateMemoryCollector::test_main_returns_1_on_no_ai_config_instead_of_traceback.	2026-05-05 06:45:10 +02:00
ZdenekSrotyr	e68c2d3f0f	fix(session-collector): argv-free run() helper, drop SystemExit footgun (Devin Review on #179 ) run_session_collector called collector.main() which did argparse.parse_args() on uvicorn's sys.argv (['app.main:app', '--host', ...]) → sys.exit(2) → SystemExit(2), which inherits from BaseException, escapes FastAPI handlers, and propagates through the thread pool. Every scheduler tick that fired the endpoint either 500-ed or risked killing the uvicorn worker. services/session_collector/collector.py now exposes run(dry_run, verbose) that returns (rc, stats); main() is a thin CLI shim that parses argv and delegates. The admin endpoint calls run() directly and audit-logs the per-run stats (users_processed, files_copied, files_skipped) instead of just the rc. Three regression tests in TestRunHelper. Closes Devin Review finding on app/api/admin.py:2819 (#179).	2026-05-05 06:31:55 +02:00
ZdenekSrotyr	046d8705ee	docs(changelog): correct "two paths" claim + document new env vars The 0.35.0 entry's 'two paths to a working LLM pipeline' wording was defensible only after the #179 review fixes — on the initial cut, the seeded-overlay path was dead code (consumers imported the static-only loader; even when they didn't, env refs in the overlay weren't resolved). Updated Defect 5's bullet to spell out what was broken and what shipped, and added a new bullet for the scheduler-cadence env-var fix. Added the two new test modules under Internal.	2026-05-05 06:05:27 +02:00
ZdenekSrotyr	567385d046	release: 0.35.0 — session pipeline fix (BREAKING) (#176 ) Five compounding defects on default `docker compose up` deploys made the session pipeline silently broken: sessions uploaded by analysts via `agnes push` landed on /data/user_sessions/<user>/.jsonl but nothing ever processed them. Fix is one PR: promote anthropic + openai to core deps, wire all three LLM-pipeline jobs into scheduler-v2 with offset cadences (10m/15m/17m), drop the side-car services from compose, seed a default ai: block on first-time setup with an env-var fallback in code, surface the pending review queue to admins, and expose a health check that warns when uploaded jsonls aren't being processed. BREAKING* for operators on COMPOSE_PROFILES=full or with custom Compose overrides referencing the corporate-memory or session-collector service stanzas — drop them. The scheduler is now the sole driver.	2026-05-05 00:46:27 +02:00
ZdenekSrotyr	0430c0de00	release: 0.34.0 — clean analyst bootstrap (BREAKING) + bundled fixes Headlines: - Clean analyst bootstrap rewrite: web /setup → paste prompt → Claude Code in empty folder = working analyst workspace. CLI binary renamed da → agnes. See CHANGELOG ## [0.34.0] for the full breaking-change matrix. - Unified /setup flow: collapsed the admin/analyst tile split (the ?role= query parameter introduced mid-cycle is gone). Every signed-in user sees the same flow; marketplace + plugins block emitted iff caller has plugin grants. PAT scope uniform (general 90 d). - Bundled fixes: supersedes #172 (Windows console encoding), merges #174 (BigQuery materialize view fix + concurrency, schema v24 migration), closes #171 (--remote query pre-check no longer over-rejects narrow queries on partitioned tables, ~30,000x over-estimate fix). - Devin Review findings addressed throughout the cycle: query.py:464 (rewriter cross-contamination), extractor.py:166 (TTL reclaim dead code), db.py:1757 (v24 migration retry path), init.py:99 (stale on-disk token override), and more. - Operator UX: register-table now requires --bucket for materialized rows + emits first-sync and grant hints on success. agnes status sessions counter reads from ~/.claude/projects/<encoded-cwd>/. agnes init --token now wins over stale ~/.config/agnes/token.json. Open follow-ups (separate issues): - #175 sync architecture redesign (full-extract Keboola, full-file downloads, user-global sync_state) - #177 admin CLI: missing unregister-table / update-table commands - #178 agnes diagnose: introduce "info" severity tier	2026-05-04 23:13:23 +02:00
ZdenekSrotyr	0612c1e1a1	fix(schema-v24): raise on deferred migration so retry path actually runs (Devin Review on db.py:1757) Pre-fix: when v24 migration found rows to migrate but data_source.bigquery.project was empty, it logged a warning per row and returned normally. Schema_version then bumped to 24 unconditionally → next start's 'if current < 24:' gate skipped _v23_to_v24_finalize forever, leaving rows in DuckDB-flavor SQL that the new _wrap_admin_sql_for_jobs_api wrapping path rejects. Devin escalated this from advisory ("idempotent retry") to critical on rescan after my reply. The reply was wrong — the LIKE filter inside the function gives idempotency IF the function is called again, but the schema-version gate prevents that call from happening. Fix (Devin's recommended Approach 1): raise RuntimeError BEFORE the schema-version bump when rows need migration but project_id is empty. The schema_version stays at 23, so on next start the 'if current < 24:' gate fires and the migration runs again — this time with project_id configured. Side effect: a BQ-using deployment that hasn't set the project blocks startup until they do. That's the right call for a config error that would otherwise silently break all materialized tables. The error message points at the right knob (data_source.bigquery.project + restart). No-rows-no-block invariant preserved: the early 'if not rows: return' at the top of _v23_to_v24_finalize means non-BQ deployments are unaffected. Tests: - test_v24_raises_when_project_not_configured_and_rows_need_migration: asserts raise + schema_version stays at 23 (the load-bearing invariant for retry-on-next-start to work) - test_v24_skips_clean_when_no_rows_match_even_without_project: asserts non-BQ deployments don't block startup - Existing 3 tests still pass	2026-05-04 23:11:34 +02:00
ZdenekSrotyr	36012e0833	fix(admin): register-table real-world UX gaps for materialized BQ Three items from operator feedback after running the actual flow: (1) Help docstring lied: "--bucket / --source-table ignored" for materialized rows. Reality: --bucket is load-bearing because `agnes schema <name>` builds the BQ identifier as `bq.<bucket>.<source_table>`. An empty bucket registered the row but broke schema/describe with HTTP 400 "unsafe BQ identifier in registry". Fix: docstring rewritten to reflect reality, plus client-side validation rejects materialized + empty bucket with a clear error pointing at the right knob. (2) Post-register UX cliff: `agnes pull` after register-table reports "Updated 0 tables (1 total)" because registration adds a registry row but does NOT trigger a parquet build. Operators routinely assume something's broken when they need to run `agnes setup first-sync` to kick off the materialization. Hint emitted on success now points at first-sync. (3) RBAC gotcha: `agnes catalog` is RBAC-filtered via `resource_grants`, so non-admin users don't see freshly-registered rows until a grant is created. Hint emitted on success now points at `agnes admin grant create <group> table <name>`. Tests: 8/8 in test_cli_admin_materialized.py, including two new regression tests for the validation + the hint output.	2026-05-04 23:06:17 +02:00
ZdenekSrotyr	5915f92eaa	fix(query-guardrail): single-pass alternation regex (Devin Review on query.py:464) The iterative bare-name rewriter (one re.sub per name, longest-first) was vulnerable to cross-contamination when the GCP project ID contained a registered table name as a hyphen-delimited word. Concrete repro: project = 'my-ue-project' registered = ['orders', 'ue'] user SQL = 'SELECT * FROM orders JOIN ue ON ...' iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...' iter 2 (ue): '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen creates word boundary on both sides) — corrupts the iter-1 path Fallback at query.py:576 caught the resulting BQ parse error and fell back to per-table SELECT * estimate, so impact was over-estimation, not fail-open — but the #171 partition-pruning fix silently degraded to pre-fix behavior whenever a project name shared a hyphen-segment with a registered table. Fix: single re.sub call with an alternation regex sorted longest-first. Single-pass means each source position is processed exactly once, so freshly-inserted backticked text from one match isn't re-scanned by later names in the alternation. Regression test test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name covers the exact Devin repro.	2026-05-04 22:51:33 +02:00
ZdenekSrotyr	c432e90f62	fix(bq-materialize): TTL reclaim was dead code (Devin Review on extractor.py:166) `_try_acquire_file_lock` opened the lock file with `open(mode='w')` BEFORE the mtime check, which truncated the file and refreshed mtime to now. The subsequent age check always saw ~0, so the TTL reclaim branch was never reachable and `materialize.lock_ttl_seconds` was a silently no-op config knob. Repro: before open(w): mtime age = 100000s after open(w): mtime age = 0s Fix: stat the lock path BEFORE any open(). If pre-probe mtime is older than TTL, unlink (forcing a fresh inode for the open + flock that follows). Order is now stat-then-decide-then-probe, not probe-then-stat-then-decide. Two regression tests added in tests/test_bq_materialize_concurrency.py: - test_stale_held_lock_is_reclaimed_despite_live_holder — exercises the full reclaim path with a still-living fcntl holder. Pre-fix this returned None (in_flight forever); post-fix returns a holder fd on a new inode. - test_failed_probe_does_not_self_refresh_lock_mtime — sister test pins that a failed acquisition's mode='w' truncate doesn't pathologically loop. Residual cross-process risk (genuinely overrunning materialize past TTL races a fresh attempt — both write to the same parquet.tmp, inode-level flock independence means new acquisition succeeds while old holder is still alive) stays documented in the helper docstring. In-process threading.Lock keyed on table_id blocks the single-process race; cross-process protection relies on TTL being well above longest plausible COPY (24h default).	2026-05-04 22:36:56 +02:00
ZdenekSrotyr	ed969f5e37	docs(changelog): unified /setup flow under Unreleased Replace the analyst-vs-admin `?role=` design summary with the unified flow we're shipping: single tile, single PAT-mint shape (general / 90 d), `agnes init` mandatory for everyone, marketplace block gated by `resource_grants`, pre-flight check now validates both git and claude. The intro paragraph references the 10-task unification follow-up and the `?role=` introduction-and-removal cycle so a future operator reading the diff doesn't think they missed a release. Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 9.	2026-05-04 22:19:57 +02:00
ZdenekSrotyr	8784f10a6b	fix(devin-review): stale-token override + status sessions counter + lock comment Three Devin Review findings on PR #173 addressed in one commit since they're in adjacent code paths: 1. cli/commands/init.py:99 (\u{1F534}): `agnes init --token NEW` ran step 2 verify against the OLD on-disk token because `get_token()` read `~/.config/agnes/token.json` before the env var, and `_override_server_env` only set the env var. So `agnes init --force` on a machine with a stale token.json failed 401 with a confusing 'token expired' even though the --token arg was valid. Fix: ContextVar-based override in `cli.config._token_override` checked by `get_token()` BEFORE the on-disk read. `_with_token_override` context manager scopes the override. `_override_server_env` now also sets the contextvar via `_with_token_override(token)`, so both env var and contextvar carry the override (env for back-compat with anything bypassing get_token; contextvar is the authoritative source). Async-safe (each task sees its own override) and leak-proof (resets on context exit). 2 new tests: regression on stale-disk-token + scope leak guard. 2. cli/commands/status.py:43 (\u{1F7E1}): sessions_pending_upload only checked legacy `<workspace>/user/sessions/` and always reported 0 in workspaces bootstrapped with `agnes init` (Claude Code writes to `~/.claude/projects/`, not the legacy path). Same bug we fixed for `agnes push` in `08e49591`. Fix: route through `cli.lib.claude_sessions.list_session_files()` so status and push agree on what counts as a pending session. 3. connectors/bigquery/extractor.py:111 (\u{1F7E1}): docstring claimed "a live holder still wins the second flock attempt" — incorrect on Linux. After `unlink()` + `open()`, the new file is a new inode; fcntl.flock keys per-inode, so the old holder's lock does NOT block the new acquisition. In a genuine TTL-overrun scenario two writers CAN race the parquet.tmp. Fix: documentation only. Comment now honestly describes the inode-recreation behavior, names the threading.Lock as the actual in-process guard, and flags pid-gating as the next-iteration fix if real corruption surfaces. The 24h default TTL is well above typical COPY durations so the practical risk is low. Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader regression set.	2026-05-04 21:26:30 +02:00
ZdenekSrotyr	8233c3e3f9	chore(docs): replace stale `da` verbs and vendor-specific install paths Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE, docs/architecture, docs/sample-data, docs/agent-workspace-prompt, docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery), the corporate-memory service README, the jira connector README + backfill scripts, the deploy skill, and test docstrings. Replaces `da sync` → `agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` → `agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` → `agnes snapshot create`, plus the matching docker-compose admin invocations. Vendor-specific `/opt/data-analyst/` install paths in jira backfill / consistency scripts and operator docs are replaced with the placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override that lets a deployment inject its actual install path without a code change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md. CHANGELOG `### Internal` entry summarizes the audit and reaffirms the intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`) that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade and override-detection logic.	2026-05-04 21:22:19 +02:00
ZdenekSrotyr	976d0c7160	fix(pull): re-download parquet when file missing despite matching hash Pre-fix `agnes pull` decided what to download from sync_state hash equality alone: if server_hash != local_hash or tid not in local_tables or not server_hash: to_download.append(tid) If the recorded local hash matched server but the actual parquet had been deleted from disk, the download was skipped. The next DuckDB view rebuild then fails on a missing file. Repro: `rm server/parquet/X.parquet && agnes pull` → 'Updated 0 tables', X still missing. Failure modes that produce hash-equal-but-file-missing: - manual `rm` of a single parquet - operator-side cleanup of `server/parquet/` - two workspaces sharing one user's `~/.config/agnes/sync_state.json` (TODO(workspace-scoped-sync-state) in pull.py): one workspace writes its parquets, the other reads sync_state and concludes 'I already have these' - disk corruption / partial restore from backup Fix: existence check runs alongside the hash compare. Missing file forces a re-download regardless of hash equality. `parquet_dir` is hoisted above the loop so the existence check is in scope when the download set is built. Tests: regression test for the hash-equal-but-missing-file case + counterpart for the fast-path (hash-equal-and-file-present must still skip).	2026-05-04 21:12:06 +02:00
ZdenekSrotyr	500db8cd3c	fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171 ) Closes #171. The /api/query cost guardrail used to dry-run a synthetic `SELECT * FROM <table>` for each registered remote-BQ row referenced by the user SQL — which made BigQuery estimate a full table scan, with column projection, predicate pushdown, and partition pruning all disabled. Narrow queries on big partitioned/clustered tables (the documented happy path for `agnes query --remote`) hit ~30,000× over-estimates and got rejected with 400 `remote_scan_too_large` even when BQ's own dry-run reported single-digit MB. Pavel's report on #171 traced the root cause and proposed the fix: rewrite the user SQL to BQ-native syntax and dry-run it as a single job, exactly the way `bq query --dry_run` works. Implementation: - New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered names (word-boundary, case-insensitive, longest-first to avoid prefix collisions) + bq."<ds>"."<tbl>" forms to backticked `<project>.<ds>.<tbl>` paths. - _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap check uses the real estimate. - Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only syntax like ::INT casts), the guard falls back to the pre-fix per-table SELECT * approach so non-portable queries still get a (loose) cap estimate instead of fail-opening. Non-parse BQ errors (forbidden, upstream) still propagate as 502. - _bq_guardrail_inputs now also returns name_lookups so the rewriter has the (registered_name, bucket, source_table) mapping it needs. - Per-table breakdown is unavailable from a composite dry-run; total bytes are pinned to dry_run_set[0] for the post-flight record_bytes(sum(...)) call to keep returning the right total. Tests (7 new, 3 existing still pass): - dry-run receives rewritten user SQL with WHERE clause intact (the load-bearing assertion for #171) - single dry-run per request even with multiple registered tables (JOIN, UNION) referenced - fallback to per-table SELECT * on bq_bad_request - non-parse BQ errors (forbidden) still 502 - rewriter unit tests: bare + bq.path in same SQL, longest-name-wins on prefix collision, case-insensitive bare-name match	2026-05-04 21:08:21 +02:00
ZdenekSrotyr	e438170ade	merge: pull #174 (BQ materialize view fix + concurrency, 0.33.0) into bootstrap branch Brings in zs/materialize-sync-fix (PR #174): - BigQuery view materialize works (wrap admin SQL in bigquery_query()) - Per-table mutex + fcntl.flock for concurrent COPY corruption - Cost guardrail dry-run engages on materialized rows - Schema v23 -> v24 migration: rewrite source_query to BQ-native - Server-generated trivial source_query from bucket+source_table - Validator backtick relaxation for materialized rows - 0.33.0 release cut Conflict resolution: - CHANGELOG.md: keep our [Unreleased] (bootstrap rewrite content) ABOVE the new [0.33.0] section from #174. The bootstrap rewrite remains unreleased; it'll cut 0.34.0 (or later) when this PR merges to main. - tests/conftest.py: union — keep our analyst-bootstrap fixture re-export AND #174's bq_instance / stub_bq_extractor fixtures. - pyproject.toml auto-merged to 0.33.0 (matches the cut), correct. - src/db.py auto-merged: SCHEMA_VERSION = 24, _v23_to_v24_finalize added — no overlap with our work which left schema at v23. - CLAUDE.md auto-merged: schema-history paragraph extended with v24. Verified: 79/79 across CLI bootstrap suite + materialize suite + schema v24 migration tests pass locally on Python 3.13/macOS.	2026-05-04 20:53:00 +02:00
ZdenekSrotyr	ee83cebbda	fix(cli): Windows console crash on cs-CZ codepage (port + broaden #172 ) Ports Minas's PR #172 (against pre-rename `da` CLI on main) and applies the principle to the post-rename `agnes` CLI. Two distinct failure modes on Windows consoles whose default codepage is cp1250 (cs-CZ) / cp1252 (en-US): 1. `agnes pull` and other Rich-progress codepaths UnicodeEncodeError on Braille spinner glyphs. Fix: `cli/main.py` reconfigures stdout/stderr to UTF-8 with errors='replace' at import time on `sys.platform == 'win32'` so Rich's legacy-Windows render path emits decodable bytes. Wrapped in try/except so pytest's captured streams (which aren't TextIOWrapper) don't break. 2. `agnes skills list` and `agnes skills show` UnicodeDecodeError when reading skill markdown containing em-dashes / accented chars. Default `Path.read_text()` uses locale.getpreferredencoding(False), which is the broken codepage on Windows. Fix: every call site passes encoding='utf-8' explicitly. Broader scope than #172 because: - The bootstrap rewrite renamed/removed several files Minas's PR patched (`cli/commands/analyst.py` -> rolled into init.py; `cli/commands/sync.py` -> split into pull/push). Those targets no longer exist; the equivalent code lives in init.py. - Other call sites Minas didn't touch (still bare in his branch) are patched here too — config.py / update_check.py / snapshot_meta.py / setup.py / skills.py — so the codebase has zero locale-default text I/O in cli/. Side cleanup: stale `Run `da`` reference in snapshot_meta.py:88 fixed to `agnes` while touching the file.	2026-05-04 20:45:29 +02:00
ZdenekSrotyr	e323ab76cc	fix(snapshot): catch httpx transport errors in --estimate path CI failure: test_readers_in_pre_init_dir asserted no Traceback in stderr when running `agnes snapshot create x --as y --estimate` in a folder that never saw `agnes init`. The estimate-guard fix in `3d587681` let `--estimate` skip the local_db check and reach `api_post_json`, but the existing `except V2ClientError` doesn't cover transport-layer failures. With no server configured the URL defaults to http://localhost:8000; httpx raises ConnectError → ConnectError isn't a V2ClientError → the exception bubbles up through Typer/rich as a full traceback. Add `except httpx.HTTPError` next to V2ClientError so connection / DNS / TLS / timeout failures all render the friendly hint `Run `agnes init …` first` instead of leaking transport noise.	2026-05-04 20:36:30 +02:00
ZdenekSrotyr	cd3293b994	release: 0.33.0 — BQ materialize view fix + concurrency control	2026-05-04 20:30:50 +02:00
ZdenekSrotyr	08e4959185	fix(push): read sessions from ~/.claude/projects/<encoded-cwd>/ Real bug: `agnes push` was reading `<workspace>/user/sessions/`, but Claude Code writes session jsonls to `~/.claude/projects/<encoded-cwd>/` and nothing on the analyst side ever copies them across. The SessionEnd hook ran `agnes push` happily and uploaded zero sessions every time. `cli/lib/claude_sessions.py` probes both Claude Code encoding variants (older `/`→`-` keeping spaces+tildes; newer all-non-alphanumeric→`-` with collapsed runs) and unions whichever exist. Users who upgraded Claude Code mid-project end up with both encoded dirs side-by-side on disk; the union ensures no session is left behind. Same-named jsonl in both dirs → newest mtime wins. `<workspace>/user/sessions/` survives as a fallback for any setup that explicitly mirrors sessions there. Verified on real disk: helper returns 2 dirs + 8 unioned session files for the Agnes-test workspace where the previous code returned 0.	2026-05-04 20:29:59 +02:00
ZdenekSrotyr	d44cace17c	docs(changelog): clean-analyst-bootstrap rewrite (BREAKING)	2026-05-04 19:25:38 +02:00
ZdenekSrotyr	f731ee7897	feat(setup): /setup?role=analyst\|admin branching with role tiles	2026-05-04 17:28:47 +02:00
ZdenekSrotyr	cf8930b593	chore(release): cut 0.32.0 — #160 da query --remote on VIEW + 4 reinforcing fixes CHANGELOG: rename [Unreleased] → [0.32.0] — 2026-05-04, prepend a new empty [Unreleased] for next-PR landing zone. pyproject.toml: version 0.31.0 → 0.32.0. Per repo discipline (memory: feedback_release_cut_with_pr.md), the release-cut commit lands as the FINAL commit of the PR that contained the user-visible behavior change — it does not get a separate PR. After merge: tag v0.32.0 on the merge commit + create a GitHub Release (memory: feedback_github_release_per_tag.md — the tag alone isn't enough; the Release prose is the operator-visible artifact). Headline: closes #160. da query --remote now resolves query_mode='remote' BQ rows whose entity is VIEW or MATERIALIZED_VIEW (the bug Pavel hit). Plus 4 reinforcing fixes — server-side cost guardrail (bq_max_scan_bytes, default 5 GiB), registry-gating of direct bq.* paths, bigquery_query() function-call backdoor closed, structured CLI render of typed BQ errors — and one operator-side admin convenience (BQ test-connection endpoint + billing_project placeholder UI). 14 issues caught and addressed across 6 iterations of Devin Review. E2E verified on agnes-zsrotyr.groupondev.com (commit `7f743d03`): - VIEW path resolves (count=23 from active_inventory_view) - VIEW aggregate parity vs filtered BASE TABLE - cost guardrail rejects with structured 400 detail - bq_path_not_registered 403 (incl. quoted "bq" variant) - bigquery_query() blocklist 400 - test-connection endpoint 200 with elapsed_ms	2026-05-04 14:37:52 +02:00
ZdenekSrotyr	7f743d0392	fix(cli): #168 review iter 6 — render empty-string diagnostics Devin Review iter #6 found 2 issues. 🟡 BUG: cli/error_render.py filtered out empty-string values via `detail[key] not in (None, "")` and `value not in (None, "")` before they could reach `_kv_line`. But `_kv_line` was specifically designed to render empty strings as `(empty)` — the filter shadowed that branch. The hidden field happens to be the most operator-actionable one in `cross_project_forbidden`: `billing_project: ""` is the exact diagnostic confirming WHY USER_PROJECT_DENIED fires. Change filter to `is not None`. Empty strings now flow through `_kv_line` and render as `billing_project: (empty)`. 📝 ANALYSIS: CHANGELOG wording for the test-connection endpoint said "the saved data_source.bigquery config", which Devin flagged as slightly misleading because `get_bq_access` is `@functools.cache`d — "Test connection" tests the config in the running process, not the just-saved YAML overlay. The save flow already returns `restart_required: True` and the UI shows a banner, so the behavior is documented; only the CHANGELOG wording was loose. Tightened to "the process-cached BqAccess... Tests the config active in the running process — after a save the response includes restart_required; click Test AFTER restart to validate the freshly-saved values." New test: test_renders_empty_string_as_empty_marker locks in the empty-string-as-(empty) rendering for the cross_project_forbidden case so a future filter change won't silently drop the diagnostic again. 9 affected render tests pass.	2026-05-04 14:30:43 +02:00
ZdenekSrotyr	1263b80726	fix(query): #168 review — concurrent-slot wraps execute, doc/JS fixes Devin Review on PR #168 found 5 issues — all real, all addressed. 🚩 ANALYSIS_001 (architectural): concurrent-slot guard didn't protect actual BQ query execution. Earlier `_enforce_remote_bq_quota_and_cap` ran dry-run + cap check inside `with quota.acquire(user_id):`, then returned — releasing the slot BEFORE `analytics.execute(...)` ran. Spec §4.3.3 explicitly designs the slot to wrap execute so the per-user concurrent cap limits BQ scans, not just dry-runs. Refactor to a context manager `_bq_quota_and_cap_guard`. Caller's `with` block now holds the slot through dry-run, cap check, the actual `analytics.execute(...)` (which is what triggers the BQ scan when DuckDB resolves the master view), AND the post-flight record_bytes. Slot released only when caller's `with` body exits. 🟡 BUG_001: placeholder JS walked `original` (full GET payload root) instead of `original.sections`. `placeholder_from: ["data_source", "bigquery", "project"]` is a section-relative path, so billing_project placeholder NEVER rendered. Fix: walk `original.sections` (with fallback to `original` for safety). 🟡 BUG_002 + BUG_003: admin_tables.html register and edit modals' operator help text referenced `max_bytes_per_remote_query` (the old name from the spec) but the actual config key is `bq_max_scan_bytes` after the fix-up commit `6423888d` moved it. Replace both occurrences. 🟡 BUG_004: CHANGELOG entry said `api.query.bq_max_scan_bytes` (the old path) but the read at app/api/query.py:53 is `get_value("data_source", "bigquery", "bq_max_scan_bytes", ...)`. An operator who set it under `api.query` in their yaml would have no effect. Correct path in CHANGELOG. All 95 #160-affected tests pass after the changes.	2026-05-04 13:28:03 +02:00
ZdenekSrotyr	b2d54126dc	docs(query): #160 align CLAUDE.md/skills/CHANGELOG with new --remote behavior + cost guardrail Fixes the rails docs that PR #154 over-promised. The reporter (#160) tried `da query --remote` against a VIEW row and saw a catalog error; the previous version of the docs said this would work as a one-shot server-side execution. Now it actually does (see prior commits), but the docs also need to acknowledge the new cost guardrail and the registry-gated direct-bq path. Touched files: - CLAUDE.md (root, "Querying Agnes data — agent rails"): the `da query --remote` bullet under "Choose the right tool" now spells out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry + the 5 GiB scan cap + the registry-gating of direct bq.* paths. "When NOT to use `da fetch`" decision matrix updated with a separate row for VIEW aggregates so analysts see why the cap might trip. - config/claude_md_template.txt (PR #154's analyst CLAUDE.md): three-patterns table caveat for the cost guardrail. - cli/skills/agnes-data-querying.md: `When NOT to use da fetch` matrix updated with the same VIEW caveat + registry-gating note. - cli/skills/agnes-table-registration.md:121: replaced the example that suggested raw `bq."<project>.<dataset>.<table>"` syntax (now blocked by the RBAC patch) with the registered-name form. - CHANGELOG.md: full Unreleased entry with Added (Test Connection endpoint + cost-cap server-config knob + placeholder UI), Fixed (the five #160-class fixes: VIEW resolution, RBAC patch, blocklist, bigquery_query() blocking, CLI render, hybrid endpoint detail flattening), Changed (BREAKING legacy_wrap_views removal + quota relocation). 140 tests pass across the issue-affected files.	2026-05-04 10:33:06 +02:00
ZdenekSrotyr	68b9fca3cc	docs(changelog): note CLAUDE.md write-outcome summary fix in [0.31.0]	2026-05-04 07:20:56 +02:00
ZdenekSrotyr	95bf420af3	release(0.31.0): cut Agent Setup Prompt + Agent Workspace Prompt	2026-05-04 07:04:43 +02:00
ZdenekSrotyr	65e39a087d	docs+tests: Agent Workspace Prompt + drop stale BREAKING markers - CHANGELOG.md: add Agent Workspace Prompt bullets under [Unreleased]; remove stale BREAKING (CLI) and BREAKING (API) bullets about CLAUDE.md removal and GET /api/welcome deletion — both behaviors are restored in this PR; replace with a neutral Changed bullet describing da analyst setup writing CLAUDE.md - docs/agent-workspace-prompt.md: operator reference for the feature (when written, editing via UI/API, template language, full placeholder table, Jinja2 examples, reset to default)	2026-05-03 22:44:22 +02:00
ZdenekSrotyr	53f841f244	release(undo): un-cut 0.31.0 — re-cut once Workspace Prompt feature also lands	2026-05-03 22:27:22 +02:00
ZdenekSrotyr	26dc367037	release(0.31.0): cut Agent Setup Prompt + BREAKING CLI/API removals	2026-05-03 21:03:57 +02:00
ZdenekSrotyr	dc931a6556	feat(admin-prompt): default = live setup script; override replaces /setup content The /admin/agent-prompt editor now pre-fills with the full bash bootstrap script from setup_instructions.resolve_lines() instead of being empty. When an admin saves an override it replaces the default everywhere — the /setup page display and the dashboard clipboard CTA — rather than adding a banner above the auto-generated commands. GET /api/admin/welcome-template now returns a `default` field with the live computed script so the editor always shows meaningful starting content. {server_url} and {token} single-brace placeholders survive Jinja2 rendering and are substituted by JavaScript at clipboard-copy time as before. Preview pane switches to textContent (not innerHTML) since content is bash.	2026-05-03 16:31:35 +02:00
ZdenekSrotyr	8f71af6c22	docs(changelog): variant C — banner-on-setup model Update [Unreleased] to reflect the actual shipped behavior: - banner on /setup replaces CLAUDE.md generation - BREAKING: da analyst setup no longer writes CLAUDE.md - BREAKING: GET /api/welcome removed - schema v21/v22 notes corrected - drop sync_interval bullet (not part of this feature set)	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	60386b9c3c	polish: drop dead CSS, fix docstring drift, add agent-prompt route test	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	c7b14fb120	feat(admin): drop setup_banner feature; consolidate into single editor Remove the setup_banner feature (admin-editable /setup page banner) and all associated code: API router, repository, renderer, admin template, tests, and docs. The setup_page handler no longer calls render_setup_banner; the install.html template no longer renders banner_html. The setup_banner DuckDB table (v22) is kept intact for forward-compat with already-migrated instances — only the application code is removed. CHANGELOG updated: setup_banner bullets removed; Agent Setup Prompt (welcome-template feature) now stands alone as the single editable prompt.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	0ee22f8fb0	docs: add setup-banner.md + rename migration test to test_db_schema_version.py - Add docs/setup-banner.md: placeholder table, autoescape semantics, security note on post-render stripping, diff table vs welcome-template (M-9). - Update CHANGELOG.md to reference docs/setup-banner.md. - Rename tests/test_db_migration_v20.py → tests/test_db_schema_version.py (file tested SCHEMA_VERSION==22, not just the v20 step; clearer name) (M-10).	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	39146288e1	feat: admin-editable setup_banner on /setup page (schema v22) Adds an optional Jinja2/HTML banner displayed above the bootstrap commands on /setup. Empty by default; admin authors it at /admin/setup-banner. autoescape=True — safe for HTML context. Render failures return "" so a broken banner never breaks /setup. Schema v22: setup_banner singleton table, auto-migration v21→v22.	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	b579f119b5	docs(changelog): customizable welcome prompt	2026-05-03 16:12:13 +02:00
ZdenekSrotyr	13ab464ac5	Merge branch 'main' into zs/fix-health-e2e-tests	2026-05-03 15:55:02 +02:00
ZdenekSrotyr	f348296685	fix(tests): align docker-e2e health asserts with current /api/health shape `/api/health` is the auth-free LB probe — returns `status` + `db_schema` only. `version` lives in `/api/version` and the richer `services.duckdb_state` lives in `/api/health/detailed` (auth-gated). The two e2e asserts had drifted and broke nightly on main.	2026-05-03 11:21:19 +02:00
ZdenekSrotyr	91caefaca9	security(auth): per-IP rate limit + last-admin guard (#165 ) * security(auth): per-IP rate limit on auth endpoints + generalize last-admin guard Closes #45 and #151. #45 — every auth endpoint was unthrottled (login, magic-link, token, bootstrap), leaving us open to password brute-force and SMTP email-bombing. Wires slowapi (new dep) into the middleware chain with per-route limits: 10/min on login + token, 5/min on send-link, 3/min on bootstrap. Returns 429 with Retry-After: 60 once exceeded. Per-IP key respects the leftmost X-Forwarded-For hop (Caddy in front of the app strips client-supplied XFF). Operator escape hatch: AGNES_AUTH_RATELIMIT_ENABLED=0. Test suite disables the limiter via autouse conftest fixture so existing auth tests that hammer endpoints in tight loops are unaffected. #151 — DELETE /api/admin/users/{id}/memberships/{group_id} and the mirror DELETE /api/admin/groups/{group_id}/members/{user_id} only guarded against self-removal as last admin. Generalizes to refuse removing anyone from the seeded Admin group when they are the only remaining active admin (mirrors the existing count_admins(active_only=True) <= 1 check on delete_user / update_user). Recovery from zero admins requires direct DB access, so this closes a path where a scheduler/bootstrap actor that bypasses normal admin checks could otherwise empty the group. * security(auth): throttle remaining email-bombing + token-confirm endpoints Address code-review gap on PR #165 — the first commit covered /send-link but missed two endpoints with the IDENTICAL email-bombing surface: - POST /auth/password/reset — sends reset mail, anti-enum response - POST /auth/password/setup/request — sends setup mail, anti-enum response Both now share the 5/min limit with /send-link. Also add 10/min to the token-confirm surfaces — high-entropy tokens but partial leaks via logs / referer have surfaced before, and unbounded guess rate would let an attacker exhaust the keyspace adjacent to a leaked prefix: - POST /auth/email/verify - GET /auth/email/verify — closes the click-through bypass - POST /auth/password/reset/confirm - POST /auth/password/setup/confirm Doc fix: rate_limit.py module docstring + CHANGELOG entry no longer claim "disable without a redeploy" (misleading). The Limiter constructor freezes `enabled` from env at import time, matching every other Agnes env knob — operators set the flag and bounce the container. Tests: 4 new cases in test_auth_rate_limit.py covering /reset, /setup/request, /reset/confirm, GET /verify. Full suite: 2583 passed, 32 skipped, 0 failed. * security(auth): throttle JSON /auth/password/setup — closes form-throttle bypass Second code-review pass on PR #165 caught a fifth gap: POST /auth/password/setup (JSON variant, kept for backward compat) consumes the same setup_token as the web form /setup/confirm but was unthrottled — an attacker brute-forcing the token just switches from the form path to the JSON path and resumes at unbounded RPS. Apply the same 10/min limit and signature shape used on /setup/confirm. Also extend CHANGELOG note about the JSON-variant bypass for future operators reading the security entry. Test: 1 new case (test_password_setup_json_rate_limited_after_10_requests), 9 rate-limit tests + 28 password-flow tests + 41 auth-provider tests pass, no regressions. * chore(release): cut 0.30.1 — auth security hardening (rate limit + last-admin guard)	2026-05-02 21:08:33 +02:00
ZdenekSrotyr	dc03837a7b	feat(query-api): better error message when --remote query references a materialized-but-not-rebuilt id E2E sub-agent finding: `da query --remote "SELECT * FROM <id>"` against a materialized table that hasn't yet been rebuilt in the server's analytics.duckdb returns a confusing DuckDB "Table does not exist" message even though the table is in the registry. Materialized rows produce parquets at `${DATA_DIR}/extracts/<source>/data/<id>.parquet`, but the orchestrator's master-view creation is `_meta`-driven — fresh instances or pre-tick states have the registry row without a corresponding view, so analysts hit the bare "does not exist" with no path forward. Improve the error rendering in `app/api/query.py:execute_query`. When DuckDB raises a "table does not exist" error, scan the registry for any `query_mode='materialized'` row whose id or name appears in the failed SQL. On a hit, return a 400 whose detail names the table, explains the materialize state, and offers two concrete next steps: 1. Run `da sync` (or wait for the scheduler tick / hit POST /api/sync/trigger) to materialize the parquet, OR 2. Query the source directly via the catalog alias when the registry row carries bucket+source_table (e.g. `bq."dataset"."table"` for BigQuery, `kbc."bucket"."table"` for Keboola). Detection is bounded — the registry round-trip only fires when DuckDB's error mentions a missing table, so happy-path queries pay no cost. Non-materialized unknowns fall through to DuckDB's raw error. 2 new tests: materialized id surfaces the hint with the bucket+source_table payload; unknown table falls back to the generic error path with no false positive on the new hint.	2026-05-01 23:09:52 +02:00
ZdenekSrotyr	bc3ba0d43d	feat(admin-api): reject register-table for source_type not configured on instance E2E sub-agent finding: instance configured with `data_source.type='bigquery'` and no `data_source.keboola.` block. Admin POSTs `{source_type: 'keboola'}` to /api/admin/register-table → returns 201, row lands in the registry, but never syncs because the scheduler has no Keboola URL/token to ATTACH against. Operator only notices the gap when `da catalog` keeps showing nothing. The new `_validate_source_type_configured` helper runs immediately after the id/view-name collision checks in `register_table`. A source_type is considered configured when: - it matches `get_data_source_type()` (the instance's primary), OR - a non-empty `data_source.<source_type>` block exists in the effective `instance.yaml` (multi-source instance), OR - it's in `_SOURCE_TYPES_INDEPENDENT_OF_DATA_SOURCE` (Jira / local — both get data through paths that don't involve `data_source.`). Returns 422 with a message that names the configured primary source and points at `/admin/server-config` for enabling a secondary one. None / empty source_type is still tolerated for backward compat with legacy CLI scripts that don't set the field — the route resolves it later. 5 new tests cover: keboola-on-bq rejected, bq-on-keboola rejected, matching source_type still works, jira allowed regardless, omitted source_type passes through. Existing tests that registered Keboola rows on the unconfigured default test instance now opt into a `keboola_instance` fixture to satisfy the new validator (tests/test_admin_bq_register.py + .keboola_materialized + .unregister_cleanup; the multi-source PUT test in test_admin_bq_register adds a `keboola` block to its synthetic config). Pre-existing test_missing_project_returns_error failure in TestRebuildFromRegistry is unrelated (config-cache leakage from a previous test in the same class) — confirmed pre-existing on the prior commit via `git stash` reproduction.	2026-05-01 23:04:51 +02:00
ZdenekSrotyr	dd46461c6c	fix(admin+orchestrator): DELETE registry drops parquet + sync_state; rebuild skips orphan parquets E2E sub-agent finding: register a materialized BQ row → sync to materialize the parquet at `/data/extracts/bigquery/data/<id>.parquet` → DELETE the registry row. The DB row goes away but: - the parquet file stays on disk forever, AND - the sync_state row stays, so `/api/sync/manifest` keeps advertising the dropped table to `da sync`, AND - the orchestrator's next rebuild can resurrect a master view by picking up the leftover parquet. Two-part fix in `unregister_table`: 1. For materialized rows on bigquery/keboola, remove `${DATA_DIR}/extracts/<source_type>/data/<name>.parquet` (and any stale `<name>.parquet.tmp` from a crashed prior materialize). Filename is keyed on `table_registry.name` to match sync_state bookkeeping. File-removal errors are logged but don't fail the DELETE — the registry row is already gone, and an orphan parquet won't get a master view at next rebuild because the orchestrator's _meta-driven scan never picks up bare parquet files. 2. Always clear `sync_state` + `sync_history` rows for the dropped table_id so the manifest stops advertising the table — applies to all source types and modes, not just materialized, since any synced row had a sync_state entry. Orchestrator-side defensive guard (Finding 2b) is a no-op in the current implementation: `_attach_and_create_views` only creates master views from `_meta` rows in each connector's `extract.duckdb`, so a parquet without a matching `_meta` entry is already invisible to the rebuild. The new test `test_orchestrator_skips_orphan_parquet_in_extracts` is kept as a regression guard for that contract. 5 tests cover: BQ + Keboola materialized DELETE removes parquet, remote DELETE doesn't error trying to remove a non-existent file, sync_state cleared on DELETE, orchestrator orphan-skip invariant.	2026-05-01 22:54:11 +02:00
ZdenekSrotyr	f0979f997a	fix(admin-api): reject backtick BQ-native source_query at register; surface materialize errors per-row E2E testing showed admin POSTs of materialized BQ rows whose source_query uses BigQuery-native backtick identifiers (`prj.ds.t`) silently no-op'd at the next sync tick — the materialize path runs the SQL through the DuckDB BQ extension's COPY which uses DuckDB's parser; backticks aren't recognized and the query either parse-errors or matches zero rows. No parquet lands at the canonical path and no error reaches an operator-visible surface. Two-part fix: 1. RegisterTableRequest's _check_mode_query_coherence model_validator now rejects any source_query containing a backtick with a 422 + actionable message pointing at the DuckDB equivalent (bq."dataset"."table"). Same check is applied in update_table on the merged record so PATCHes that flip a stored source_query to backtick form are also caught. Covers BQ AND Keboola materialized rows since both connectors funnel source_query through DuckDB's COPY. 2. _run_materialized_pass now persists per-row failures via the new SyncStateRepository.set_error / clear_error methods (existing sync_state.error / status columns — no schema migration). GET /api/admin/registry enriches each row with `last_sync_error` from a single batched SELECT against sync_state, so the admin UI / da admin status can show "this table failed last sync because: X" instead of operators having to trawl scheduler logs. Recovered rows have the error cleared automatically — update_sync's success path resets status='ok' / error=NULL on the upsert. The materialized-path test fixture's _materialized_payload helper is updated to use DuckDB-flavor SQL (the prior backtick example pre-dated the fix). 6 new tests cover register/update rejection on BQ + Keboola, the sync_state error persistence, and the registry response surface.	2026-05-01 22:51:02 +02:00
ZdenekSrotyr	7052a23552	release(0.30.0): per-connector tab UI + Keboola materialized parity + /admin/server-config full exposure Highlights (full prose in CHANGELOG.md [0.30.0]): - Smart local sync — Claude Code SessionStart/SessionEnd hooks via 'da analyst setup' + 'da sync --quiet' for hook-friendly output - query_mode='materialized' end-to-end for BigQuery + Keboola — admin SELECT (against bq.dataset.x or kbc.bucket.table) → scheduler runs through DuckDB extension → parquet → da sync distribution - /admin/tables per-connector tabs (BigQuery / Keboola / Jira), full Keboola Custom-SQL parity, form cleanup, per-row Manage access deep link - /admin/server-config known-fields registry + structured nested editor: surfaces BQ optional knobs (billing_project, legacy_wrap_views, max_bytes_per_materialize), ai.base_url, new openmetadata + desktop sections, full corporate_memory governance schema - da diagnose warns on USER_PROJECT_DENIED-prone billing_project=project config - Schema v20 — adds source_query TEXT to table_registry	2026-05-01 20:38:34 +02:00
ZdenekSrotyr	b627de8344	feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs Diagnostic + operator-facing documentation that closes the loop on the work in this PR. `da diagnose` (via /api/health/detailed): - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it. - Non-BQ instances (Keboola-only, etc.) skip the check. - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes. config/instance.yaml.example documentation: - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before) - ai.base_url: provider list + UI hint - openmetadata + desktop: 'configurable via /admin/server-config UI' headers - corporate_memory: leading note that the schema is editable via UI Other docs: - CHANGELOG.md: comprehensive Unreleased section - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention - README.md: mode-first source table summary - docs/architecture.md: per-connector tab UI mention - cli/skills/connectors.md: bootstrap rails (parallel to #154) - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines) - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone) Tests: - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)	2026-05-01 20:27:24 +02:00
Vojtech	c364f65127	fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE (#159 ) * fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which causes strict TLS stacks (rustls / webpki, used by uv, cargo, and future versions of pip) to reject the cert with 'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9. Browsers, curl, and OpenSSL-based clients tolerated the violation, hiding the bug until a uv user hit it. Affects every VM running on the self-signed fallback while the corp PKI hasn't published the real chain yet. Fix lands on the next agnes-tls-rotate.timer tick (or 'systemctl start agnes-tls-rotate.service' for an immediate refresh). Existing CSR / real-cert paths unaffected; only the bring-up fallback regenerates. * chore(release): cut 0.29.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-05-01 12:23:14 +02:00
Vojtech	bd7b8c3233	fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154 ) * fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template Closes #153. The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt) covered metrics, sync, corporate memory, and directory layout — but had ZERO mention of query_mode: "remote", da fetch, da query --remote, or --register-bq. Result: the AI analyst running in a freshly-bootstrapped workspace had no idea BigQuery-backed tables existed, no path to fetch unsynced data, and no fallback for tables not in the catalog. Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md on 2026-05-01: section confirmed missing. Workspace-level (parent-dir) CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level file (which Claude reads as primary project context) had nothing. ## Changes ### config/claude_md_template.txt (+83) Added a `## Remote Queries (BigQuery)` section covering: - Discovery first — `da catalog --json \| jq '...'` to see all tables with their query_mode, then `da schema` and `da describe` for shape. - Three query patterns: - `da fetch` (preferred) — materialize a filtered subset locally, query the snapshot, drop when done. - `da query --remote` — one-shot server-side execution (cheap probes). - `da query --register-bq` — hybrid joins between local + ad-hoc BQ. - `da fetch` estimate-first discipline — rules of thumb on --select / --where / --estimate / snapshot reuse. - BigQuery SQL flavor cheat sheet for `--where` (DATE literal, DATE_SUB, REGEXP_CONTAINS, CAST AS INT64). - Unknown-table fallback: when a table isn't in `da catalog` at all, use ad-hoc `--register-bq` if the agnes server SA has BQ access, or ask admin to register with `query_mode: "remote"` for ongoing use. - Pointer to `da skills show agnes-data-querying` for deeper guidance. ### docs/setup/claude_md_template.txt (deleted) Stale 359-line template that documented the deprecated SSH-heredoc remote_query.sh protocol. No code references it (verified via grep across .py / .sh / .yml / .md). Removing eliminates two failure modes: 1. A future refactor accidentally pulling it into a workspace and shipping deprecated guidance to analyst Claude sessions. 2. Reviewer confusion over which template is canonical. ### CHANGELOG.md `### Fixed` and `### Removed` entries under [Unreleased]. ## Tested - Manually walked the diff against `da skills show agnes-data-querying` output on a live VM (foundryai-development) — patterns + flags match the modern CLI exactly. - Re-bootstrap test deferred: requires network round-trip; pattern is identical to existing template substitution path so render is not at risk. ## Out of scope - The companion gap that data_description.md often only enumerates query_mode: "local" tables (no signal that other modes exist) — separate concern, fix likely belongs in the metadata generator on the server side, not in the analyst template. - Encouraging admins to register frequently-queried BQ tables as `query_mode: "remote"` in the registry — workflow improvement, not a code bug. * chore(release): cut 0.28.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-05-01 12:06:41 +02:00
minasarustamyan	d4ac84dd46	feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150 ) * feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu. Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass) de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje explicitní resource_grants(group, "table", id) řádek. Schema v18 → v19 (src/db.py:_v18_to_v19_finalize): - DROP TABLE dataset_permissions, access_requests - DROP COLUMN users.role (NULL artifact since v13) - DROP COLUMN table_registry.is_public - Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT → drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách s historic FK constraints. INSERT picks intersection sloupců, takže test fixtures s minimal pre-v19 schemou migrate cleanly. Runtime: - src/rbac.py:can_access_table → deleguje na app.auth.access.can_access - DatasetPermissionRepository, AccessRequestRepository smazány - AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn (TABLE je unconditionally enabled) API drop: - app/api/permissions.py, app/api/access_requests.py celé soubory - /admin/permissions web route + admin_permissions.html - "Request Access" modal v catalog.html + locked-row UI - ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut je uvnitř can_access_table) - /api/settings: drop permissions field z GET; PUT /api/settings/dataset gate přepnut na can_access(user_id, "table", dataset, conn) Auth: - app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored) - app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest (admin promotion = explicit add to Admin group via memberships API) - src/repositories/users.py: drop role z create() / update() CLI: - da admin set-role smazán → hard-fail s replacement command - da admin add-user --role flag pryč - da auth import-token --role flag pryč - da auth whoami: drop "Role:" výpis - cli/config.py:save_token: role parametr now optional, no longer written (back-compat se starými token.json soubory zachována — pole se ignoruje) Tests: - DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py - REWRITE: test_access_control.py (resource_grants flow), test_rbac.py (can_access_table over resource_grants), test_journey_rbac.py (drop access-request flow), test_resource_types.py (drop env-gate tests, drop is_public from helpers), test_v2_.py (drop role-based user dicts in favor of id-based + Admin group membership), test_settings_api.py (no permissions field, can_access gate) - TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create a 3rd positional role z create_access_token - NEW: test_v18_to_v19 migration test (test_db.py), test_can_access_table_no_implicit_public (test_rbac.py), test_admin_set_role_returns_hardfail (test_cli_admin.py) - OpenAPI snapshot regenerated Docs: - CHANGELOG: BREAKING entry pod [Unreleased] - CLAUDE.md: schema v18 → v19 - docs/architecture.md: schema table + RBAC sekce přepsána - docs/auth-google-oauth.md: admin promotion přes da admin break-glass - cli/skills/security.md: kompletně přepsáno na group-based model - docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn) Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing Windows-specific issues (fcntl, charset) nesouvisející s tímto PR — ověřeno git stash pop. Plan: ~/.claude/plans/floofy-coalescing-parnas.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(release): cut 0.27.0 --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 22:02:16 +02:00
Vojtech	2447da7bb1	refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149 ) * refactor(ops): bake all host artifacts into image, drop every curl-from-main Replaces the curl-from-main pattern (originally introduced in 0.25.0 for agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image- bundled host artifacts. Same-tag delivery for everything the host runs, version-pinned by AGNES_TAG, atomically rolled back by reverting the image. ## Motivation The customer-instance startup template was curling 6 files from raw.githubusercontent.com on every VM boot: docker-compose.yml docker-compose.prod.yml docker-compose.host-mount.yml docker-compose.tls.yml Caddyfile scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0) Every one of them already lives inside the image (`COPY . .` copies the whole repo to /app/). Curling them from the public internet duplicates content the image already carries and introduces three problems: 1. Split-brain version pinning. image_tag pins the docker image to an immutable digest. The compose files + script bypassed that pinning by tracking `main` (or the rarely-set compose_ref). A customer pinned to stable-2026.04.516 could wake up tomorrow with their host artifacts floating on whatever shipped to main overnight — even though they're explicitly pinned for stability. 2. No rollback knob. Reverting a bad host artifact meant reverting the upstream PR globally — affects every customer that reboots after the bad commit. No "rollback for me only" path; tag-pinning gave no protection. 3. Public-internet dependency on every boot. The image is already pulled from a private registry on the same boot. Reusing that channel is strictly cheaper than adding a second one. Customers with restricted egress (no raw.githubusercontent.com reachability) silently broke on every boot. ## Changes ### Dockerfile (+19 -8) After `COPY . .` and before the wheel build, an explicit `cp` lifts every host-side artifact into a stable contract path /opt/agnes-host/: agnes-auto-upgrade.sh (mode 0755 — host cron driver) docker-compose.{yml,prod,host-mount,tls}.yml Caddyfile (mode 0644) Why a copy instead of pointing at /app directly: /app is owned by uid 999 (USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board, stable path that won't shift if /app structure refactors. ### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36) Replaced six curls and the standalone agnes-auto-upgrade.sh extract block (introduced earlier in this PR) with one extract sequence in section 3: docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}" EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}") trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 \|\| true" EXIT docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/" docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh chmod +x /usr/local/bin/agnes-auto-upgrade.sh The auto-upgrade section (#6) is now a no-op — script is already in place. ### infra/modules/customer-instance/variables.tf (+1 -1) `compose_ref` marked DEPRECATED in description. Default unchanged for one release cycle to avoid breaking existing terraform plans. Will be removed in a future major bump. ### CHANGELOG.md `### Changed` entry under [Unreleased] — supersedes the narrower entry this PR previously had (which only covered the script). ## Out of scope (filed as follow-ups) 1. agnes-the-ai-analyst-infra/startup.sh (operator deploy) still curls the same artifacts from main. Symmetric fix needed there. Will file as a separate PR against the infra repo. 2. Self-update inside agnes-auto-upgrade.sh after a successful `docker compose pull` of a new digest. Otherwise the running cron keeps using the OLD baked-in script for one tick after image upgrade. ~10 LOC. Deferred to keep this PR scoped. 3. scripts/ops/agnes-tls-rotate.sh has the same shape — host-side bash currently sourced via the infra repo. Should follow the same bake-into-image pattern. ## Tested - Local: `docker build .` succeeds with the new RUN block. - `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6 artifacts; sha matches each source file. - Not yet tested on a live VM bring-up — that requires a CI image with this Dockerfile change. Recommend reviewer trigger CI build, then do a single VM-recreate against a dev VM (e.g. foundryai-development) to confirm the extract path works end-to-end before merge. ## Compatibility - Existing VMs running 0.25.0 are unaffected — they have host artifacts in place from `curl from main` already; this PR doesn't touch them. They pick up the new pattern only on next VM recreate. - VMs pinned to an image_tag older than this PR (no /opt/agnes-host in the image) would FAIL the docker cp. Current diff fails-loud (no fallback). Recommend operators upgrade to a fresh-enough image_tag alongside the template upgrade — same coupling as any compose-flag bump. * docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances The new startup script extracts host artifacts from /opt/agnes-host/ inside the image — a directory added in this PR (will ship as v0.26.0). Pinning image_tag to an older tag would fail-loud at first boot with 'docker cp: No such file or directory'. Existing VMs are unaffected because the module ignores metadata_startup_script changes. Devin ANALYSIS_0004 on PR #149. * fix(changelog): mark BREAKING + drop private-repo reference Per CLAUDE.md, breaking changes start with BREAKING so operators can grep before bumping the pin. The image_tag minimum constraint introduced here qualifies — older tags fail-loud at first boot. Also drop the explicit 'agnes-the-ai-analyst-infra' name from the entry; the OSS distribution shouldn't reference operator-side deploy templates by their private-repo names. Generic 'consumer- side deploy templates' wording instead. Devin BUG_0001 + WARN_0001 on PR #149. * chore(release): cut 0.26.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 21:40:25 +02:00

1 2

95 commits