agnes-the-ai-analyst

Author	SHA1	Message	Date
ZdenekSrotyr	7052a23552	release(0.30.0): per-connector tab UI + Keboola materialized parity + /admin/server-config full exposure Highlights (full prose in CHANGELOG.md [0.30.0]): - Smart local sync — Claude Code SessionStart/SessionEnd hooks via 'da analyst setup' + 'da sync --quiet' for hook-friendly output - query_mode='materialized' end-to-end for BigQuery + Keboola — admin SELECT (against bq.dataset.x or kbc.bucket.table) → scheduler runs through DuckDB extension → parquet → da sync distribution - /admin/tables per-connector tabs (BigQuery / Keboola / Jira), full Keboola Custom-SQL parity, form cleanup, per-row Manage access deep link - /admin/server-config known-fields registry + structured nested editor: surfaces BQ optional knobs (billing_project, legacy_wrap_views, max_bytes_per_materialize), ai.base_url, new openmetadata + desktop sections, full corporate_memory governance schema - da diagnose warns on USER_PROJECT_DENIED-prone billing_project=project config - Schema v20 — adds source_query TEXT to table_registry	2026-05-01 20:38:34 +02:00
ZdenekSrotyr	b627de8344	feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs Diagnostic + operator-facing documentation that closes the loop on the work in this PR. `da diagnose` (via /api/health/detailed): - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it. - Non-BQ instances (Keboola-only, etc.) skip the check. - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes. config/instance.yaml.example documentation: - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before) - ai.base_url: provider list + UI hint - openmetadata + desktop: 'configurable via /admin/server-config UI' headers - corporate_memory: leading note that the schema is editable via UI Other docs: - CHANGELOG.md: comprehensive Unreleased section - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention - README.md: mode-first source table summary - docs/architecture.md: per-connector tab UI mention - cli/skills/connectors.md: bootstrap rails (parallel to #154) - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines) - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone) Tests: - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)	2026-05-01 20:27:24 +02:00
ZdenekSrotyr	df7f5b1d9a	feat(admin-ui): /admin/server-config known-fields registry + structured nested editor Today /admin/server-config renders fields by iterating Object.keys(payload) on the YAML value — if a key isn't in instance.yaml, the operator can't see it. They have to know to type it via the JSON-patch textarea (which only renders for empty sections) or SSH and edit YAML. Adds a known-fields registry (`_KNOWN_FIELDS` in app/api/admin.py) the UI consumes alongside the YAML payload. Renderer shows BOTH: - existing fields (from YAML) with current value - known-but-unset fields with dashed-border placeholder + hint, ready to fill in Renderer (`renderField`, `renderSection`, `collectSection`): - kind="string"|"secret"|"bool"|"int"|"select"|"object"|"array"|"map" — picks input type - kind="object" with `fields` — recursive structured form, arbitrary depth (corporate_memory needs 3-4 levels) - kind="array" with `item_kind` — vertical stack of typed inputs + add/remove buttons - kind="map" with `key_kind` + `value_kind` — key:value rows + add/remove (used for confidence.base, domain_owners, entity_resolution.entities) - data-path encoded as JSON segment array so map keys with embedded dots (e.g. 
'user_verification.correction') survive collect → patch round-trip - .cfg-field.is-unset CSS — dashed border, muted label, italic hint Sections newly exposed (added to _EDITABLE_SECTIONS): - openmetadata: url, token (secret), cache_ttl_seconds, verify_ssl - desktop: jwt_issuer, jwt_secret (secret), url_scheme Known fields populated for existing sections: - data_source.bigquery: billing_project (the cause of the 403 USER_PROJECT_DENIED footgun when SA can read but not bill the data project), legacy_wrap_views (bigquery_query() wrap for VIEWs — issue #101 default off, ON for view-heavy deployments), max_bytes_per_materialize (cost guardrail) - data_source.keboola: stack_url, project_id (hints; values already populated) - ai: base_url (required for openai_compat), structured_output (select) - corporate_memory: full schema from instance.yaml.example — distribution_mode, approval_mode, review_period_months, notify_on_new_items, sources.{claude_local_md,session_transcripts}, extraction.{model,sensitivity_check,contradiction_check}, confidence.{base,modifiers,decay.{mode,half_life_months,decay_rate_monthly,floor}}, contradiction_detection.{enabled,max_candidates}, entity_resolution.{enabled,entities}, domain_owners, domains - Known partial: confidence.modifiers is map<string, map<string, float>> — falls through to JSON-textarea with TODO; structured editor for that one shape needs more renderer work Tests: - test_admin_server_config_known_fields — registry envelope shape, smoke fixture - test_admin_server_config_renderer_depth — 4-level nested objects, arrays of strings, maps of floats, dotted-key safety - test_admin_server_config_corp_memory — full corporate_memory schema, 12 fields incl. nested - test_admin_server_config — existing tests adjusted for new shape	2026-05-01 20:27:01 +02:00
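The data-path note in the df7f5b1d9a entry can be illustrated with a minimal Python sketch. The helper names (`encode_path`, `decode_path`, `set_by_path`), the nesting of the dotted key under `confidence.base`, and the value `0.9` are assumptions made for illustration; the actual renderer lives in the admin UI and may differ.

```python
import json


def encode_path(segments):
    """Serialize path segments as a JSON array (hypothetical helper)."""
    return json.dumps(segments)


def decode_path(encoded):
    """Recover the exact segment list from the encoded attribute value."""
    return json.loads(encoded)


def set_by_path(target, segments, value):
    """Write value into a nested dict, creating intermediate dicts as needed."""
    node = target
    for key in segments[:-1]:
        node = node.setdefault(key, {})
    node[segments[-1]] = value
    return target


# A map key with an embedded dot, as in the commit's example.
segments = ["corporate_memory", "confidence", "base", "user_verification.correction"]
patch = set_by_path({}, decode_path(encode_path(segments)), 0.9)

# For contrast: a dot-joined path splits the last key into two segments.
naive = ".".join(segments).split(".")
```

With the JSON array form the dotted key arrives intact in the patch, whereas the dot-joined form yields five segments instead of four.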
ZdenekSrotyr	c63f54d643	feat(admin-ui): /admin/tables per-connector tabs + Keboola materialized parity + form cleanup + Manage access deep link Replaces the single mixed Jinja-branched form at /admin/tables with a per-connector tab interface and brings Keboola to capability parity with BigQuery. Tab structure: - BigQuery tab: Register modal with two-question radio model (Q1 Live \| Synced × Q2 Whole \| Custom SQL), Discover datasets / List tables / Use-table-as-base autocomplete buttons, table-vs-view auto-detection hint, per-tab listing filter - Keboola tab: same two-question radio (Q2 only — no Live mode for Keboola), Custom SQL textarea against kbc."bucket"."table" for materialized rows - Jira tab: read-only listing (Jira is webhook-driven; no Register form) - Active tab persists in window.location.hash so refresh keeps the operator in place Form cleanup (within tabs): - Drops the misleading 'Sync Strategy' dropdown — runtime never read it (only profiler.is_partitioned() consumes the value for parquet-layout detection); kept in DB for back-compat (Pydantic deprecated) - Adds Sync Schedule input to Keboola Register/Edit (was missing — scheduler honored per-table cron via is_table_due() for every source but the Keboola UI had no surface) - Hides Primary Key under <details>Advanced with clarifying hint that it's catalog-metadata only (Agnes does not perform upsert/dedup; every sync is a full overwrite) - Drops the Strategy column from the registry listing (every Keboola row defaulted to full_refresh after Strategy was hidden — column was noise) - Removes the legacy out-of-tab #registerModal + the legacy global Discovery panel; each tab now owns its own header + Register button + listing div Edit modal: - BigQuery Edit modal physically relocated into <section id="tab-content-bigquery"> (mirrors Phase E Register placement) - Keboola Edit modal mirrors Register (same Q2 radio, Discover/List buttons via parameterized helpers) - openEditModal(table) dispatches by 
source_type to the right modal — fixes a quiet bug where Phase F's openEditKeboolaModal was never wired up and Keboola edits silently used the legacy modal Per-row Manage access deep link: - Each row in the per-tab listing has a lock-icon button between Edit and Delete that navigates to /admin/access#table:<table_id> - admin_access.html bootstrap reads window.location.hash and pre-fills the resource filter, mirroring the existing ?group=<id> deep-link pattern Tests: - test_admin_tables_tab_ui.py — tab nav, hash persistence, register-button-per-tab, listing partition by source_type, Manage access deep link - test_admin_tables_ui_materialized.py — two-question radio (BQ + Keboola), Discover/List/Use-as-base buttons, Edit modal parity, Jira read-only	2026-05-01 20:26:29 +02:00
ZdenekSrotyr	85d3810535	feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors. Backend (BigQuery + Keboola, schema v20): - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19) - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, , bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload Operator + CLI surface: - da admin register-table --query / --query-mode materialized - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators Tests (incl. 
spike + integration + regression): - test_db_migration_v20, test_table_registry_source_query - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_ creds) - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized - test_api_admin_materialized, test_cli_admin_materialized - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.	2026-05-01 20:25:56 +02:00
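The dry-run cost guardrail mentioned in the 85d3810535 entry can be sketched as a pure comparison, leaving the BigQuery dry-run call itself out. Only the 10 GiB default comes from the commit message; `check_dry_run_cost`, `CostGuardrailError`, and the strict `>` comparison are hypothetical and not taken from the repository.

```python
GIB = 1024 ** 3
DEFAULT_MAX_BYTES = 10 * GIB  # default stated in the commit message


class CostGuardrailError(RuntimeError):
    """Hypothetical error type for a query that would scan too much data."""


def check_dry_run_cost(estimated_bytes, max_bytes=DEFAULT_MAX_BYTES):
    """Refuse to materialize when the dry-run estimate exceeds the limit.

    estimated_bytes is assumed to come from a prior dry run of the query.
    """
    if estimated_bytes > max_bytes:
        raise CostGuardrailError(
            f"query would scan {estimated_bytes} bytes, limit is {max_bytes}"
        )
    return estimated_bytes
```

An operator-configured `max_bytes_per_materialize` would be passed as `max_bytes`; whether the real check rejects an estimate exactly at the limit is not stated in the source.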
ZdenekSrotyr	d0b7e122d6	feat(cli): smart local sync — Claude Code SessionStart/SessionEnd hooks + da sync --quiet The analyst flow becomes a closed loop with the server-curated table catalog: - `da analyst setup` writes `<workspace>/.claude/settings.json` with two hooks: SessionStart → `da sync --quiet || true` — pulls fresh RBAC-filtered parquets at session start SessionEnd → `da sync --upload-only --quiet || true` — uploads session jsonl + CLAUDE.local.md - `|| true` keeps Claude Code unblocked when the server is down. - Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace. - `da sync --quiet` rewrites the CLI output for hook consumption — 0 stdout on success, single-line error on failure. - Existing settings.json is patched (deep-merged), not overwritten; malformed JSON is reported, not silently overwritten. Tests cover: workspace bootstrap, hook insertion, malformed-json safety, quiet-mode output shape.	2026-05-01 20:25:27 +02:00
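The patch-not-overwrite behaviour described in the d0b7e122d6 entry can be sketched roughly as below. The function names and the exact shape of the hook entries are assumptions; only the two commands and the deep-merge / malformed-JSON behaviour come from the commit message.

```python
import json
import tempfile
from pathlib import Path

# Assumed hook layout; the two commands are the ones quoted in the commit message.
HOOKS = {
    "hooks": {
        "SessionStart": [
            {"hooks": [{"type": "command", "command": "da sync --quiet || true"}]}
        ],
        "SessionEnd": [
            {"hooks": [{"type": "command", "command": "da sync --upload-only --quiet || true"}]}
        ],
    }
}


def deep_merge(base, patch):
    """Return base with patch merged in; nested dicts merge, other values are replaced."""
    merged = dict(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def patch_settings(path, patch):
    """Deep-merge patch into the JSON file at path; never overwrite malformed JSON."""
    existing = {}
    if path.exists():
        try:
            existing = json.loads(path.read_text())
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path}: malformed JSON, leaving file untouched") from exc
        if not isinstance(existing, dict):
            raise ValueError(f"{path}: expected a JSON object at the top level")
    merged = deep_merge(existing, patch)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged


# Demo in a throwaway workspace: an unrelated existing key survives the patch.
with tempfile.TemporaryDirectory() as workspace:
    settings = Path(workspace) / ".claude" / "settings.json"
    settings.parent.mkdir(parents=True)
    settings.write_text('{"theme": "dark"}')
    result = patch_settings(settings, HOOKS)
    on_disk = json.loads(settings.read_text())
```

This sketch replaces list values wholesale, so a pre-existing `SessionStart` list would be overwritten; a real implementation would likely need to append to such lists instead.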
Vojtech	c364f65127	fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE (#159) * fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which causes strict TLS stacks (rustls / webpki, used by uv, cargo, and future versions of pip) to reject the cert with 'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9. Browsers, curl, and OpenSSL-based clients tolerated the violation, hiding the bug until a uv user hit it. Affects every VM running on the self-signed fallback while the corp PKI hasn't published the real chain yet. Fix lands on the next agnes-tls-rotate.timer tick (or 'systemctl start agnes-tls-rotate.service' for an immediate refresh). Existing CSR / real-cert paths unaffected; only the bring-up fallback regenerates. * chore(release): cut 0.29.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-05-01 12:23:14 +02:00
Vojtech	bd7b8c3233	fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154 ) * fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template Closes #153. The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt) covered metrics, sync, corporate memory, and directory layout — but had ZERO mention of query_mode: "remote", da fetch, da query --remote, or --register-bq. Result: the AI analyst running in a freshly-bootstrapped workspace had no idea BigQuery-backed tables existed, no path to fetch unsynced data, and no fallback for tables not in the catalog. Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md on 2026-05-01: section confirmed missing. Workspace-level (parent-dir) CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level file (which Claude reads as primary project context) had nothing. ## Changes ### config/claude_md_template.txt (+83) Added a `## Remote Queries (BigQuery)` section covering: - Discovery first — `da catalog --json \| jq '...'` to see all tables with their query_mode, then `da schema` and `da describe` for shape. - Three query patterns: - `da fetch` (preferred) — materialize a filtered subset locally, query the snapshot, drop when done. - `da query --remote` — one-shot server-side execution (cheap probes). - `da query --register-bq` — hybrid joins between local + ad-hoc BQ. - `da fetch` estimate-first discipline — rules of thumb on --select / --where / --estimate / snapshot reuse. - BigQuery SQL flavor cheat sheet for `--where` (DATE literal, DATE_SUB, REGEXP_CONTAINS, CAST AS INT64). - Unknown-table fallback: when a table isn't in `da catalog` at all, use ad-hoc `--register-bq` if the agnes server SA has BQ access, or ask admin to register with `query_mode: "remote"` for ongoing use. - Pointer to `da skills show agnes-data-querying` for deeper guidance. 
### docs/setup/claude_md_template.txt (deleted) Stale 359-line template that documented the deprecated SSH-heredoc remote_query.sh protocol. No code references it (verified via grep across .py / .sh / .yml / .md). Removing eliminates two failure modes: 1. A future refactor accidentally pulling it into a workspace and shipping deprecated guidance to analyst Claude sessions. 2. Reviewer confusion over which template is canonical. ### CHANGELOG.md `### Fixed` and `### Removed` entries under [Unreleased]. ## Tested - Manually walked the diff against `da skills show agnes-data-querying` output on a live VM (foundryai-development) — patterns + flags match the modern CLI exactly. - Re-bootstrap test deferred: requires network round-trip; pattern is identical to existing template substitution path so render is not at risk. ## Out of scope - The companion gap that data_description.md often only enumerates query_mode: "local" tables (no signal that other modes exist) — separate concern, fix likely belongs in the metadata generator on the server side, not in the analyst template. - Encouraging admins to register frequently-queried BQ tables as `query_mode: "remote"` in the registry — workflow improvement, not a code bug. * chore(release): cut 0.28.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-05-01 12:06:41 +02:00
minasarustamyan	d4ac84dd46	feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150) * feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration BREAKING. Unifies the data RBAC layer into the per-group resource_grants model. Before this PR the legacy data RBAC layer (dataset_permissions + is_public bypass) was de facto inactive: is_public had no API/UI/CLI surface, and its default of true meant can_access_table always bypassed the check. Now every non-admin access requires an explicit resource_grants(group, "table", id) row. Schema v18 → v19 (src/db.py:_v18_to_v19_finalize): - DROP TABLE dataset_permissions, access_requests - DROP COLUMN users.role (NULL artifact since v13) - DROP COLUMN table_registry.is_public - Drops go through the table-rebuild idiom (rename → create new → INSERT … SELECT → drop old) because of DuckDB ALTER DROP COLUMN limitations on tables with historic FK constraints. INSERT picks the intersection of columns, so test fixtures with a minimal pre-v19 schema migrate cleanly. 
Runtime: - src/rbac.py:can_access_table → delegates to app.auth.access.can_access - DatasetPermissionRepository, AccessRequestRepository deleted - AGNES_ENABLE_TABLE_GRANTS env-gate in app/resource_types.py removed (TABLE is unconditionally enabled) API drop: - app/api/permissions.py, app/api/access_requests.py (entire files) - /admin/permissions web route + admin_permissions.html - "Request Access" modal in catalog.html + locked-row UI - ~10 if user.get("role") != "admin" checks replaced (the admin shortcut is inside can_access_table) - /api/settings: drop permissions field from GET; PUT /api/settings/dataset gate switched to can_access(user_id, "table", dataset, conn) Auth: - app/auth/jwt.py:create_access_token: drop role parameter (the claim disappears from newly issued JWTs; old tokens stay valid, claim ignored) - app/api/users.py: drop role from CreateUserRequest / UpdateUserRequest (admin promotion = explicit add to Admin group via memberships API) - src/repositories/users.py: drop role from create() / update() CLI: - da admin set-role deleted → hard-fail with replacement command - da admin add-user --role flag removed - da auth import-token --role flag removed - da auth whoami: drop "Role:" output - cli/config.py:save_token: role parameter now optional, no longer written (back-compat with old token.json files preserved; the field is ignored) Tests: - DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py - REWRITE: test_access_control.py (resource_grants flow), test_rbac.py (can_access_table over resource_grants), test_journey_rbac.py (drop access-request flow), test_resource_types.py (drop env-gate tests, drop is_public from helpers), test_v2_.py (drop role-based user dicts in favor of id-based + Admin group membership), test_settings_api.py (no permissions field, can_access gate) - TRIVIAL: ~30 files: drop role="admin" arg from UserRepository.create and 3rd positional role from create_access_token - NEW: test_v18_to_v19 migration test (test_db.py), 
test_can_access_table_no_implicit_public (test_rbac.py), test_admin_set_role_returns_hardfail (test_cli_admin.py) - OpenAPI snapshot regenerated Docs: - CHANGELOG: BREAKING entry under [Unreleased] - CLAUDE.md: schema v18 → v19 - docs/architecture.md: schema table + RBAC section rewritten - docs/auth-google-oauth.md: admin promotion via da admin break-glass - cli/skills/security.md: completely rewritten for the group-based model - docs/TODO-rbac-data-enforcement.md: deleted (TODO done) Test results: 2363 passed, 19 failed. The remaining failures are pre-existing Windows-specific issues (fcntl, charset) unrelated to this PR, verified with git stash pop. Plan: ~/.claude/plans/floofy-coalescing-parnas.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> chore(release): cut 0.27.0 --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 22:02:16 +02:00
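The table-rebuild idiom named in the d4ac84dd46 entry (rename, create new, INSERT … SELECT over the column intersection, drop old) can be sketched as follows. The migration in the source targets DuckDB; this sketch swaps in Python's stdlib `sqlite3` so it runs without extra dependencies, and the helper name and the column list are illustrative assumptions rather than the real schema.

```python
import sqlite3


def rebuild_table(conn, table, new_ddl, new_columns):
    """Rebuild `table` under a new definition, copying only the shared columns."""
    old = f"{table}__old"
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(new_ddl)
    # Column names of the old table; row[1] is the name in PRAGMA table_info.
    old_columns = {row[1] for row in conn.execute(f"PRAGMA table_info({old})")}
    # Intersection, in the new table's column order.
    shared = ", ".join(c for c in new_columns if c in old_columns)
    conn.execute(f"INSERT INTO {table} ({shared}) SELECT {shared} FROM {old}")
    conn.execute(f"DROP TABLE {old}")


conn = sqlite3.connect(":memory:")
# Illustrative pre-migration shape: a minimal fixture that still has is_public.
conn.execute("CREATE TABLE table_registry (id TEXT PRIMARY KEY, name TEXT, is_public BOOLEAN)")
conn.execute("INSERT INTO table_registry VALUES ('t1', 'orders', 1)")

rebuild_table(
    conn,
    "table_registry",
    "CREATE TABLE table_registry (id TEXT PRIMARY KEY, name TEXT, description TEXT)",
    ["id", "name", "description"],
)

rows = conn.execute("SELECT id, name, description FROM table_registry").fetchall()
columns = [row[1] for row in conn.execute("PRAGMA table_info(table_registry)")]
```

Because only the intersection is copied, a fixture that lacks `description` still migrates: the dropped column disappears and the missing one is left NULL.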
Vojtech	2447da7bb1	refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149 ) * refactor(ops): bake all host artifacts into image, drop every curl-from-main Replaces the curl-from-main pattern (originally introduced in 0.25.0 for agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image- bundled host artifacts. Same-tag delivery for everything the host runs, version-pinned by AGNES_TAG, atomically rolled back by reverting the image. ## Motivation The customer-instance startup template was curling 6 files from raw.githubusercontent.com on every VM boot: docker-compose.yml docker-compose.prod.yml docker-compose.host-mount.yml docker-compose.tls.yml Caddyfile scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0) Every one of them already lives inside the image (`COPY . .` copies the whole repo to /app/). Curling them from the public internet duplicates content the image already carries and introduces three problems: 1. Split-brain version pinning. image_tag pins the docker image to an immutable digest. The compose files + script bypassed that pinning by tracking `main` (or the rarely-set compose_ref). A customer pinned to stable-2026.04.516 could wake up tomorrow with their host artifacts floating on whatever shipped to main overnight — even though they're explicitly pinned for stability. 2. No rollback knob. Reverting a bad host artifact meant reverting the upstream PR globally — affects every customer that reboots after the bad commit. No "rollback for me only" path; tag-pinning gave no protection. 3. Public-internet dependency on every boot. The image is already pulled from a private registry on the same boot. Reusing that channel is strictly cheaper than adding a second one. Customers with restricted egress (no raw.githubusercontent.com reachability) silently broke on every boot. ## Changes ### Dockerfile (+19 -8) After `COPY . 
.` and before the wheel build, an explicit `cp` lifts every host-side artifact into a stable contract path /opt/agnes-host/: agnes-auto-upgrade.sh (mode 0755 — host cron driver) docker-compose.{yml,prod,host-mount,tls}.yml Caddyfile (mode 0644) Why a copy instead of pointing at /app directly: /app is owned by uid 999 (USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board, stable path that won't shift if /app structure refactors. ### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36) Replaced six curls and the standalone agnes-auto-upgrade.sh extract block (introduced earlier in this PR) with one extract sequence in section 3: docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}" EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}") trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 \|\| true" EXIT docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/" docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh chmod +x /usr/local/bin/agnes-auto-upgrade.sh The auto-upgrade section (#6) is now a no-op — script is already in place. ### infra/modules/customer-instance/variables.tf (+1 -1) `compose_ref` marked DEPRECATED in description. Default unchanged for one release cycle to avoid breaking existing terraform plans. Will be removed in a future major bump. ### CHANGELOG.md `### Changed` entry under [Unreleased] — supersedes the narrower entry this PR previously had (which only covered the script). ## Out of scope (filed as follow-ups) 1. agnes-the-ai-analyst-infra/startup.sh (operator deploy) still curls the same artifacts from main. Symmetric fix needed there. Will file as a separate PR against the infra repo. 2. Self-update inside agnes-auto-upgrade.sh after a successful `docker compose pull` of a new digest. Otherwise the running cron keeps using the OLD baked-in script for one tick after image upgrade. ~10 LOC. Deferred to keep this PR scoped. 3. 
scripts/ops/agnes-tls-rotate.sh has the same shape — host-side bash currently sourced via the infra repo. Should follow the same bake-into-image pattern. ## Tested - Local: `docker build .` succeeds with the new RUN block. - `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6 artifacts; sha matches each source file. - Not yet tested on a live VM bring-up — that requires a CI image with this Dockerfile change. Recommend reviewer trigger CI build, then do a single VM-recreate against a dev VM (e.g. foundryai-development) to confirm the extract path works end-to-end before merge. ## Compatibility - Existing VMs running 0.25.0 are unaffected — they have host artifacts in place from `curl from main` already; this PR doesn't touch them. They pick up the new pattern only on next VM recreate. - VMs pinned to an image_tag older than this PR (no /opt/agnes-host in the image) would FAIL the docker cp. Current diff fails-loud (no fallback). Recommend operators upgrade to a fresh-enough image_tag alongside the template upgrade — same coupling as any compose-flag bump. * docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances The new startup script extracts host artifacts from /opt/agnes-host/ inside the image — a directory added in this PR (will ship as v0.26.0). Pinning image_tag to an older tag would fail-loud at first boot with 'docker cp: No such file or directory'. Existing VMs are unaffected because the module ignores metadata_startup_script changes. Devin ANALYSIS_0004 on PR #149. * fix(changelog): mark BREAKING + drop private-repo reference Per CLAUDE.md, breaking changes start with BREAKING so operators can grep before bumping the pin. The image_tag minimum constraint introduced here qualifies — older tags fail-loud at first boot. Also drop the explicit 'agnes-the-ai-analyst-infra' name from the entry; the OSS distribution shouldn't reference operator-side deploy templates by their private-repo names. 
Generic 'consumer- side deploy templates' wording instead. Devin BUG_0001 + WARN_0001 on PR #149. * chore(release): cut 0.26.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 21:40:25 +02:00
Vojtech	ddffdfeafd	fix(ops): fail-fast guard in agnes-auto-upgrade — refuse start if config disk not mounted (#146 ) * fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident: foundryai-development 2026-04-30, marketplaces / DuckDB / session secret written to /data (sdb) instead of the config disk (sdc), wiped on next container recreate. ## Why an app-side guard agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is not on the config disk (because of the propagation regression fixed by the infra PR, or the boot-time udev race fixed by infra #58, or any future mount-loss path), this script previously ran `docker compose up -d` anyway — and the app silently wrote state onto the wrong disk. Next recreate, that state was gone. The boot-time fixes in infra are preventive. This is the runtime backstop. ## Behavior Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk exists on the VM: 1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s). - Mount the config disk if /data/state is not a mountpoint. - Detect mismatch: if /data/state is mounted from the wrong source, umount and retry. 2. After the loop, assert findmnt source matches the config disk. - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd marks the service failed; no docker compose action runs; existing containers (if any) keep running on stale state, but no new write lands on the wrong disk. 3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state` on every run. Idempotent. Guards against propagation regressions sneaking back in via future docker / kernel changes. VMs without a config disk (foundryai-poc, single-disk legacy) skip the whole block — the `if [ -e $CONFIG_DEVICE ]` guard. 
## Tested Patched script installed on foundryai-development as a hotfix; manual run post-migration was a no-op (digest unchanged); /data/state stayed on sdc across a full `docker compose down + up -d` cycle. ## Rollout - This file is fetched by infra startup.sh from raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every boot. Once merged to main, all VMs pick up the new script on their next boot — no infra recreate needed. - For immediate rollout to running VMs without waiting for next boot: `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ && ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh /usr/local/bin/agnes-auto-upgrade.sh` (already done on foundryai-development). * chore: vendor-agnostic comment + changelog text Drop customer-specific VM names from the script comment and CHANGELOG entry. The OSS distribution should not name a particular operator's hosts; the technical description already conveys why the guard exists. * fix(ops): suppress mount stderr in retry loop Match the rest of the script's error-tolerant idiom (2>/dev/null). Mount failures in the cold-boot udev race the loop is designed to handle gracefully should not flow to stdout — cron would mail on every transient retry. Devin BUG_0001 on PR #146. * fix(changelog): move auto-upgrade entry to [Unreleased] Entry landed under v0.20.0 because that section was [Unreleased] when this branch first opened — releases v0.21–v0.24 cut in the meantime stranded it inside an already-released section. Move it back where new entries belong. Devin BUG_0001 on PR #146. * fix(infra): single-source agnes-auto-upgrade.sh via curl from main Replace the inline heredoc copy of the auto-upgrade script in the customer-instance Terraform startup template with a curl fetch from raw.githubusercontent.com on every boot. 
The inline copy had drifted several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh (missing TLS overlay detection, array-form COMPOSE_FILES, and now the config-disk fail-fast guard from this PR). Devin ANALYSIS_0001 on PR #146. * fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling The canonical agnes-auto-upgrade.sh from main detects TLS at runtime via cert files on disk, regardless of the TLS_MODE Terraform variable. Certs can appear after boot via agnes-tls-rotate.sh or manual provisioning, and the cron job would then fail every 5 min under 'set -euo pipefail' because docker-compose.tls.yml was never fetched. Also document the main-vs-COMPOSE_REF coupling: when the canonical script references a new compose file, the fetch list above must be updated to match — pinned-ref VMs would otherwise break. Devin BUG_0001 + ANALYSIS_0001 on PR #146. * fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing Caddyfile fetch now matches docker-compose.tls.yml: unconditional in startup-script.sh.tpl. Without it, Docker would auto-create an empty directory at the bind-mount target and Caddy would crash-loop while the tls overlay has already closed :8000 — making the app unreachable on any non-caddy VM where certs land via rotate or manual provisioning. Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile to exist (size > 0) before activating the tls profile, with a WARN log if it's missing. Belt-and-suspenders so the failure mode is contained even when the script is deployed by some other path (not just the customer-instance TF module). Devin BUG_0001 on PR #146. * chore(release): cut 0.25.0 --------- Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>	2026-04-30 20:07:22 +02:00
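The mount-and-verify loop described under "Behavior" in the ddffdfeafd entry is a shell script in the source; the Python sketch below only mirrors its control flow. The function name and the injected callables (`current_source`, `mount`, `umount`) are hypothetical stand-ins for findmnt, mount, and umount, and whether the real script sleeps after the final attempt is not stated.

```python
import time


def mount_and_verify(expected_source, current_source, mount, umount,
                     attempts=3, sleep=time.sleep):
    """Try to get the state dir mounted from expected_source; return True on success."""
    for attempt in range(1, attempts + 1):
        source = current_source()
        if source == expected_source:
            return True
        if source is not None:
            umount()  # mounted from the wrong source: unmount before retrying
        mount()
        if current_source() == expected_source:
            return True
        sleep(2 * attempt)  # backoff of 2s, 4s, 6s
    # Final assertion; a caller would log FATAL and exit 1 when this is False.
    return current_source() == expected_source


# Demo with fakes: starts on the wrong disk, first mount fails, second succeeds.
state = {"source": "/dev/sdb", "mount_calls": 0}


def fake_current():
    return state["source"]


def fake_umount():
    state["source"] = None


def fake_mount():
    state["mount_calls"] += 1
    if state["mount_calls"] >= 2:
        state["source"] = "/dev/sdc"


delays = []
ok = mount_and_verify("/dev/sdc", fake_current, fake_mount, fake_umount,
                      sleep=delays.append)
```

Injecting `sleep` and the mount operations keeps the loop testable without root privileges or a real block device.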
minasarustamyan	fb1573766a	feat(admin): users/groups UI polish + SSO lock + v18 migration (#142) Cuts release 0.24.0. ## Highlights - SSO-managed accounts read-only for password / delete operations (UI + API). New `is_sso_user` flag derived from group memberships. - Admin/Everyone system rows show `google_sync` chip + Workspace email subtitle when env-mapped. - Origin pill vocabulary unified across `/admin/groups`, `/admin/access`, `/admin/users`, `/admin/users/{id}`, `/profile` (Admin yellow, Everyone gray, google_sync green, custom purple). - Effective-access readout no longer short-circuits for admin users; it always renders the per-resource breakdown. - Schema migration v18 drops stranded non-google memberships in env-mapped Admin/Everyone groups (cleans up v13's blanket Everyone backfill). ## Devin findings addressed - _is_sso_user requires source='google_sync' on system-group branches (so v13 system_seed memberships in env-mapped Everyone don't lock out the admin). - POST add-to-group returns correct origin via _derive_origin (matching GET). - 8 customer-specific token instances (groupon.com / foundryai) replaced with vendor-neutral placeholders across templates, tests, and CHANGELOG. - deriveDisplayName name-skip for canonical "Admin"/"Everyone" so an overlapping AGNES_GOOGLE_GROUP_PREFIX doesn't mangle the chip text. See CHANGELOG [0.24.0] for full notes.	2026-04-30 15:16:04 +02:00
ZdenekSrotyr	f3d252f17d	fix(tls-rotate): chown CERT_DIR to UID 999 so the app container can read its own certs (#143) The script's `mkdir -p` left ownership of `/data/state/certs/` to whichever process won the create race: root when systemd's timer fired before the app container's first volume init, UID 999 when the container ran first. With mode 700, a root-owned dir blocks the UID-999 agnes container from reading its own fullchain.pem; `_read_agnes_ca_pem()` returns None, and the cross-platform TLS trust block (Step 0 from PR #137) silently disappears from the /install setup prompt. Operators on the unlucky-race VMs got a setup prompt that couldn't bootstrap client trust against the self-signed host. Existing VMs self-heal on next timer tick.	2026-04-30 13:21:59 +02:00
ZdenekSrotyr	70672204fe	feat(memory): admin Edit + MEMORY_DOMAIN RBAC + ai-section UI (#141) Cuts release 0.23.0. ## Highlights - Single-item Edit button on every memory item card (modal hits PATCH /api/memory/admin/{id}). - MEMORY_DOMAIN RBAC resource type: admins grant user_groups access to specific domains via /admin/access. Composes with existing audience filter (OR semantics, no-op when no grants). - ai: section editable in /admin/server-config: admins can set ANTHROPIC_API_KEY / model / provider / base_url for the corporate-memory extractor without editing instance.yaml directly. api_key auto-masked. ## Devin findings addressed - Modal NULL→empty fix (audience visibility wouldn't break). - Stats endpoint granted_domains parity with list endpoint. - Documented intentional MEMORY_DOMAIN→audience bypass. - Documented conscious ai.base_url SSRF exclusion (legit internal LiteLLM/vLLM proxies). See CHANGELOG [0.23.0] for full notes.	2026-04-30 11:04:41 +02:00
ZdenekSrotyr	83adf01bde	fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138 ) * docs(spec): #134 unify BigQuery access behind BqAccess facade Brainstorm output for issue #134. Captures: - root cause (incl. correction of the issue's hypothesis about commit 33a9964) - BqAccess facade API + project resolution rules - error contract — typed BqAccessError mapped to HTTP 502 for upstream BQ failures, 500 for deployment/config bugs - migration plan for v2_scan, v2_sample, RemoteQueryEngine - test rewrite eliminating _bq_client_factory injection point - E2E verification protocol on agnes-development as success criterion * docs(spec): #134 revise after first review Incorporates code-reviewer findings: Must-fix: - Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope. - Reframe v2_scan headline: missing try/except around BQ calls is the actual cause of bare 500s, not project resolution (which 33a9964 fixed). - List two more deferred call sites (extractor.py, register_bq_table) with explicit rationale. Important: - Drop billing != data clause from cross_project_forbidden heuristic; rely only on 'serviceusage' substring. billing != data is normal for cross-project setup, was over-classifying. - Split bq_bad_request into _user (400) and _server (502) variants; add sql_origin parameter to translate_bq_error so call sites declare whether SQL contains user input. - Add @functools.cache to BqAccess.from_config; document tests bypass via dependency_overrides. - Replace monkey-patched-classmethod test pattern with BqAccess(client_factory=...) injection at construction time. Cleaner than today's _bq_client_factory and 1:1 migration shape. - Keep BqProjects.data (reviewer assumed registry has source_project; it doesn't). Multi-project explicitly listed as non-goal with note. Nice-to-have: - Add 'Implementation strategy' section: 2 staged commits (bug fix alone is revertable; refactor follows). 
- Extend E2E protocol to cover all three endpoints, not just /sample. - Note removal of stale docstring at src/remote_query.py:204. * docs(spec): #134 revision 3 — incorporates second-round review Must-fix from second review: - v2_schema split into two migration cases: _fetch_bq_schema translates errors via translate_bq_error; _fetch_bq_table_options preserves its swallow-all 'except Exception → return {}' so /schema doesn't 502 on partition-info failures. - RemoteQueryEngine.__init__ now resolves BqAccess lazily (in _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests in test_remote_query.py would suddenly fail with not_configured. - translate_bq_error pass-through for BqAccessError is now load-bearing (clause 1, before any Google-API branch). bq.client() raises BqAccessError for bq_lib_missing/auth_failed; without explicit pass-through those fall to 'unknown' and re-raise as bare 500. - Commit 1 now emits the SAME structured response shape as commit 2 to avoid contract churn between commits. - BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments — flagged in CHANGELOG ### Changed. Editorial: - sql_origin renamed to bad_request_status with values 'client_error' / 'upstream_error' (clearer about what the parameter actually decides). bq_bad_request_user/_server kinds collapsed to bq_bad_request (400) and bq_upstream_error (502). - CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller; unaffected because new bq_access kwarg has default None. - Added unit/integration tests for the new contracts: test_translate_passes_through_BqAccessError, test_v2_scan_returns_500_on_bq_lib_missing, test_v2_schema_returns_200_with_empty_partition_on_bq_failure, test_resolve_succeeds_after_config_set. - E2E protocol now covers /schema as the fourth endpoint. - Documented functools.cache-doesn't-cache-exceptions semantics and fixture nullcontext-doesn't-close caveat for nested sessions. 
* docs(spec): #134 revision 4 — incorporates third-round review Third reviewer verdict: 'implementation-ready with two trivial edits'; explicitly noted prior rounds did the heavy lifting. Edits: 1. get_bq_access() module-level function instead of @classmethod @functools.cache from_config. Removes the classmethod-cache stacking footgun (different Python versions wrap differently) and gives FastAPI's dependency introspection a clean function signature. Drops the 'Do not subclass BqAccess' caveat that no longer applies. 2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample), _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all in commit 1 — preserved as-is, then migrated (still preserved) in commit 2. All three endpoints emit the same structured body shape so client parsers see one consistent contract throughout the staged rollout. No more half-rolled-out window where /sample is bare 500 while /scan is structured 502. * docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification) Bite-sized TDD tasks. 3 phases, 16 tasks total: Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints: Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan, _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options preserved untouched. CHANGELOG Fixed entries. Phase 2 (Commit 2) — BqAccess facade extraction + migration: Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up (BqProjects, BqAccessError, translate_bq_error, default factories, BqAccess class, get_bq_access module-level cached). Task 2.6 adds conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG Changed BREAKING + Internal. Phase 3 — Verification: 3.1 full pytest. 
3.2 squash into two PR-shape commits. 3.3 manual E2E on agnes-development per spec protocol → close #134. Self-review table maps spec sections to implementing tasks; no gaps. * fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict block only) with try/except for google.api_core exceptions, translating to HTTPException with a structured body shape: {error, message, details}. Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500 with no body when the SA on agnes-development hit cross-project Forbidden on serviceusage.services.use. Also fixes /sample's missing billing_project fallback (the bug 33a9964 fixed for /scan never landed here). Status code split: - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is user-derived from req.select/where/order_by. - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is server-constructed from validated identifiers. - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in error message (with hint pointing at data_source.bigquery.billing_project), else bq_forbidden. Body shape matches what the upcoming BqAccess refactor (next commit) will produce, so client-side parsers see one consistent contract throughout the staged rollout. _fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty contract is intentional and survives into the refactor; /schema continues to return 200 with empty partition info when partition queries fail. Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema endpoints exist only to make the test pattern (monkeypatching whole _fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal once BqAccess centralizes translation. 
* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine Extracts the duplicated BigQuery-access pattern (project resolution + client construction + DuckDB-extension session + Google-API error translation) into connectors/bigquery/access.py. Migrates four call sites to use it: - app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan - app/api/v2_sample.py — _fetch_bq_sample - app/api/v2_schema.py — _fetch_bq_schema (strict translation), _fetch_bq_table_options (preserves swallow-all best-effort contract) - src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg The new module exposes: - BqProjects (frozen dataclass: billing + data project IDs) - BqAccessError (typed exception with HTTP_STATUS class mapping) - BqAccess (facade with injectable client_factory/duckdb_session_factory for tests; defaults call the real google-cloud-bigquery + DuckDB extension) - get_bq_access (module-level @functools.cache; FastAPI Depends target) - translate_bq_error (Google API exception → BqAccessError mapper, with BqAccessError pass-through, 'serviceusage'-substring heuristic for cross_project_forbidden, and bad_request_status param distinguishing user-derived (400) from server-constructed (502) SQL) - _default_client_factory, _default_duckdb_session_factory RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only RemoteQueryEngine tests need no changes — bq_access defaults to None and get_bq_access() is only invoked on first BQ call (lazy resolution). BqAccessError raised internally is translated to RemoteQueryError( error_type="bq_error") in _get_bq_client to preserve the engine's existing public contract — CLI and /api/query/hybrid callers see no change. 
Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample, test_v2_schema) migrate from monkey-patching whole _fetch_* functions to using the new bq_access fixture in tests/conftest.py — which exercises the REAL translation path through BqAccess + translate_bq_error, closing the test gap flagged in Task 1.1's review. Side-effect behavior change: v2_sample's FROM clause now uses the data project (instance.yaml data_source.bigquery.project), not the conflated billing_project from Phase 1. Documented in CHANGELOG ### Internal. BREAKING for deployments combining BIGQUERY_PROJECT env var with data_source.bigquery.project in instance.yaml — env var now overrides data project too. See CHANGELOG ### Changed. Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py, scripts/duckdb_manager.register_bq_table) explicitly out of scope; tracked as follow-up. Removed stale docstring at the previous src/remote_query.py:204 that referenced scripts.duckdb_manager._create_bq_client as the default BQ client factory (RemoteQueryEngine never actually used that function). Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_.py + tests/test_remote_query.py migrated to bq_access fixture (counts unchanged or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures (DB migration tests with unrelated internal_roles DependencyException — not introduced by this PR). fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed) CI on PR #138 caught: bigquery.Client(...) resolves Application Default Credentials at construction time; without ADC (CI without SA key, dev laptop without 'gcloud auth application-default login') it raises google.auth.exceptions.DefaultCredentialsError synchronously. Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError propagated as raw exception — and from production endpoints would surface as bare 500 (the exact failure mode #134 sets out to fix). 
Now translates to BqAccessError(kind='auth_failed', details.hint='Run gcloud auth application-default login...'). Endpoint catch chain returns HTTP 502 with structured body. Adds unit test test_raises_auth_failed_on_default_credentials_error. Third-round spec review flagged this case in passing; the fix didn't land. CI's auth-less environment surfaced it. * fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends breaks all v2 endpoints for non-BigQuery instances'. Pre-fix: get_bq_access() raised BqAccessError(not_configured) when neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set. Because FastAPI resolves Depends() BEFORE the endpoint body runs, this exception fires during dep-injection — the endpoint's try/except BqAccessError clause never gets a chance to catch it. Result: every v2 request on Keboola-only or CSV-only instances returned bare HTTP 500, even for local-source tables that never touch BigQuery. Fix: get_bq_access() now returns a sentinel BqAccess with empty BqProjects and factories that raise BqAccessError(not_configured) on actual use. Construction succeeds, FastAPI's dep-injection cleanly yields the sentinel, the endpoint runs. The local-source code path in build_sample / build_schema / etc. never calls bq.client() or bq.duckdb_session() (it reads parquet directly), so non-BQ tables return 200 as before. Only when an endpoint actually tries to query BQ (source_type == 'bigquery') does the sentinel raise — and the endpoint's existing except BqAccessError catches it normally, returning structured 502 with hint. Test get_bq_access::test_raises_not_configured_when_neither_set renamed and rewritten to test_returns_sentinel_when_neither_set: asserts BqAccess is returned, then asserts client() and duckdb_session() each raise BqAccessError(not_configured) on call. 
Test test_does_not_cache_exceptions removed (no longer applicable) and replaced with test_sentinel_is_cached_per_process documenting the operator-restart-on-config-change contract. * docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule) Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained customer-specific deployment hostnames, deployment names, and a GCP project ID that violated CLAUDE.md's vendor-agnostic OSS rule ('Nothing customer-specific belongs in code, configuration defaults, comments, docs, commit messages, PR titles, or PR bodies'). Replacements: agnes-development.groupondev.com -> <your-agnes-host> agnes-development -> <your-dev-instance> prj-grp-dataview-prod-1ff9 -> <your-data-project> s1_session_landings -> <bq_table_id> E2E verification semantics unchanged — operators still run the same four curls + config flip + retry, just substituting their own host / deployment name / project / table. * fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at process level, so it captures BigQuery project IDs at first call and ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints re-read get_value() on every request, so admin /api/admin/server-config saves (which call instance_config.reset_cache()) hot-reloaded the BQ project. Without this fix, my refactor silently regresses that contract — operators editing instance.yaml via the admin UI would see no effect on v2 endpoints until container restart. instance_config.reset_cache() now also calls connectors.bigquery.access.get_bq_access.cache_clear() (lazy import, swallowed if connectors module isn't loaded — keeps instance_config usable in isolated unit tests). Adds test_instance_config_reset_cache_invalidates_get_bq_access as regression guard. 
Updates CHANGELOG Internal entry to mention the hot-reload contract + the not-configured sentinel behavior (round-3 fix from Devin BUG_0001 was previously only in commit message). * fix(bq_access): surface not_configured before identifier validation + plan path genericize Devin BUG_0001 + BUG_0002 round 5 on PR #138. BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md vendor-agnostic rule. Replaced with '<worktree-root>' placeholder. BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel BqAccess (BQ not configured), the empty bq.projects.data was reaching validate_quoted_identifier first and raising ValueError -> endpoint mapped to HTTP 400 'unsafe_identifier' instead of structured 500 'not_configured' with hint. Each fetch helper now checks 'if not bq.projects.data: bq.client()' as the first step, which triggers the sentinel's BqAccessError(not_configured). Endpoint catches the typed error and returns HTTP 500 with hint pointing at data_source.bigquery.project. Best-effort _fetch_bq_table_options returns {} silently in this case (preserves the swallow-all contract). * fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery extension is a C++ plugin making its own HTTP calls — when BQ returns 403, it throws duckdb.IOException with the BQ error embedded as text, not gax.Forbidden. translate_bq_error's isinstance checks would miss these, falling to case 7 → bare 500 in production for v2_scan, v2_sample, and v2_schema (the bigquery_query() paths). Fix: last-resort string-match heuristic before the re-raise. 'Forbidden' / '403' / 'Bad Request' / '400' in the lowercased message classifies via the same kind hierarchy. The 'serviceusage' substring still distinguishes cross_project_forbidden from bq_forbidden. Specific enough that random exceptions without HTTP-error keywords still re-raise. 
Adds 4 unit tests covering the new heuristic + the 'don't swallow random exceptions' invariant. * chore(release): cut 0.22.0 PR #138 contains issue #134 user-visible behavior changes: - BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml data_source.bigquery.project for v2 endpoints (previously RemoteQueryEngine billing only). - Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate, /schema when BigQuery raises Forbidden/BadRequest (was bare 500). - Internal: BqAccess facade refactor unifying four duplicate BQ-access call sites; instance_config.reset_cache() now invalidates BqAccess cache too so admin server-config saves hot-reload BQ project IDs. Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.	2026-04-30 10:11:20 +02:00
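The last-resort string-match heuristic described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `translate_bq_error`: the function name `classify_bq_message` and its return shape are hypothetical, while the kind names, status codes, and the `bad_request_status` values are taken from the commit message.

```python
from typing import Optional, Tuple


def classify_bq_message(
    message: str, bad_request_status: str = "upstream_error"
) -> Optional[Tuple[str, int]]:
    """Map an error message to (kind, http_status), or None to re-raise.

    Hypothetical helper: only the kinds and status codes come from the
    commit message; the signature is an assumption for illustration.
    """
    text = message.lower()
    if "forbidden" in text or "403" in text:
        # 'serviceusage' marks the cross-project billing case.
        if "serviceusage" in text:
            return ("cross_project_forbidden", 502)
        return ("bq_forbidden", 502)
    if "bad request" in text or "400" in text:
        # User-derived SQL is a client error; server-built SQL is upstream.
        if bad_request_status == "client_error":
            return ("bq_bad_request", 400)
        return ("bq_upstream_error", 502)
    # No HTTP-error keyword: the caller should re-raise the original exception.
    return None
```

Returning `None` for unmatched messages mirrors the stated invariant that random exceptions without HTTP-error keywords are not swallowed.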
ZdenekSrotyr	b5178fe942	fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140) Three CI fixes triggered by the failed PR #137 deploy: 1. scripts/smoke-test.sh: assertion 8 was hitting /api/admin/tables (renamed to /api/admin/registry long ago). The 404 was treated as a deployment regression and triggered the auto-rollback. Same stale URL also fixed in CLAUDE.md, README.md, dev_docs/server.md. 2. .github/workflows/release.yml smoke-test job: added Log in to GHCR step. The auto-rollback's docker push :stable was failing with 'unauthenticated' because the smoke-test job had no GHCR login of its own, leaving :stable pointing at the broken image. 3. Rollback step gained GH_TOKEN env, AND the workflow's permissions block gained issues:write. Both were needed for gh issue create to actually create the alert issue (the failure was silently swallowed by the || echo fallback). Manual cleanup outside this PR: :stable currently points at the broken PR #137 image and needs a manual retag back to stable-2026.04.505.	2026-04-30 09:42:27 +02:00
minasarustamyan	4ec5ff44dd	feat(setup): cross-platform TLS bootstrap + marketplace plugin install (#137) Bootstraps the Agnes Claude Code marketplace + RBAC-allowed plugins from the dashboard CTA, and inlines the server's TLS cert when the chain isn't publicly trusted (self-signed / private CA). Cross-platform setup prompt covers Windows Git Bash, macOS, Linux. Includes Bun-compiled `claude` fix (macOS goes via git-clone fallback, same as Windows), PAT stripping after clone, explicit error handling, and four rounds of Devin Review fixes (phantom step references, $PLATFORM re-detection, heredoc/awk line-count sync). Cuts 0.21.0. See CHANGELOG.md [0.21.0] section for details.	2026-04-30 08:56:45 +02:00
Vojtech	38f6b639d2	feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136) Cuts release 0.20.0. ## Highlights - X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation) - Error pages (HTML + JSON 500) surface request_id for support tickets - Dev debug toolbar gated by DEBUG=1: fastapi-debug-toolbar with custom DuckDBPanel - Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls - Telegram bot drops bot.log file; stdout only (BREAKING) ## Devin findings addressed - BUG_0001: .env.template no longer claims FastAPI debug=True - BUG_0002: subprocess extractor logs INFO to stderr again - ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before) - BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production - BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor; follow-up to update CLAUDE.md if needed See CHANGELOG [0.20.0] for full notes.	2026-04-29 22:54:21 +02:00
ZdenekSrotyr	b7a1795834	feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135) Bundles 4 issues: - #79: table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators) - #78: script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation) - #77: sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS) - #89: OpenMetadataClient verify=True default (BREAKING for self-signed) Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.	2026-04-29 22:06:30 +02:00
minasarustamyan	953bd9d250	fix(marketplace): use plugin.json name in synth marketplace.json (#133) Closes the /plugin UI 'Plugin <X> not found in marketplace' bug. Synth marketplace.json catalog 'name' now reads from <plugin_dir>/.claude-plugin/plugin.json (with fallback to upstream marketplace.json name). On-disk plugins/<slug>-<plugin>/ layout preserved so cross-marketplace files don't collide. /marketplace/info exposes both name and prefixed_name (BREAKING: downstream tooling parsing 'name' for the slug-prefixed form must switch to prefixed_name).	2026-04-29 19:25:57 +02:00
ZdenekSrotyr	514fe2c8b6	chore(release): cut 0.18.0 Bundles #119 (BigQuery register-table M1), #126 (memory tree+duplicates+bulk-edit), #131 (Google groups prefix filter, BREAKING — auto-Everyone removed).	2026-04-29 14:34:58 +02:00
minasarustamyan	c940593a90	feat(auth): Google Workspace group prefix filter + system mapping (#131) Three new env vars wire the Google OAuth callback to a configurable Workspace prefix and route admin/everyone Workspace groups onto the seeded system rows: AGNES_GOOGLE_GROUP_PREFIX, AGNES_GROUP_ADMIN_EMAIL, AGNES_GROUP_EVERYONE_EMAIL. Login gate redirects users with no prefix-matching group to /login?error=not_in_allowed_group. BREAKING: auto-Everyone membership for new users removed. Admin UI/API are read-only on Google-managed groups. See docs/auth-groups.md.	2026-04-29 14:08:04 +02:00
ZdenekSrotyr	82c5d71d63	feat(memory): #62: duplicate hints + tree-view + bulk-edit (#126) Issue #62. Tree view with cross-axis filtering, duplicate-candidate hints (Jaccard score on entity overlap), bulk-edit endpoints (PATCH /api/memory/admin/{id} + POST /api/memory/admin/bulk-update), schema v17 (knowledge_item_relations), full CLI parity (da admin memory tree/edit/bulk-edit/duplicates list/resolve).	2026-04-29 13:55:15 +02:00
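For readers unfamiliar with the Jaccard score used for the duplicate-candidate hints above, here is a minimal sketch of the standard definition (size of the intersection over size of the union). The function name and the choice to score two empty sets as 0.0 are assumptions; how the project extracts entities or thresholds the score is not stated in the commit.

```python
def jaccard(entities_a, entities_b):
    """Jaccard similarity of two entity collections: |A ∩ B| / |A ∪ B|.

    Hypothetical helper; returning 0.0 for two empty inputs is an assumed
    convention so that items without entities are never flagged.
    """
    a, b = set(entities_a), set(entities_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

For example, two items tagged `{"orders", "revenue", "eu"}` and `{"orders", "revenue", "us"}` share two of four distinct entities, giving a score of 0.5.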
ZdenekSrotyr	1824b9dd9c	feat(admin): #108 M1: BigQuery table registration in UI + CLI (#119) Issue #108 Milestone 1. Adds BigQuery table registration via /admin/tables UI and `da admin register-table` CLI without hand-editing table_registry. POST /api/admin/register-table/precheck for round-trip validation. --dry-run flag on CLI. Audit-log entries on register/update/unregister. PUT /api/admin/registry/{id} now preserves registered_at (closes #130).	2026-04-29 13:18:31 +02:00
ZdenekSrotyr	995e4cd366	fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127 ) * fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip: 1. The marketplaces job called src.marketplace.sync_marketplaces() in-process from the scheduler container, racing the app's long-lived system.duckdb handle. DuckDB rejects cross-process writers — every cron tick 500-ed on "Could not set lock on file ... PID 0". 2. The data-refresh + new marketplaces jobs both 401-ed on the API because SCHEDULER_API_TOKEN was never propagated by the Terraform startup script. The scheduler had no credential to authenticate with. Fix: - New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh through the app process so it inherits the existing DB connection. - Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and the scheduler is reduced to a cron clock. - New app/auth/scheduler_token.py adds a shared-secret auth path. The startup script generates a 256-bit secret on first boot, persists it across reboots, and writes it to /opt/agnes/.env. Both containers source the same .env. The app validates incoming Bearer tokens against the env var (constant-time, length-floored) and resolves matches to a synthetic scheduler@system.local user that's a member of the Admin system group. Audit-log entries from the scheduler are attributed to this user. - app/main.py seeds the synthetic user at startup so the first cron tick has a valid actor; lazy seed in get_scheduler_user covers token rotation before the next app restart. Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short secret rejection, exact-match comparison, idempotent user seeding, and lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests remain green. 
Existing VMs with .env from before this change need a one-time re-provisioning (re-run startup-script or rotate via openssl rand); documented in CHANGELOG. * fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127 Avoids the literal string 'marketplace:None' in the audit_log resource column when the bulk sync endpoint writes its summary row. * fix(scheduler): unblock event loop + per-job timeouts — Devin review #127 Two findings from Devin re-review on commit 5fbad15: 1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio event loop. sync_marketplaces() does blocking I/O (subprocess git clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes) and would freeze every concurrent request for the duration of a bulk sync. Switched to plain def so FastAPI auto-routes to the thread pool. 2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST. Bulk marketplace sync iterates the registry under a single lock with up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The scheduler then sees a timeout, doesn't update last_run, and re-fires on the next 30s tick, queueing redundant work. Per-job timeout override added to the JOBS tuple; marketplaces gets 900s (15 min), data-refresh keeps 120s, health-check 30s. * fix(auth): require_session_token rejects scheduler shared secret — Devin review #127 require_session_token gates /auth/tokens (PAT minting). Pre-fix it only rejected JWTs with typ=pat — but the scheduler shared secret is an opaque string, so verify_token() returns None, payload becomes {}, and the PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN could mint persistent PATs that survive a secret rotation. Added explicit is_scheduler_token() check before the PAT-claim check; new regression test in tests/test_auth_scheduler_token.py. 
Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392 also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR and I agree; tracking separately. * release(0.17.0): cut + clean up CHANGELOG duplicates Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint plus the deploy-shape fixes that landed since the last release tag). Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120 (v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had been under-reporting the running release). Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0 above [0.16.0] — the canonical short summaries (with GitHub-release links) already exist below 0.16.0, the long forms were leftover state from before those versions were cut and have been silently shadowed ever since.	2026-04-29 11:44:00 +02:00
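The "constant-time, length-floored" shared-secret check described in the scheduler commit above could be sketched as below. The name `is_scheduler_token` appears in the commit message, but this signature, the `MIN_SECRET_LEN` value, and passing the configured secret as an argument are assumptions made to keep the example self-contained.

```python
import hmac

# Assumed floor; the commit only says the comparison is "length-floored".
MIN_SECRET_LEN = 32


def is_scheduler_token(presented, configured):
    """Return True only if `presented` matches a sufficiently long secret.

    Illustrative sketch, not the project's implementation.
    """
    # An empty or short configured secret disables the auth path entirely.
    if not configured or len(configured) < MIN_SECRET_LEN:
        return False
    if not presented:
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(
        presented.encode("utf-8"), configured.encode("utf-8")
    )
```

Rejecting short configured secrets first means a misconfigured empty environment variable cannot be matched by an empty Bearer token.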
dependabot[bot]	7012966482	chore(deps): bump actions/checkout from 5 to 6 (#125) Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md) - [Commits](https://github.com/actions/checkout/compare/v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>	2026-04-29 09:58:48 +02:00
dependabot[bot]	8d0edbf1c1	chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#124) Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 7 to 8. - [Release notes](https://github.com/peter-evans/create-pull-request/releases) - [Commits](https://github.com/peter-evans/create-pull-request/compare/v7...v8) --- updated-dependencies: - dependency-name: peter-evans/create-pull-request dependency-version: '8' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-29 09:46:09 +02:00
dependabot[bot]	62a5b8540a	chore(deps): bump actions/upload-artifact from 4 to 7 (#123) Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7. - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](https://github.com/actions/upload-artifact/compare/v4...v7) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-04-29 09:38:38 +02:00
ZdenekSrotyr	61f6b8d2d5	feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120 ) Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code. ### CI/CD Pipeline - ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error) - Smoke test added to keboola-deploy.yml (was missing) - Automatic rollback on smoke test failure in release.yml - Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics - Required status checks via .github/settings.yml - Dependabot + CODEOWNERS + pre-commit hooks + ruff config ### Source Code - DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy) - Config versioning (config_version: 1 in instance.yaml, non-blocking validation) - BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH) - Post-deploy smoke test script for prod VM validation ### Test Coverage (~50 new tests) - v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git, Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes, Orchestrator failure modes Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-29 09:18:55 +02:00
ZdenekSrotyr	6752c4a53e	fix(web): restore admin nav menu items (#122) v13 RBAC migration nulled users.role and moved admin authority onto user_group_members. Header still gated on session.user.role == 'admin', so admin menu was hidden for everyone. Inject user['is_admin'] via is_user_admin in get_current_user; header reads session.user.is_admin.	2026-04-29 09:09:23 +02:00
ZdenekSrotyr	33b318e491	ci(release): build dev image on branch creation from main (#118) Fixes the per-developer dev-VM workflow: paths-ignore on push skipped same-SHA branch creates, build-and-push.if was main+dispatch-only. Add create: trigger filtered to branch refs, broaden build-and-push.if, add concurrency group keyed on github.ref with cancel-in-progress to dedupe create+push collisions on new branches with code changes.	2026-04-29 08:15:30 +02:00
ZdenekSrotyr	1baadd172e	fix(ui): render shared header full-width on corporate memory pages (#117) Move {% include '_app_header.html' %} out of .container-memory (max-width: 1000px) in corporate_memory.html and corporate_memory_admin.html so the header spans the viewport, matching dashboard.html. Page content stays constrained by the container.	2026-04-29 07:45:56 +02:00
PavelDo	e1108b6112	feat(memory): corporate memory v1+v1.5 + 0.15.0 (#72 ) Adds corporate memory v1 (verification flywheel + contradiction detection + confidence scoring) and v1.5 (audience-based distribution + per-item privacy + admin curation). Server: GET /api/memory/bundle returns mandatory + ranked-approved items within a token budget; POST /api/memory/admin/mandate accepts an audience field gated against user_group_members; /api/memory/stats uses SQL aggregation. CLI: da sync writes received items to .claude/rules/km_*.md. Verification detector extracts knowledge candidates from session JSONL files. Auto-tagging via Haiku when ai: is configured. Adapted from the v9-era branch onto v13/v14 RBAC: _is_privileged_viewer + _effective_groups now query user_group_members JOIN user_groups; require_role(Role.KM_ADMIN) replaced with require_admin (km_admin collapsed into admin). Schema v15: knowledge_items context-engineering columns + knowledge_contradictions + session_extraction_state. Schema v16: verification_evidence. Cuts release v0.15.0 (also bundles #116 /me/debug page).	2026-04-29 07:16:22 +02:00
minasarustamyan	7a06f1a585	feat(auth): /me/debug self-only auth diagnostic page (#116 ) Adds /me/debug HTML page rendering the logged-in user's own session state — decoded JWT claims (no raw token, sha256[:12] fingerprint for log correlation), group memberships with sources and bound external_id when present, resource grants effective via those memberships, and a Refetch from Google (dry-run) button that diffs a fresh fetch_user_groups call against the cached user_group_members snapshot. Gated by AGNES_DEBUG_AUTH env var (default off → 404, route existence undetectable in production). Self-only by construction: user_id is read from the validated session, never echoes raw JWT / password hash / full PAT. Tolerates v13 + v14 schemas via information_schema check on users.external_id.	2026-04-29 06:36:28 +02:00
ZdenekSrotyr	2e1dfb7553	feat(v2): claude-driven fetch primitives + 0.14.0 (#102 ) Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.	2026-04-29 01:07:19 +02:00
ZdenekSrotyr	a222f92e70	feat(admin): server configuration editor + 0.13.0 (#107 ) Adds /admin/server-config UI for editing instance.yaml from the web. Hardening: SSRF gate on data_source URLs, narrow-overlay write strategy, atomic writes, audit log with secret masking on shape changes, threading lock on read-modify-write, corrupt-overlay refusal on write side + louder log on read side, modal Promise resolution on backdrop dismiss, sentinel scrub on save (defense-in-depth client+server). Bundles Windows PowerShell wrapper from #80. Cuts release v0.13.0.	2026-04-29 00:47:23 +02:00
David Rybar	cfe5771856	feat(dev): add Windows PowerShell wrapper for local development (#80 ) Adds `scripts/run-local-dev.ps1` as a sibling of the bash script for Windows operators. Same compose stack (`docker-compose.yml` + `.dev.yml` + `.local-dev.yml`), same up/down/logs subcommands, same LOCAL_DEV_GROUPS default seeding. Restores caller's working directory and LOCAL_DEV_GROUPS on every exit path (success, error, Ctrl+C). Avoids advanced-script promotion so `up -d` / `down -v` reach docker compose instead of being eaten by -Debug/-Verbose.	2026-04-28 23:59:11 +02:00
ZdenekSrotyr	5f6bb7a4b2	fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104 ) * fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture Security and operational hardening across three issue groups: - M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun) - C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile) - M21: Docker resource limits (mem_limit, cpus) on app + scheduler - M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server) - M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING) - M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs - C2: table_id traversal validation on /api/data/{table_id}/download - M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename - C5: reset_token removed from POST /api/users/{id}/reset-password response - C8: Startup WARNING when no user has password_hash (bootstrap window visible) - M9: Audit log on failed web form login (mirrors /auth/token endpoint) - M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch) Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90) Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety Review fixes: - Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was missing, allowing requests to AWS/GCP/Azure metadata endpoints - Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex - M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with warning log — prevents unhandled exception if DB write fails after successful token consumption - Add test for 169.254.169.254 SSRF rejection Generated with 
[Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak Address Devin Review findings on PR #104: 1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution + ipaddress module checks. The old regex patterns like `fe80:` only matched up to the first colon, missing real IPv6 addresses like `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the hostname via getaddrinfo and checks each resulting IP against ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast. 2. CLI commands broken: `da setup test-connection`, `da setup verify`, `da diagnose`, `da status` all called /api/health expecting the old format (status=="healthy", services dict). Now they call /api/health/detailed for service-level checks (with graceful fallback to the minimal endpoint when auth is not configured). 3. Temp file handle leak: _stream_to_temp returns an open NamedTemporaryFile; callers now close it before shutil.move() to prevent FD leaks until GC. Also adds IPv6 SSRF test cases (loopback, link-local, unique-local, multicast) with mocked DNS resolution for test environment independence. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * fix(review): download regex blocks hyphenated IDs; document health split Address Devin Review round-3 findings on PR #104: 1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download endpoint used the strict SQL-identifier regex which does not allow dots or hyphens, but Keboola table IDs like in.c-crm.orders contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots and hyphens while still blocking path-traversal chars (/, .., \) and quote/control characters. Added test for hyphenated/dotted IDs. 2. 
Documented health endpoint split in DEPLOYMENT.md: Added Health checks & external monitoring section explaining both endpoints (minimal unauth /api/health vs authenticated /api/health/detailed) and how to wire external monitoring tools to the detailed endpoint with a PAT. Generated with [Devin](https://cli.devin.ai/docs) Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com> * release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening * fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up) Devin review on the rebased PR flagged the asymmetry: magic-link verify got the atomic compare-and-swap pattern in the original M10 fix, but password reset confirm at /auth/password/reset/confirm was still using read-validate-clear. Two concurrent POSTs with the same valid reset token could both succeed in setting different new passwords (last-write-wins). Lower severity than the magic-link race because the attacker would need the reset token AND to race the legitimate user, but the asymmetry was a polish gap. Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then SELECT to verify our marker won, then proceed. Only the winner clears the marker and applies the password change. New regression test test_concurrent_reset_only_one_wins in tests/test_password_flows.py::TestResetConfirm pins the contract: two ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same token; exactly one gets 302 (password applied), the other gets 200 with 'Invalid or expired'. Sanity-checked against the pre-CAS code: both POSTs got 302 (race confirmed). --------- Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-28 19:57:30 +02:00
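The SSRF fix in this commit replaces hostname regexes with DNS resolution plus `ipaddress` checks. A minimal sketch of that approach is below, assuming hypothetical function names (`is_blocked_ip`, `resolve_host`, `is_safe_host`) and an injectable resolver so the check can be exercised without real DNS, in the spirit of the mocked-resolution tests the message mentions. It is not the project's actual implementation.

```python
import ipaddress
import socket
from typing import Callable, Iterable, List


def is_blocked_ip(ip: str) -> bool:
    """True when the address falls in a range an SSRF gate should refuse."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local  # includes 169.254.0.0/16 (cloud metadata) and fe80::/10
        or addr.is_reserved
        or addr.is_multicast
    )


def resolve_host(hostname: str) -> List[str]:
    """Resolve a hostname to all of its addresses via getaddrinfo."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def is_safe_host(
    hostname: str,
    resolver: Callable[[str], Iterable[str]] = resolve_host,
) -> bool:
    """Allow a host only if it resolves and every resolved IP is public."""
    try:
        ips = list(resolver(hostname))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return bool(ips) and not any(is_blocked_ip(ip) for ip in ips)
```

Checking every resolved address, rather than only the first, matters because a hostname can resolve to a mix of public and internal IPs.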
Petr Simecek	3047f310b9	fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75 ) Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been bumped to schema_version=10 during local experimentation with a parallel WIP branch (PR #72-style Context Engineering work). The lab v10 migration laid down its own table set without including v9's role tables — so the v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a future-version-rollback and skipped its migration ladder, but ALSO skipped the table-creation step. Every query against user_role_grants (`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback, every admin-gated request) then crashed with `_duckdb.CatalogException: Table with name user_role_grants does not exist`. Symptom on agnes-dev: HTTP 500 on /profile, admin nav vanished, /admin/* returned 403. Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema, unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so existing tables stay untouched (columns + data preserved); missing tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs per process start). The migration block below still calls _SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap follow-up — left in place so the migration ladder reads chronologically. Concrete coverage of the rebase scenario the user asked about — a contributor switching FROM a lab future-schema branch BACK to a released binary now boots cleanly: - Forward rebase (older → current): unchanged, ladder runs as before. - Same-version rebase: unchanged, _seed_core_roles tail call still drives doc-tweak refresh. - Backward "lab" rebase (this fix): tables get re-materialized; if the DB is still on a future schema_version, _seed_core_roles tail call remains gated so we don't accidentally write data into a schema shape this binary doesn't understand. 
Operator can drop the v9 schema_version manually to trigger a clean ladder re-run if they want the full v8→v9 backfill (what we did to recover agnes-dev). Test: new test_split_brain_future_version_with_missing_tables_self_heals in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose only existing table is schema_version, runs _ensure_schema, asserts both user_role_grants AND internal_roles AND group_mappings AND users exist after the call, and that the schema_version row stays at 99 (future-version contract). test_future_version_is_noop docstring updated to reflect the new self-heal pass — its only assertion (the version-row contract) still holds unchanged. pyproject.toml: 0.11.5 → 0.11.6. CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton.	2026-04-28 15:51:33 +02:00
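The self-heal idea in this commit (run the idempotent `CREATE TABLE IF NOT EXISTS` block unconditionally, before any version comparison) can be sketched as below. This sketch swaps in the standard-library `sqlite3` module as a stand-in for DuckDB so it runs without extra dependencies. The table definitions are hypothetical and simplified, and the migration ladder is omitted.

```python
import sqlite3

SCHEMA_VERSION = 9

# Hypothetical, heavily simplified table set; every statement is idempotent.
_SYSTEM_SCHEMA = """
CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL);
CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS user_role_grants (user_id TEXT, role TEXT);
"""


def ensure_schema(conn: sqlite3.Connection) -> int:
    """Create any missing tables first, then look at the stored version."""
    # Unconditional: existing tables are untouched, missing ones are created.
    conn.executescript(_SYSTEM_SCHEMA)
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0]
    if current is None:
        # Fresh database: record the version this binary understands.
        conn.execute(
            "INSERT INTO schema_version (version) VALUES (?)", (SCHEMA_VERSION,)
        )
        current = SCHEMA_VERSION
    # current > SCHEMA_VERSION: future-version DB, version row left untouched.
    # current < SCHEMA_VERSION: the migration ladder would run here (omitted).
    conn.commit()
    return current
```

With this ordering, a database stamped with a future version but missing tables still ends up with every table this binary queries, while its version row stays as it was.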
ZdenekSrotyr	d9c4a95d9e	Merge pull request #106 from keboola/ma/staging ma/staging → main: RBAC v13 + marketplace + #81 security hardening + OSS neutralization	2026-04-28 14:56:34 +02:00
ZdenekSrotyr	0c963b55ef	fix(rbac+auth): system-group PATCH accepts description; Google sync preserves memberships on empty Two Devin-flagged regressions on the squashed PR #106 head: 1) PATCH /api/admin/groups/{id} blanket-rejected on system groups. The repository guard at src/repositories/user_groups.py was already narrowed to "rename only" by `7147bac` (PR #110 follow-up), but the endpoint at app/api/access.py:331-343 still short-circuited with 409 "System groups are immutable" for any mutation. A description-only payload like {"description": "..."} returned 409 instead of 200 even though the repo would have accepted it. CHANGELOG entry promised the fix but the code didn't match. Endpoint now mirrors the repo contract: 409 only when payload.name is set AND differs from existing name. Same-name no-op renames are dropped before the repo call. Description-only updates flow through. 2) Google OAuth callback wiped google_sync memberships on transient API failure. fetch_user_groups is fail-soft and returns [] for both "user has no groups" and "Cloud Identity API error". The callback fed that empty list into replace_google_sync_groups, which DELETEs all rows with source='google_sync' for the user then INSERTs zero, silently wiping every Workspace-synced membership on a hiccup. Callback now skips replace_google_sync_groups when group_names is empty and logs "preserving existing memberships". Trade-off: a user whose Workspace groups were genuinely cleared keeps stale memberships until the next non-empty sync. Admin-added rows (source='admin') were already protected by source-scope and are unaffected. The previous guard against this exact regression was test_callback_empty_groups_does_not_overwrite_existing in tests/test_auth_providers.py; that test class has been skipped since v12 (asserts users.groups JSON, needs rewrite for user_group_members).	2026-04-28 14:40:27 +02:00
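The second fix in this commit, skipping the replace when the fetched group list is empty, can be illustrated roughly as below. An in-memory dict stands in for the `user_group_members` table, and `on_oauth_callback` is a hypothetical name. Only `replace_google_sync_groups`, the `source` values, and the log wording come from the commit message.

```python
import logging
from typing import Dict, List, Set, Tuple

logger = logging.getLogger("google_sync_sketch")

# In-memory stand-in for user_group_members: user_id -> {(group_name, source)}
Memberships = Dict[str, Set[Tuple[str, str]]]


def replace_google_sync_groups(
    store: Memberships, user_id: str, group_names: List[str]
) -> None:
    """Drop the user's google_sync rows and insert the new set; other sources stay."""
    kept = {m for m in store.get(user_id, set()) if m[1] != "google_sync"}
    store[user_id] = kept | {(name, "google_sync") for name in group_names}


def on_oauth_callback(
    store: Memberships, user_id: str, group_names: List[str]
) -> bool:
    """Apply synced groups, but treat an empty list as 'unknown', not 'no groups'."""
    if not group_names:
        # An empty result may be a transient API failure, so keep what we have.
        logger.info("preserving existing memberships for %s", user_id)
        return False
    replace_google_sync_groups(store, user_id, group_names)
    return True
```

The cost is the trade-off the commit message names: a user whose groups were genuinely cleared keeps stale rows until the next non-empty sync.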
ZdenekSrotyr	5c54320f75	chore(release): cut 0.12.0 Pre-1.0 SemVer convention: BREAKING changes land in MINOR. This release is heavy on those — RBAC v13 schema rewrite, schema v14 FK constraints, marketplace admin god-mode drop, internal_roles/group_mappings/user_role_grants tables removed, exit code 2 from Keboola extractor (partial fail), Script API admin-only, scripts/grpn/ → scripts/ops/. CHANGELOG.md: rename Unreleased → [0.12.0] — 2026-04-28; append a fresh empty Unreleased above so the next PR has somewhere to land. pyproject.toml: 0.11.5 → 0.12.0. Tag the merge commit as v0.12.0 + push the tag after merge to main.	2026-04-28 14:25:13 +02:00
Vojtech Rysanek	7147bac079	feat(rbac+marketplace): schema v14 FK + AGNES_ENABLE_TABLE_GRANTS + break-glass CLI Follow-up to the RBAC v13 + marketplace work in the parent commit. Addresses deferred Devin findings, gemini-flagged blockers, and adds three guard rails. == Schema v14 — FK constraints on user_group_members + resource_grants == Adds DuckDB foreign-key constraints so cascade deletes can no longer leave orphaned member / grant rows pointing at a deleted group_id (which were relying on application-level cascades up to v13). Migration is RENAME → CREATE-with-FK → INSERT → DROP, wrapped in BEGIN TRANSACTION so a partial failure rolls back without leaving the DB at a half-applied schema. == AGNES_ENABLE_TABLE_GRANTS feature flag (default off) == ResourceType.TABLE was shipped in the parent commit as listing-only — admins can record grants but runtime enforcement still flows through legacy dataset_permissions. To avoid the misleading-UX surface area, the chip is hidden from /admin/access and POST /api/admin/grants returns 422 with the env-var name in detail until the operator opts in. Existing TABLE rows in resource_grants stay listable + deletable so cleanup is never blocked. Helpers: is_resource_type_enabled(rt), enabled_resource_types(). == Break-glass admin CLI == `da admin break-glass <user>` adds the user to the Admin user_group with source='system_seed' regardless of RBAC state. Bypasses authentication — relies on filesystem access to ${DATA_DIR}/state/system.duckdb implying host-level trust. Recovery path when the operator has locked themselves out of /admin/access. == Devin round-2 fixes (deferred on b4ec4c4) == - src/repositories/user_groups.py — narrow update() guard from blocking any mutation on system groups to blocking name change only. Description edits now pass through. Endpoint pre-check stays as defense-in-depth. Prior behavior surfaced as a misleading 409 'Cannot rename a system group' on description-only PATCH. 
- app/api/access.py:delete_group — wrap cascade DELETEs + repo.delete in BEGIN TRANSACTION / COMMIT / ROLLBACK. Prevents orphan rows if any DELETE fails after the user_groups row is gone. - app/marketplace_server/{packager,router}.py — split compute_etag_for_user() from build_zip(); router resolves etag first and 304-shorts before any file read or ZIP_DEFLATED. In-process cachetools.TTLCache (default 120s, env-tunable via AGNES_MARKETPLACE_ETAG_TTL, set 0 to disable). invalidate_etag_cache() called by sync to force re-hash on content drift. == Tests == - TestTableGrantsFeatureFlag (4 cases) — endpoint exclude/include, grant rejection/acceptance under the flag. - test_v12_to_v13_finalize_rollback_on_failure — destructive: monkeypatches _seed_system_groups to raise mid-transaction, asserts schema_version stays at 12, legacy tables intact, new tables empty (rollback fired). Then restores the real function and asserts the retry succeeds. - test_update_system_group_description_allowed, test_update_system_group_same_name_no_op — repo-level coverage of the narrowed guard.	2026-04-28 14:25:13 +02:00
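The ETag-first flow described for the marketplace router (resolve the ETag, short-circuit with 304, and only then build the ZIP) might look roughly like this. The commit uses `cachetools.TTLCache`; this sketch substitutes a small hand-rolled TTL cache to stay dependency-free, and `EtagCache` and `serve_zip` are hypothetical names.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class EtagCache:
    """Tiny TTL cache for per-user ETags; ttl_sec=0 disables caching."""

    def __init__(
        self,
        compute: Callable[[str], str],
        ttl_sec: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_sec
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, user_id: str) -> str:
        now = self._clock()
        hit = self._entries.get(user_id)
        if self._ttl > 0 and hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        etag = self._compute(user_id)
        if self._ttl > 0:
            self._entries[user_id] = (now, etag)
        return etag

    def invalidate(self) -> None:
        """Force a re-hash on the next request, e.g. after a sync."""
        self._entries.clear()


def serve_zip(
    user_id: str,
    if_none_match: Optional[str],
    cache: EtagCache,
    build_zip: Callable[[str], bytes],
) -> Tuple[int, str, bytes]:
    """Return (status, etag, body); build_zip runs only when the ETag differs."""
    etag = cache.get(user_id)
    if if_none_match == etag:
        return 304, etag, b""
    return 200, etag, build_zip(user_id)
```

Keeping the ETag computation separate from the packaging step is what lets a matching `If-None-Match` request skip compression entirely.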
ZdenekSrotyr	e9d7af3cce	feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening This squashes 13 commits from ma/staging plus a small docstring translation into a single coherent unit. Three workstreams. == RBAC v13 redesign == - Drops core.viewer/analyst/km_admin/admin hierarchy and the internal_roles / group_mappings / user_role_grants / plugin_access tables. - Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry. - Two authorization primitives in app.auth.access: require_admin — Admin-group god-mode require_resource_access(rt, "{path}") — entity-scoped grants Single DB lookup per request; no session cache; no implies BFS. - /admin/access UI (single page) replaces /admin/role-mapping + /admin/plugin-access. CLI `da admin group/grant ` replaces `da admin role/mapping/grant-role/revoke-role/effective-roles`. - ResourceType.TABLE listing-only — admins can record table grants, runtime enforcement still flows through legacy dataset_permissions (migration plan in docs/TODO-rbac-data-enforcement.md). == Claude Code marketplace == - Aggregated /marketplace.zip + /marketplace.git/ (PAT-gated, RBAC-filtered, content-addressed cache via dulwich). - Admin god-mode dropped on the marketplace surface — admins curate their own view via grants like everyone else. - Bare-repo cache materializes per RBAC-filtered ETag; stale entries not pruned in this iteration (disclaimed in git_backend.py docstring). == #81 #83 #44 security/ops hardening == - #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias). - #81 Group B — Keboola extractor 3-state exit codes: 0 success / 1 total fail / 2 PARTIAL fail Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary alerting must teach it the new partial signal. - #81 Group C — schema v10 view_ownership; rejects silent overwrite of a prior connector's view name on collision. 
- #81 Group D — extractor-side identifier validation. - #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset + path-traversal fix. - #44 — entire /api/scripts/* surface is admin-only (planted-script + sandbox-bypass risk closed). == Web UI polish + deploy fix == - /admin/access: live grant-count badges (no stale snapshot revert), shared-header CSS link added to /catalog and /admin/{tables,permissions}, per-resource-type colored stripes. - docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't silently shadow sub-mounts and write state to the wrong disk. == OSS vendor-neutralization (waves 1+2) == - scripts/grpn/ → scripts/ops/. Customer-specific identifiers (project IDs, internal hostnames, dev/prod VM IPs, brand names) replaced with placeholders across code, docs, Terraform, Caddyfile, OAuth probe, and planning docs. Downstream infra repos that copied scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must update the path. == Translation == - src/repositories/user_groups.py::ensure_system docstring translated from Czech to English for codebase consistency. Co-authored-by: Mina Rustamyan <mina@keboola.com>	2026-04-28 14:25:04 +02:00
ZdenekSrotyr	72230c3b51	fix: #81 Group C — view-name collision detection (schema v10, squashed) (#100 ) Schema v10 + view_ownership table. Cross-connector view name collisions are detected and refused with an actionable ERROR rather than silently last-write-wins. Pre-scan reconcile releases stale ownerships in the same rebuild as a rename — but only when ALL sources' pre-scans succeed (transient-IO defense; partial pre-scan skips reconcile to avoid silently stealing a name). 26/26 view collision + orchestrator tests pass. Refs #81 Group C.	2026-04-27 22:09:49 +02:00
ZdenekSrotyr	ef74ec010c	fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99 ) Closes M14 from issue #81. Keboola extractor exits 0/1/2 (success/full-fail/partial). sync.py interprets exit 2 as PARTIAL FAILURE (data-quality alert, distinct from exit 1). Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including runtime mock subprocess (rc=0/1/2/124). Refs #81 Group B.	2026-04-27 21:52:46 +02:00
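The three-state exit code contract (0 success, 1 total failure, 2 partial failure) can be sketched as follows. The function names are hypothetical. Two behaviors are assumptions of this sketch and are not stated in the commit: an empty run is treated as success, and any return code other than 0 or 2 (for example 124) is treated as a failure.

```python
def extractor_exit_code(succeeded: int, failed: int) -> int:
    """Map per-table results onto the 0 / 1 / 2 exit code contract."""
    if failed == 0:
        return 0  # success (assumption: an empty run also counts as success)
    if succeeded == 0:
        return 1  # total failure
    return 2  # partial failure: some tables extracted, some did not


def interpret_exit_code(rc: int) -> str:
    """Classify an extractor return code on the sync side."""
    if rc == 0:
        return "success"
    if rc == 2:
        return "partial_failure"  # data-quality alert, distinct from exit 1
    return "failure"  # assumption: every other code is a failure
```

Alerting that only distinguishes zero from non-zero would lump exit 2 in with a total failure, which is why the parent commit asks operators to teach it the new signal.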
ZdenekSrotyr	569cd90d75	fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97 ) Closes M15 from issue #81 — SQL injection via attacker-controlled identifiers in connectors/keboola/extractor.py and connectors/bigquery/extractor.py. Lifted _validate_identifier from src/orchestrator.py into a new src/identifier_validation.py shared module (single source of truth for both layers). Two validator policies: - validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for table_name — matches the orchestrator's rebuild-time check, so dashed names fail fast at extraction rather than being silently dropped. - validate_quoted_identifier (relaxed, accepts dashes/dots) for bucket/dataset/source_table — Keboola in.c-foo and BigQuery my-dataset are legitimate, just need to be safe inside `"..."`. Both extractors skip-and-continue on unsafe rows (logged + counted in failure stats); _extract_via_extension re-validates as defense-in-depth. 71/71 extractor + orchestrator tests pass. Refs #81 Group D.	2026-04-27 21:46:17 +02:00
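The two validator policies can be sketched as below. The strict pattern is the one quoted in the commit message. The relaxed pattern is an assumption, since the message only says it accepts dashes and dots and must be safe inside double quotes. These functions return booleans for brevity, whereas the real validators may raise instead.

```python
import re

# Strict policy: the pattern quoted in the commit message (anchors are implied
# by fullmatch, which also avoids "$" matching before a trailing newline).
_STRICT = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]{0,63}")

# Relaxed policy: hypothetical pattern that additionally allows dots and dashes
# but no quotes, slashes, backslashes, whitespace, or control characters.
_RELAXED = re.compile(r"[a-zA-Z0-9_][a-zA-Z0-9_.\-]{0,127}")


def validate_identifier(name: str) -> bool:
    """Strict check for names used as bare SQL identifiers (e.g. table_name)."""
    return _STRICT.fullmatch(name) is not None


def validate_quoted_identifier(name: str) -> bool:
    """Relaxed check for names that are only ever used inside double quotes."""
    if ".." in name:
        return False  # no path-traversal style sequences
    return _RELAXED.fullmatch(name) is not None
```

Under these patterns, a Keboola-style ID such as `in.c-crm.orders` fails the strict check but passes the relaxed one, matching the split the commit describes.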
ZdenekSrotyr	23be8ad46f	fix(security): #81 Group A — orchestrator attach hardening (squashed) (#95 ) Closes the C1 findings from issue #81 plus the round-3/4 follow-ups on the read-only query path. Both _attach_remote_extensions (rebuild path) and _reattach_remote_extensions (query path) now apply the same hard allowlists for extensions and token-env names, single-quote-escape the URL, and split built-in vs community install. The CHANGELOG bullet documents the full scope including the table_schema → table_catalog fix that made the rebuild path a silent no-op for every connector. New module src/orchestrator_security.py centralises the policy. Tests in tests/test_orchestrator_remote_attach_security.py — 28/28 pass. Refs #81.	2026-04-27 21:34:04 +02:00
ZdenekSrotyr	24e81fb671	fix(security): gate Script-API /run on admin role (#44 ) (#92 ) * fix(security): gate Script-API /run on admin role (#44) The AST + string-blocklist sandbox in `_execute_script` is defense-in-depth, not a primary trust boundary. It does not block `vars()`, `type()`, or `__class__.__bases__` introspection chains, and the string blocklist is trivially evadable via concatenation/dunder encoding. Treat the role gate as the actual barrier: only admin can run scripts. - `POST /api/scripts/run` and `POST /api/scripts/{id}/run` now require admin. - `POST /api/scripts/deploy` stays analyst-accessible (storing != executing). - Existing /run tests retargeted to admin_token; added regression tests asserting analyst → 403 on both endpoints. - CHANGELOG: BREAKING (security) bullet under Unreleased/Changed. Closes #44. * fix(security): admin-gate /deploy + harden sandbox blocklist (review #92) Reviewer of PR #92 flagged three MUST-FIXes that #44 wasn't fully closed: 1. /api/scripts/deploy still accepted analyst → planted-script attack path (analyst plants malicious source, waits for admin to /run). Now: /deploy also requires admin; the entire Script API is admin-only. 2. The "Minimum (same-day)" blocklist mitigations from issue #44 weren't applied. Added the introspection-chain dunders that the issue PoC pivots through: __subclasses__, __globals__, __class__, __base__, __bases__, __mro__, __dict__, __code__, __builtins__. Plus `vars` in BLOCKED_FUNCTIONS. Deliberately NOT adding __init__ / __getattribute__ (substring match would flag every legit `def __init__`) nor `type`/`dir` (frequent in legitimate admin scripts). Documented the trade-off inline. 3. Tests didn't cover the actual PoC payload nor non-analyst non-admin roles. Added test_run_pwn_payload_blocked parametrized over the issue's own PoC + two equivalent variants (lambda+__globals__, __mro__ traversal); these stay green only as long as the dunder list does. 
test__requires_admin tests now parametrize over (analyst, viewer, km_admin) so all three non-admin core roles are pinned at 403. Conftest extension: seeded_app now exposes viewer_token and km_admin_token as siblings to admin_token / analyst_token. CHANGELOG bullet updated to reflect /deploy gate change and new internal regression tests. 35/35 scripts tests pass locally. Refs review of #92. fix(tests): test_security TestScriptSandbox needs admin token after #44 hardening CI failure on PR #92 caught a missed test file. tests/test_security.py seeded only an analyst user and used the analyst token to drive sandbox tests. After the #44 admin-gate (deploy + run both admin-only), every sandbox test got 403 from the role gate before the AST/string check could run, so 'blocks os.system' / 'blocks eval' / etc. all failed. Fix: extend the fixture to also seed an admin user and return the admin token. Sandbox tests now reach the sandbox layer; access-control tests further down in the module continue to use the analyst that was kept around. 41/41 test_security.py tests pass locally. * fix(security): #92 round-3 — gate GET /api/scripts on admin role Devin Review caught: GET /api/scripts (app/api/scripts.py:44-51) was left on Depends(get_current_user) when the rest of the API moved to admin-only. ScriptRepository.list_all() does SELECT * FROM script_registry which returns ALL columns including 'source' (the full script body). So any authenticated user (viewer / analyst / km_admin) could read admin-deployed scripts — leak of code that may contain credentials, business logic, or admin-only operational details. CHANGELOG already says 'The entire Script API is now admin-only', which was true for /deploy, /run, /{id}/run, DELETE — just not for GET. Now consistent: every Script endpoint requires admin. Tests: - New parametrized test_list_scripts_requires_admin over (analyst, viewer, km_admin) tokens — all assert 403. 
- Updated test_list_scripts_empty in both test_scripts_api.py and test_api_scripts.py to use admin_token. 79 tests pass. Refs Devin Review of #92. * fix: cleanup unused imports, stale docstrings, and incomplete CHANGELOG - Remove unused imports: Path, List, get_current_user (ruff F401) - Trim docstrings to describe current behavior, not change history - CHANGELOG now lists GET /api/scripts among admin-gated endpoints - Remove diff-commenting inline comments from tests Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com> * fix: merge duplicate Changed sections into one per CLAUDE.md convention Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com> --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-04-27 21:13:56 +02:00
ZdenekSrotyr	4e4d2a39e6	chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94) * chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88) Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. Splitting them per concern. Moved (still in OSS, just under a vendor-neutral name): - scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh - scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh Removed (belongs in private consumer infra repos, not upstream OSS): - scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email) - scripts/grpn/README.md (GRPN hackathon deploy walkthrough) - docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log) Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers. This is the first wave of #88 — the remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR. Refs #88. * chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes) PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass — easier to review than two PRs. Wave-1 review-fix corrections: - Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only. - infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders. Wave-2 anonymization: - Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel. - Test fixtures (4 files): same set of replacements — 157 tests still pass. - .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs". - docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder. - 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-: hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-…. - scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern. Final grep `groupon\|grpn\|foundryai\|prj-grp\|groupondev\|34\.77\.(94\|102)\.…\|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries). CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected. Refs review of #94, closes #88. fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG) Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed: 1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest. 
Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution. 2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix. 3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos). 4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated. 5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets. Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon\|grpn\|foundryai\| prj-grp\|groupondev\|34.77.94.14\|34.77.102.61\|kids-ai-data\|padak/keboola_agent_cli returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass. * fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo) Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes — the audit was right, the previous grep was not paranoid enough): 1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'. 2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh' in operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path. 3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. 
Trivial wording fix. Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/ Caddyfile/.toml/.template/.example/.env* with the full token set (groupon\|grpn\|foundryai\|prj-grp\|groupondev\|34\.77\.94\.14\|34\.77\.102\.61\| kids-ai-data\|padak/keboola_agent_cli) returns ZERO hits, excluding CHANGELOG.md historical entries. * fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md Devin Review caught two findings on the latest round-3 commit: 1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents. 2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename. * fix(oss): redact remaining hardcoded IPs from planning docs + remove default email Devin Review caught two more leaks: 1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set — safer than carrying any specific identity. 2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with <redacted-ip>, preserving private/loopback/IAP ranges.	2026-04-27 20:24:34 +02:00
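The generic IP redaction mentioned at the end of the entry above could look roughly like the following sketch. It is an illustration under assumptions, not the script that was actually run: `redact_public_ips` and the regex are hypothetical, and the IAP range is taken to be Google Cloud's TCP-forwarding block 35.235.240.0/20.

```python
import ipaddress
import re

# Assumption: "IAP ranges" refers to Google Cloud IAP TCP forwarding (35.235.240.0/20).
IAP_RANGE = ipaddress.ip_network("35.235.240.0/20")

# Candidate dotted quads; range validation is delegated to the ipaddress module.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_public_ips(text, placeholder="<redacted-ip>"):
    """Replace public IPv4 addresses, keeping private, loopback, and IAP addresses."""

    def _replace(match):
        candidate = match.group(0)
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            # Not a valid address (e.g. an octet above 255); leave the text untouched.
            return candidate
        if addr.is_private or addr.is_loopback or addr in IAP_RANGE:
            return candidate
        return placeholder

    return IPV4_PATTERN.sub(_replace, text)


print(redact_public_ips("prod 35.195.96.98, local 127.0.0.1, internal 10.0.0.5"))
```

Matching any dotted quad and then classifying it avoids the round-1 problem of a pattern that only targeted one prefix. One caveat of this sketch: four-part version strings that happen to parse as public addresses would also be redacted.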

661 commits