Commit graph

661 commits

Author SHA1 Message Date
ZdenekSrotyr
7052a23552 release(0.30.0): per-connector tab UI + Keboola materialized parity + /admin/server-config full exposure
Highlights (full prose in CHANGELOG.md [0.30.0]):

- Smart local sync — Claude Code SessionStart/SessionEnd hooks via 'da analyst setup' + 'da sync --quiet' for hook-friendly output
- query_mode='materialized' end-to-end for BigQuery + Keboola — admin SELECT (against bq.dataset.x or kbc.bucket.table) → scheduler runs through DuckDB extension → parquet → da sync distribution
- /admin/tables per-connector tabs (BigQuery / Keboola / Jira), full Keboola Custom-SQL parity, form cleanup, per-row Manage access deep link
- /admin/server-config known-fields registry + structured nested editor: surfaces BQ optional knobs (billing_project, legacy_wrap_views, max_bytes_per_materialize), ai.base_url, new openmetadata + desktop sections, full corporate_memory governance schema
- da diagnose warns on USER_PROJECT_DENIED-prone billing_project=project config
- Schema v20 — adds source_query TEXT to table_registry
2026-05-01 20:38:34 +02:00
ZdenekSrotyr
b627de8344 feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
  - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
  - Non-BQ instances (Keboola-only, etc.) skip the check.
  - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
  - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
  - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
  - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
  - ai.base_url: provider list + UI hint
  - openmetadata + desktop: 'configurable via /admin/server-config UI' headers
  - corporate_memory: leading note that the schema is editable via UI

Other docs:
  - CHANGELOG.md: comprehensive Unreleased section
  - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
  - README.md: mode-first source table summary
  - docs/architecture.md: per-connector tab UI mention
  - cli/skills/connectors.md: bootstrap rails (parallel to #154)
  - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
  - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
  - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
2026-05-01 20:27:24 +02:00
ZdenekSrotyr
df7f5b1d9a feat(admin-ui): /admin/server-config known-fields registry + structured nested editor
Today /admin/server-config renders fields by iterating Object.keys(payload) on the YAML value — if a key isn't in instance.yaml, the operator can't see it. They have to know to type it via the JSON-patch textarea (which only renders for empty sections) or SSH and edit YAML.

Adds a known-fields registry (`_KNOWN_FIELDS` in app/api/admin.py) the UI consumes alongside the YAML payload. Renderer shows BOTH:
  - existing fields (from YAML) with current value
  - known-but-unset fields with dashed-border placeholder + hint, ready to fill in

Renderer (`renderField`, `renderSection`, `collectSection`):
  - kind="string"|"secret"|"bool"|"int"|"select"|"object"|"array"|"map" — picks input type
  - kind="object" with `fields` — recursive structured form, arbitrary depth (corporate_memory needs 3-4 levels)
  - kind="array" with `item_kind` — vertical stack of typed inputs + add/remove buttons
  - kind="map" with `key_kind` + `value_kind` — key:value rows + add/remove (used for confidence.base, domain_owners, entity_resolution.entities)
  - data-path encoded as JSON segment array so map keys with embedded dots (e.g. 'user_verification.correction') survive collect → patch round-trip
  - .cfg-field.is-unset CSS — dashed border, muted label, italic hint

Sections newly exposed (added to _EDITABLE_SECTIONS):
  - openmetadata: url, token (secret), cache_ttl_seconds, verify_ssl
  - desktop: jwt_issuer, jwt_secret (secret), url_scheme

Known fields populated for existing sections:
  - data_source.bigquery: billing_project (the cause of the 403 USER_PROJECT_DENIED footgun when SA can read but not bill the data project), legacy_wrap_views (bigquery_query() wrap for VIEWs — issue #101 default off, ON for view-heavy deployments), max_bytes_per_materialize (cost guardrail)
  - data_source.keboola: stack_url, project_id (hints; values already populated)
  - ai: base_url (required for openai_compat), structured_output (select)
  - corporate_memory: full schema from instance.yaml.example — distribution_mode, approval_mode, review_period_months, notify_on_new_items, sources.{claude_local_md,session_transcripts}, extraction.{model,sensitivity_check,contradiction_check}, confidence.{base,modifiers,decay.{mode,half_life_months,decay_rate_monthly,floor}}, contradiction_detection.{enabled,max_candidates}, entity_resolution.{enabled,entities}, domain_owners, domains
  - Known partial: confidence.modifiers is map<string, map<string, float>> — falls through to JSON-textarea with TODO; structured editor for that one shape needs more renderer work

Tests:
  - test_admin_server_config_known_fields — registry envelope shape, smoke fixture
  - test_admin_server_config_renderer_depth — 4-level nested objects, arrays of strings, maps of floats, dotted-key safety
  - test_admin_server_config_corp_memory — full corporate_memory schema, 12 fields incl. nested
  - test_admin_server_config — existing tests adjusted for new shape
2026-05-01 20:27:01 +02:00
ZdenekSrotyr
c63f54d643 feat(admin-ui): /admin/tables per-connector tabs + Keboola materialized parity + form cleanup + Manage access deep link
Replaces the single mixed Jinja-branched form at /admin/tables with a per-connector tab interface and brings Keboola to capability parity with BigQuery.

Tab structure:
  - BigQuery tab: Register modal with two-question radio model (Q1 Live | Synced × Q2 Whole | Custom SQL), Discover datasets / List tables / Use-table-as-base autocomplete buttons, table-vs-view auto-detection hint, per-tab listing filter
  - Keboola tab: same two-question radio (Q2 only — no Live mode for Keboola), Custom SQL textarea against kbc."bucket"."table" for materialized rows
  - Jira tab: read-only listing (Jira is webhook-driven; no Register form)
  - Active tab persists in window.location.hash so refresh keeps the operator in place

Form cleanup (within tabs):
  - Drops the misleading 'Sync Strategy' dropdown — runtime never read it (only profiler.is_partitioned() consumes the value for parquet-layout detection); kept in DB for back-compat (Pydantic deprecated)
  - Adds Sync Schedule input to Keboola Register/Edit (was missing — scheduler honored per-table cron via is_table_due() for every source but the Keboola UI had no surface)
  - Hides Primary Key under <details>Advanced with clarifying hint that it's catalog-metadata only (Agnes does not perform upsert/dedup; every sync is a full overwrite)
  - Drops the Strategy column from the registry listing (every Keboola row defaulted to full_refresh after Strategy was hidden — column was noise)
  - Removes the legacy out-of-tab #registerModal + the legacy global Discovery panel; each tab now owns its own header + Register button + listing div

Edit modal:
  - BigQuery Edit modal physically relocated into <section id="tab-content-bigquery"> (mirrors Phase E Register placement)
  - Keboola Edit modal mirrors Register (same Q2 radio, Discover/List buttons via parameterized helpers)
  - openEditModal(table) dispatches by source_type to the right modal — fixes a quiet bug where Phase F's openEditKeboolaModal was never wired up and Keboola edits silently used the legacy modal

Per-row Manage access deep link:
  - Each row in the per-tab listing has a lock-icon button between Edit and Delete that navigates to /admin/access#table:<table_id>
  - admin_access.html bootstrap reads window.location.hash and pre-fills the resource filter, mirroring the existing ?group=<id> deep-link pattern

Tests:
  - test_admin_tables_tab_ui.py — tab nav, hash persistence, register-button-per-tab, listing partition by source_type, Manage access deep link
  - test_admin_tables_ui_materialized.py — two-question radio (BQ + Keboola), Discover/List/Use-as-base buttons, Edit modal parity, Jira read-only
2026-05-01 20:26:29 +02:00
ZdenekSrotyr
85d3810535 feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst
Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors.

Backend (BigQuery + Keboola, schema v20):
  - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19)
  - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state
  - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc
  - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows
  - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query
  - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload

Operator + CLI surface:
  - da admin register-table --query / --query-mode materialized
  - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators

Tests (incl. spike + integration + regression):
  - test_db_migration_v20, test_table_registry_source_query
  - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips
  - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds)
  - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized
  - test_api_admin_materialized, test_cli_admin_materialized
  - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e

Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.
2026-05-01 20:25:56 +02:00
ZdenekSrotyr
d0b7e122d6 feat(cli): smart local sync — Claude Code SessionStart/SessionEnd hooks + da sync --quiet
The analyst flow becomes a closed loop with the server-curated table catalog:

  - `da analyst setup` writes `<workspace>/.claude/settings.json` with two hooks:
      SessionStart → `da sync --quiet || true`        — pulls fresh RBAC-filtered parquets at session start
      SessionEnd   → `da sync --upload-only --quiet || true` — uploads session jsonl + CLAUDE.local.md
  - `|| true` keeps Claude Code unblocked when the server is down.
  - Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace.
  - `da sync --quiet` rewrites the CLI output for hook consumption — 0 stdout on success, single-line error on failure.
  - Existing settings.json is patched (deep-merged), not overwritten; malformed JSON is reported, not silently overwritten.

Tests cover: workspace bootstrap, hook insertion, malformed-json safety, quiet-mode output shape.
2026-05-01 20:25:27 +02:00
Vojtech
c364f65127
fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE (#159)
* fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE

OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which
causes strict TLS stacks (rustls / webpki, used by uv, cargo, and
future versions of pip) to reject the cert with
'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9.
Browsers, curl, and OpenSSL-based clients tolerated the violation,
hiding the bug until a uv user hit it.

Affects every VM running on the self-signed fallback while the corp
PKI hasn't published the real chain yet. Fix lands on the next
agnes-tls-rotate.timer tick (or 'systemctl start
agnes-tls-rotate.service' for an immediate refresh). Existing CSR /
real-cert paths unaffected; only the bring-up fallback regenerates.

* chore(release): cut 0.29.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-01 12:23:14 +02:00
Vojtech
bd7b8c3233
fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154)
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template

Closes #153.

The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.

Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.

## Changes

### config/claude_md_template.txt (+83)

Added a `## Remote Queries (BigQuery)` section covering:

- Discovery first — `da catalog --json | jq '...'` to see all tables
  with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
  - `da fetch` (preferred) — materialize a filtered subset locally,
    query the snapshot, drop when done.
  - `da query --remote` — one-shot server-side execution (cheap probes).
  - `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
  --select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
  DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
  use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
  ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.

### docs/setup/claude_md_template.txt (deleted)

Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
   shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.

### CHANGELOG.md

`### Fixed` and `### Removed` entries under [Unreleased].

## Tested

- Manually walked the diff against `da skills show agnes-data-querying`
  output on a live VM (foundryai-development) — patterns + flags
  match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
  is identical to existing template substitution path so render is
  not at risk.

## Out of scope

- The companion gap that data_description.md often only enumerates
  query_mode: "local" tables (no signal that other modes exist) —
  separate concern, fix likely belongs in the metadata generator
  on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
  `query_mode: "remote"` in the registry — workflow improvement, not
  a code bug.

* chore(release): cut 0.28.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-01 12:06:41 +02:00
minasarustamyan
d4ac84dd46
feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150)
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration

BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.

Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
  → drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
  s historic FK constraints. INSERT picks intersection sloupců, takže
  test fixtures s minimal pre-v19 schemou migrate cleanly.

Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
  (TABLE je unconditionally enabled)

API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
  je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
  gate přepnut na can_access(user_id, "table", dataset, conn)

Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
  z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
  (admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()

CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
  (back-compat se starými token.json soubory zachována — pole se ignoruje)

Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
  (can_access_table over resource_grants), test_journey_rbac.py
  (drop access-request flow), test_resource_types.py (drop env-gate
  tests, drop is_public from helpers), test_v2_*.py (drop role-based
  user dicts in favor of id-based + Admin group membership),
  test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
  a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
  test_can_access_table_no_implicit_public (test_rbac.py),
  test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated

Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)

Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.

Plan: ~/.claude/plans/floofy-coalescing-parnas.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): cut 0.27.0

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 22:02:16 +02:00
Vojtech
2447da7bb1
refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149)
* refactor(ops): bake all host artifacts into image, drop every curl-from-main

Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.

## Motivation

The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:

  docker-compose.yml
  docker-compose.prod.yml
  docker-compose.host-mount.yml
  docker-compose.tls.yml
  Caddyfile
  scripts/ops/agnes-auto-upgrade.sh   (added in 0.25.0)

Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:

1. **Split-brain version pinning.** image_tag pins the docker image to an
   immutable digest. The compose files + script bypassed that pinning by
   tracking `main` (or the rarely-set compose_ref). A customer pinned to
   stable-2026.04.516 could wake up tomorrow with their host artifacts
   floating on whatever shipped to main overnight — even though they're
   explicitly pinned for stability.

2. **No rollback knob.** Reverting a bad host artifact meant reverting
   the upstream PR globally — affects every customer that reboots after
   the bad commit. No "rollback for me only" path; tag-pinning gave no
   protection.

3. **Public-internet dependency on every boot.** The image is already
   pulled from a private registry on the same boot. Reusing that channel
   is strictly cheaper than adding a second one. Customers with restricted
   egress (no raw.githubusercontent.com reachability) silently broke on
   every boot.

## Changes

### Dockerfile (+19 -8)

After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:

  agnes-auto-upgrade.sh                  (mode 0755 — host cron driver)
  docker-compose.{yml,prod,host-mount,tls}.yml
  Caddyfile                              (mode 0644)

Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board,
stable path that won't shift if /app structure refactors.

### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)

Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:

    docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
    EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
    trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
    chmod +x /usr/local/bin/agnes-auto-upgrade.sh

The auto-upgrade section (#6) is now a no-op — script is already in place.

### infra/modules/customer-instance/variables.tf (+1 -1)

`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.

### CHANGELOG.md

`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).

## Out of scope (filed as follow-ups)

1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
   curls the same artifacts from main. Symmetric fix needed there.
   Will file as a separate PR against the infra repo.

2. **Self-update inside agnes-auto-upgrade.sh** after a successful
   `docker compose pull` of a new digest. Otherwise the running cron
   keeps using the OLD baked-in script for one tick after image upgrade.
   ~10 LOC. Deferred to keep this PR scoped.

3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
   bash currently sourced via the infra repo. Should follow the same
   bake-into-image pattern.

## Tested

- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
  artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
  this Dockerfile change. **Recommend reviewer trigger CI build, then
  do a single VM-recreate against a dev VM (e.g. foundryai-development)
  to confirm the extract path works end-to-end before merge.**

## Compatibility

- Existing VMs running 0.25.0 are unaffected — they have host artifacts
  in place from `curl from main` already; this PR doesn't touch them.
  They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
  in the image) would FAIL the docker cp. Current diff fails-loud (no
  fallback). Recommend operators upgrade to a fresh-enough image_tag
  alongside the template upgrade — same coupling as any compose-flag bump.

* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances

The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.

Devin ANALYSIS_0004 on PR #149.

* fix(changelog): mark BREAKING + drop private-repo reference

Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.

Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.

Devin BUG_0001 + WARN_0001 on PR #149.

* chore(release): cut 0.26.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 21:40:25 +02:00
Vojtech
ddffdfeafd
fix(ops): fail-fast guard in agnes-auto-upgrade — refuse start if config disk not mounted (#146)
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted

Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.

## Why an app-side guard

agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.

The boot-time fixes in infra are preventive. This is the runtime backstop.

## Behavior

Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:

1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
   - Mount the config disk if /data/state is not a mountpoint.
   - Detect mismatch: if /data/state is mounted from the wrong source,
     umount and retry.
2. After the loop, assert findmnt source matches the config disk.
   - On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
     marks the service failed; no docker compose action runs; existing
     containers (if any) keep running on stale state, but no new write
     lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
   on every run. Idempotent. Guards against propagation regressions
   sneaking back in via future docker / kernel changes.

VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.

## Tested

Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.

## Rollout

- This file is fetched by infra startup.sh from
  raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
  boot. Once merged to main, all VMs pick up the new script on their
  next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
  `scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
   ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
   /usr/local/bin/agnes-auto-upgrade.sh` (already done on
  foundryai-development).

* chore: vendor-agnostic comment + changelog text

Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.

* fix(ops): suppress mount stderr in retry loop

Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.

Devin BUG_0001 on PR #146.

* fix(changelog): move auto-upgrade entry to [Unreleased]

Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.

Devin BUG_0001 on PR #146.

* fix(infra): single-source agnes-auto-upgrade.sh via curl from main

Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).

Devin ANALYSIS_0001 on PR #146.

* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling

The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.

Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.

Devin BUG_0001 + ANALYSIS_0001 on PR #146.

* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing

Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.

Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).

Devin BUG_0001 on PR #146.

* chore(release): cut 0.25.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 20:07:22 +02:00
minasarustamyan
fb1573766a
feat(admin): users/groups UI polish + SSO lock + v18 migration (#142)
Cuts release 0.24.0.

## Highlights
- SSO-managed accounts read-only for password / delete operations (UI + API). New `is_sso_user` flag derived from group memberships.
- Admin/Everyone system rows show `google_sync` chip + Workspace email subtitle when env-mapped.
- Origin pill vocabulary unified across `/admin/groups`, `/admin/access`, `/admin/users`, `/admin/users/{id}`, `/profile` (Admin yellow, Everyone gray, google_sync green, custom purple).
- Effective-access readout no longer short-circuits for admin users — always renders per-resource breakdown.
- Schema migration v18 drops stranded non-google memberships in env-mapped Admin/Everyone groups (cleans up v13's blanket Everyone backfill).

## Devin findings addressed
- _is_sso_user requires source='google_sync' on system-group branches (so v13 system_seed memberships in env-mapped Everyone don't lock out the admin).
- POST add-to-group returns correct origin via _derive_origin (matching GET).
- 8 customer-specific token instances (groupon.com / foundryai) replaced with vendor-neutral placeholders across templates, tests, and CHANGELOG.
- deriveDisplayName name-skip for canonical "Admin"/"Everyone" so an overlapping AGNES_GOOGLE_GROUP_PREFIX doesn't mangle the chip text.

See CHANGELOG [0.24.0] for full notes.
2026-04-30 15:16:04 +02:00
ZdenekSrotyr
f3d252f17d
fix(tls-rotate): chown CERT_DIR to UID 999 so the app container can read its own certs (#143)
The script's `mkdir -p` left ownership of `/data/state/certs/` to whichever
process won the create race — root when systemd's timer fired before the
app container's first volume init, UID 999 when the container ran first.
With mode 700, a root-owned dir blocks the UID-999 agnes container from
reading its own fullchain.pem; `_read_agnes_ca_pem()` returns None, and
the cross-platform TLS trust block (Step 0 from PR #137) silently
disappears from the /install setup prompt. Operators on the unlucky-race
VMs got a setup prompt that couldn't bootstrap client trust against the
self-signed host. Existing VMs self-heal on next timer tick.
2026-04-30 13:21:59 +02:00
ZdenekSrotyr
70672204fe
feat(memory): admin Edit + MEMORY_DOMAIN RBAC + ai-section UI (#141)
Cuts release 0.23.0.

## Highlights
- Single-item Edit button on every memory item card (modal hits PATCH /api/memory/admin/{id}).
- MEMORY_DOMAIN RBAC resource type — admins grant user_groups access to specific domains via /admin/access. Composes with existing audience filter (OR semantics, no-op when no grants).
- ai: section editable in /admin/server-config — admins can set ANTHROPIC_API_KEY / model / provider / base_url for the corporate-memory extractor without editing instance.yaml directly. api_key auto-masked.

## Devin findings addressed
- Modal NULL→empty fix (audience visibility wouldn't break).
- Stats endpoint granted_domains parity with list endpoint.
- Documented intentional MEMORY_DOMAIN→audience bypass.
- Documented conscious ai.base_url SSRF exclusion (legit internal LiteLLM/vLLM proxies).

See CHANGELOG [0.23.0] for full notes.
2026-04-30 11:04:41 +02:00
ZdenekSrotyr
83adf01bde
fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138)
* docs(spec): #134 unify BigQuery access behind BqAccess facade

Brainstorm output for issue #134. Captures:
- root cause (incl. correction of the issue's hypothesis about commit 33a9964)
- BqAccess facade API + project resolution rules
- error contract — typed BqAccessError mapped to HTTP 502 for upstream
  BQ failures, 500 for deployment/config bugs
- migration plan for v2_scan, v2_sample, RemoteQueryEngine
- test rewrite eliminating _bq_client_factory injection point
- E2E verification protocol on agnes-development as success criterion

* docs(spec): #134 revise after first review

Incorporates code-reviewer findings:

Must-fix:
- Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope.
- Reframe v2_scan headline: missing try/except around BQ calls is the
  actual cause of bare 500s, not project resolution (which 33a9964 fixed).
- List two more deferred call sites (extractor.py, register_bq_table)
  with explicit rationale.

Important:
- Drop billing != data clause from cross_project_forbidden heuristic;
  rely only on 'serviceusage' substring. billing != data is normal
  for cross-project setup, was over-classifying.
- Split bq_bad_request into _user (400) and _server (502) variants;
  add sql_origin parameter to translate_bq_error so call sites declare
  whether SQL contains user input.
- Add @functools.cache to BqAccess.from_config; document tests bypass
  via dependency_overrides.
- Replace monkey-patched-classmethod test pattern with
  BqAccess(client_factory=...) injection at construction time. Cleaner
  than today's _bq_client_factory and 1:1 migration shape.
- Keep BqProjects.data (reviewer assumed registry has source_project;
  it doesn't). Multi-project explicitly listed as non-goal with note.

Nice-to-have:
- Add 'Implementation strategy' section: 2 staged commits (bug fix
  alone is revertable; refactor follows).
- Extend E2E protocol to cover all three endpoints, not just /sample.
- Note removal of stale docstring at src/remote_query.py:204.

* docs(spec): #134 revision 3 — incorporates second-round review

Must-fix from second review:
- v2_schema split into two migration cases: _fetch_bq_schema translates
  errors via translate_bq_error; _fetch_bq_table_options preserves its
  swallow-all 'except Exception → return {}' so /schema doesn't 502 on
  partition-info failures.
- RemoteQueryEngine.__init__ now resolves BqAccess lazily (in
  _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests
  in test_remote_query.py would suddenly fail with not_configured.
- translate_bq_error pass-through for BqAccessError is now load-bearing
  (clause 1, before any Google-API branch). bq.client() raises BqAccessError
  for bq_lib_missing/auth_failed; without explicit pass-through those
  fall to 'unknown' and re-raise as bare 500.
- Commit 1 now emits the SAME structured response shape as commit 2 to
  avoid contract churn between commits.
- BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments
  — flagged in CHANGELOG ### Changed.

Editorial:
- sql_origin renamed to bad_request_status with values 'client_error' /
  'upstream_error' (clearer about what the parameter actually decides).
  bq_bad_request_user/_server kinds collapsed to bq_bad_request (400)
  and bq_upstream_error (502).
- CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller;
  unaffected because new bq_access kwarg has default None.
- Added unit/integration tests for the new contracts:
  test_translate_passes_through_BqAccessError,
  test_v2_scan_returns_500_on_bq_lib_missing,
  test_v2_schema_returns_200_with_empty_partition_on_bq_failure,
  test_resolve_succeeds_after_config_set.
- E2E protocol now covers /schema as the fourth endpoint.
- Documented functools.cache-doesn't-cache-exceptions semantics and
  fixture nullcontext-doesn't-close caveat for nested sessions.

* docs(spec): #134 revision 4 — incorporates third-round review

Third reviewer verdict: 'implementation-ready with two trivial edits';
explicitly noted prior rounds did the heavy lifting.

Edits:
1. get_bq_access() module-level function instead of @classmethod
   @functools.cache from_config. Removes the classmethod-cache stacking
   footgun (different Python versions wrap differently) and gives FastAPI's
   dependency introspection a clean function signature. Drops the
   'Do not subclass BqAccess' caveat that no longer applies.

2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample),
   _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema
   (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all
   in commit 1 — preserved as-is, then migrated (still preserved) in commit 2.
   All three endpoints emit the same structured body shape so client parsers
   see one consistent contract throughout the staged rollout. No more
   half-rolled-out window where /sample is bare 500 while /scan is
   structured 502.

* docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification)

Bite-sized TDD tasks. 3 phases, 16 tasks total:

Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints:
  Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan,
  _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options
  preserved untouched. CHANGELOG Fixed entries.

Phase 2 (Commit 2) — BqAccess facade extraction + migration:
  Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up
  (BqProjects, BqAccessError, translate_bq_error, default factories,
  BqAccess class, get_bq_access module-level cached). Task 2.6 adds
  conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema
  to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests
  (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG
  Changed BREAKING + Internal.

Phase 3 — Verification:
  3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual
  E2E on agnes-development per spec protocol → close #134.

Self-review table maps spec sections to implementing tasks; no gaps.

* fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema

Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict
block only) with try/except for google.api_core exceptions, translating to
HTTPException with a structured body shape: {error, message, details}.

Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500
with no body when the SA on agnes-development hit cross-project Forbidden
on serviceusage.services.use.

Also fixes /sample's missing billing_project fallback (the bug 33a9964
fixed for /scan never landed here).

Status code split:
  - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is
    user-derived from req.select/where/order_by.
  - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is
    server-constructed from validated identifiers.
  - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in
    error message (with hint pointing at data_source.bigquery.billing_project),
    else bq_forbidden.

Body shape matches what the upcoming BqAccess refactor (next commit) will
produce, so client-side parsers see one consistent contract throughout
the staged rollout.

_fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty
contract is intentional and survives into the refactor; /schema continues to
return 200 with empty partition info when partition queries fail.

Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema
endpoints exist only to make the test pattern (monkeypatching whole
_fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal
once BqAccess centralizes translation.

* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine

Extracts the duplicated BigQuery-access pattern (project resolution +
client construction + DuckDB-extension session + Google-API error
translation) into connectors/bigquery/access.py. Migrates four
call sites to use it:

- app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan
- app/api/v2_sample.py — _fetch_bq_sample
- app/api/v2_schema.py — _fetch_bq_schema (strict translation),
  _fetch_bq_table_options (preserves swallow-all best-effort contract)
- src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg

The new module exposes:
- BqProjects (frozen dataclass: billing + data project IDs)
- BqAccessError (typed exception with HTTP_STATUS class mapping)
- BqAccess (facade with injectable client_factory/duckdb_session_factory
  for tests; defaults call the real google-cloud-bigquery + DuckDB extension)
- get_bq_access (module-level @functools.cache; FastAPI Depends target)
- translate_bq_error (Google API exception → BqAccessError mapper, with
  BqAccessError pass-through, 'serviceusage'-substring heuristic for
  cross_project_forbidden, and bad_request_status param distinguishing
  user-derived (400) from server-constructed (502) SQL)
- _default_client_factory, _default_duckdb_session_factory

RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests
migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only
RemoteQueryEngine tests need no changes — bq_access defaults to None and
get_bq_access() is only invoked on first BQ call (lazy resolution).
BqAccessError raised internally is translated to RemoteQueryError(
error_type="bq_error") in _get_bq_client to preserve the engine's
existing public contract — CLI and /api/query/hybrid callers see no change.

Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample,
test_v2_schema) migrate from monkey-patching whole _fetch_* functions
to using the new bq_access fixture in tests/conftest.py — which
exercises the REAL translation path through BqAccess + translate_bq_error,
closing the test gap flagged in Task 1.1's review.

Side-effect behavior change: v2_sample's FROM clause now uses the data
project (instance.yaml data_source.bigquery.project), not the conflated
billing_project from Phase 1. Documented in CHANGELOG ### Internal.

BREAKING for deployments combining BIGQUERY_PROJECT env var with
data_source.bigquery.project in instance.yaml — env var now overrides
data project too. See CHANGELOG ### Changed.

Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py,
scripts/duckdb_manager.register_bq_table) explicitly out of scope;
tracked as follow-up.

Removed stale docstring at the previous src/remote_query.py:204
that referenced scripts.duckdb_manager._create_bq_client as the default
BQ client factory (RemoteQueryEngine never actually used that function).

Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py +
tests/test_remote_query.py migrated to bq_access fixture (counts unchanged
or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures
(DB migration tests with unrelated internal_roles DependencyException —
not introduced by this PR).

* fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed)

CI on PR #138 caught: bigquery.Client(...) resolves Application Default
Credentials at construction time; without ADC (CI without SA key, dev
laptop without 'gcloud auth application-default login') it raises
google.auth.exceptions.DefaultCredentialsError synchronously.

Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError
propagated as raw exception — and from production endpoints would surface
as bare 500 (the exact failure mode #134 sets out to fix).

Now translates to BqAccessError(kind='auth_failed', details.hint='Run
gcloud auth application-default login...'). Endpoint catch chain returns
HTTP 502 with structured body. Adds unit test
test_raises_auth_failed_on_default_credentials_error.

Third-round spec review flagged this case in passing; the fix didn't land.
CI's auth-less environment surfaced it.

* fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured

Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends
breaks all v2 endpoints for non-BigQuery instances'.

Pre-fix: get_bq_access() raised BqAccessError(not_configured) when
neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set.
Because FastAPI resolves Depends() BEFORE the endpoint body runs, this
exception fires during dep-injection — the endpoint's try/except
BqAccessError clause never gets a chance to catch it. Result: every
v2 request on Keboola-only or CSV-only instances returned bare HTTP
500, even for local-source tables that never touch BigQuery.

Fix: get_bq_access() now returns a sentinel BqAccess with empty
BqProjects and factories that raise BqAccessError(not_configured)
on actual use. Construction succeeds, FastAPI's dep-injection cleanly
yields the sentinel, the endpoint runs. The local-source code path
in build_sample / build_schema / etc. never calls bq.client() or
bq.duckdb_session() (it reads parquet directly), so non-BQ tables
return 200 as before. Only when an endpoint actually tries to query
BQ (source_type == 'bigquery') does the sentinel raise — and the
endpoint's existing except BqAccessError catches it normally,
returning structured 502 with hint.

Test get_bq_access::test_raises_not_configured_when_neither_set
renamed and rewritten to test_returns_sentinel_when_neither_set:
asserts BqAccess is returned, then asserts client() and
duckdb_session() each raise BqAccessError(not_configured) on call.

Test test_does_not_cache_exceptions removed (no longer applicable)
and replaced with test_sentinel_is_cached_per_process documenting
the operator-restart-on-config-change contract.

* docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule)

Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained
customer-specific deployment hostnames, deployment names, and a GCP
project ID that violated CLAUDE.md's vendor-agnostic OSS rule
('Nothing customer-specific belongs in code, configuration defaults,
comments, docs, commit messages, PR titles, or PR bodies').

Replacements:
  agnes-development.groupondev.com -> <your-agnes-host>
  agnes-development                -> <your-dev-instance>
  prj-grp-dataview-prod-1ff9       -> <your-data-project>
  s1_session_landings              -> <bq_table_id>

E2E verification semantics unchanged — operators still run the same
four curls + config flip + retry, just substituting their own host /
deployment name / project / table.

* fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache

Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at
process level, so it captures BigQuery project IDs at first call and
ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints
re-read get_value() on every request, so admin /api/admin/server-config
saves (which call instance_config.reset_cache()) hot-reloaded the BQ
project. Without this fix, my refactor silently regresses that contract
— operators editing instance.yaml via the admin UI would see no effect
on v2 endpoints until container restart.

instance_config.reset_cache() now also calls
connectors.bigquery.access.get_bq_access.cache_clear() (lazy import,
swallowed if connectors module isn't loaded — keeps instance_config
usable in isolated unit tests).

Adds test_instance_config_reset_cache_invalidates_get_bq_access as
regression guard. Updates CHANGELOG Internal entry to mention the
hot-reload contract + the not-configured sentinel behavior (round-3
fix from Devin BUG_0001 was previously only in commit message).

* fix(bq_access): surface not_configured before identifier validation + plan path genericize

Devin BUG_0001 + BUG_0002 round 5 on PR #138.

BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md
vendor-agnostic rule. Replaced with '<worktree-root>' placeholder.

BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel
BqAccess (BQ not configured), the empty bq.projects.data was reaching
validate_quoted_identifier first and raising ValueError -> endpoint
mapped to HTTP 400 'unsafe_identifier' instead of structured 500
'not_configured' with hint.

Each fetch helper now checks 'if not bq.projects.data: bq.client()' as
the first step, which triggers the sentinel's BqAccessError(not_configured).
Endpoint catches the typed error and returns HTTP 500 with hint pointing
at data_source.bigquery.project. Best-effort _fetch_bq_table_options
returns {} silently in this case (preserves the swallow-all contract).

* fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match

Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery
extension is a C++ plugin making its own HTTP calls — when BQ returns
403, it throws duckdb.IOException with the BQ error embedded as text,
not gax.Forbidden. translate_bq_error's isinstance checks would miss
these, falling to case 7 → bare 500 in production for v2_scan, v2_sample,
and v2_schema (the bigquery_query() paths).

Fix: last-resort string-match heuristic before the re-raise. 'Forbidden'
/ '403' / 'Bad Request' / '400' in the lowercased message classifies via
the same kind hierarchy. The 'serviceusage' substring still distinguishes
cross_project_forbidden from bq_forbidden. Specific enough that random
exceptions without HTTP-error keywords still re-raise.

Adds 4 unit tests covering the new heuristic + the 'don't swallow random
exceptions' invariant.

* chore(release): cut 0.22.0

PR #138 contains issue #134 user-visible behavior changes:
- BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml
  data_source.bigquery.project for v2 endpoints (previously
  RemoteQueryEngine billing only).
- Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate,
  /schema when BigQuery raises Forbidden/BadRequest (was bare 500).
- Internal: BqAccess facade refactor unifying four duplicate BQ-access
  call sites; instance_config.reset_cache() now invalidates BqAccess
  cache too so admin server-config saves hot-reload BQ project IDs.

Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
2026-04-30 10:11:20 +02:00
ZdenekSrotyr
b5178fe942
fix(ci): smoke-test stale route + rollback ghcr auth + issues:write (#140)
Three CI fixes triggered by the failed PR #137 deploy:

1. scripts/smoke-test.sh: assertion 8 was hitting /api/admin/tables (renamed to /api/admin/registry long ago). The 404 was treated as deployment regression and triggered the auto-rollback. Same stale URL also fixed in CLAUDE.md, README.md, dev_docs/server.md.

2. .github/workflows/release.yml smoke-test job: added Log in to GHCR step. The auto-rollback's docker push :stable was failing with 'unauthenticated' because the smoke-test job had no GHCR login of its own — leaving :stable pointing at the broken image.

3. Rollback step gained GH_TOKEN env, AND the workflow's permissions block gained issues:write. Both were needed for gh issue create to actually create the alert issue (was silently swallowed by the || echo fallback).

Manual cleanup outside this PR: :stable currently points at the broken PR #137 image — needs manual retag back to stable-2026.04.505.
2026-04-30 09:42:27 +02:00
minasarustamyan
4ec5ff44dd
feat(setup): cross-platform TLS bootstrap + marketplace plugin install (#137)
Bootstraps the Agnes Claude Code marketplace + RBAC-allowed plugins from
the dashboard CTA, and inlines the server's TLS cert when the chain isn't
publicly trusted (self-signed / private CA). Cross-platform setup prompt
covers Windows Git Bash, macOS, Linux. Includes Bun-compiled `claude` fix
(macOS goes via git-clone fallback, same as Windows), PAT stripping after
clone, explicit error handling, and four rounds of Devin Review fixes
(phantom step references, $PLATFORM re-detection, heredoc/awk line-count
sync). Cuts 0.21.0.

See CHANGELOG.md [0.21.0] section for details.
2026-04-30 08:56:45 +02:00
Vojtech
38f6b639d2
feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136)
Cuts release 0.20.0.

## Highlights
- X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation)
- Error pages (HTML + JSON 500) surface request_id for support tickets
- Dev debug toolbar gated by DEBUG=1 — fastapi-debug-toolbar with custom DuckDBPanel
- Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls
- Telegram bot drops bot.log file — stdout only (BREAKING)

## Devin findings addressed
- BUG_0001: .env.template no longer claims FastAPI debug=True
- BUG_0002: subprocess extractor logs INFO to stderr again
- ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before)
- BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production
- BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor — follow-up to update CLAUDE.md if needed

See CHANGELOG [0.20.0] for full notes.
2026-04-29 22:54:21 +02:00
ZdenekSrotyr
b7a1795834
feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135)
Bundles 4 issues:
- #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators)
- #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation)
- #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS)
- #89 — OpenMetadataClient verify=True default (BREAKING for self-signed)

Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.
2026-04-29 22:06:30 +02:00
minasarustamyan
953bd9d250
fix(marketplace): use plugin.json name in synth marketplace.json (#133)
Closes the /plugin UI 'Plugin <X> not found in marketplace' bug. Synth marketplace.json catalog 'name' now reads from <plugin_dir>/.claude-plugin/plugin.json (with fallback to upstream marketplace.json name). On-disk plugins/<slug>-<plugin>/ layout preserved so cross-marketplace files don't collide. /marketplace/info exposes both name and prefixed_name (BREAKING — downstream tooling parsing 'name' for the slug-prefixed form must switch to prefixed_name).
2026-04-29 19:25:57 +02:00
ZdenekSrotyr
514fe2c8b6
chore(release): cut 0.18.0
Bundles #119 (BigQuery register-table M1), #126 (memory tree+duplicates+bulk-edit), #131 (Google groups prefix filter, BREAKING — auto-Everyone removed).
2026-04-29 14:34:58 +02:00
minasarustamyan
c940593a90
feat(auth): Google Workspace group prefix filter + system mapping (#131)
Three new env vars wire the Google OAuth callback to a configurable Workspace prefix and route admin/everyone Workspace groups onto the seeded system rows: AGNES_GOOGLE_GROUP_PREFIX, AGNES_GROUP_ADMIN_EMAIL, AGNES_GROUP_EVERYONE_EMAIL. Login gate redirects users with no prefix-matching group to /login?error=not_in_allowed_group. BREAKING: auto-Everyone membership for new users removed. Admin UI/API are read-only on Google-managed groups. See docs/auth-groups.md.
2026-04-29 14:08:04 +02:00
ZdenekSrotyr
82c5d71d63
feat(memory): #62 — duplicate hints + tree-view + bulk-edit (#126)
Issue #62. Tree view with cross-axis filtering, duplicate-candidate hints (Jaccard score on entity overlap), bulk-edit endpoints (PATCH /api/memory/admin/{id} + POST /api/memory/admin/bulk-update), schema v17 (knowledge_item_relations), full CLI parity (da admin memory tree/edit/bulk-edit/duplicates list/resolve).
2026-04-29 13:55:15 +02:00
ZdenekSrotyr
1824b9dd9c
feat(admin): #108 M1 — BigQuery table registration in UI + CLI (#119)
Issue #108 Milestone 1. Adds BigQuery table registration via /admin/tables UI and `da admin register-table` CLI without hand-editing table_registry. POST /api/admin/register-table/precheck for round-trip validation. --dry-run flag on CLI. Audit-log entries on register/update/unregister. PUT /api/admin/registry/{id} now preserves registered_at (closes #130).
2026-04-29 13:18:31 +02:00
ZdenekSrotyr
995e4cd366
fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret (#127)
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret

Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:

1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
   from the scheduler container, racing the app's long-lived system.duckdb
   handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
   "Could not set lock on file ... PID 0".

2. The data-refresh + new marketplaces jobs both 401-ed on the API because
   SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
   The scheduler had no credential to authenticate with.

Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
  through the app process so it inherits the existing DB connection.
- Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and
  the scheduler is reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
  startup script generates a 256-bit secret on first boot, persists it
  across reboots, and writes it to /opt/agnes/.env. Both containers source
  the same .env. The app validates incoming Bearer tokens against the env
  var (constant-time, length-floored) and resolves matches to a synthetic
  scheduler@system.local user that's a member of the Admin system group.
  Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
  has a valid actor; lazy seed in get_scheduler_user covers token rotation
  before the next app restart.

Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.

Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.

* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127

Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.

* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127

Two findings from Devin re-review on commit 5fbad15:

1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
   event loop. sync_marketplaces() does blocking I/O (subprocess git
   clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
   and would freeze every concurrent request for the duration of a bulk
   sync. Switched to plain def so FastAPI auto-routes to the thread pool.

2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
   Bulk marketplace sync iterates the registry under a single lock with
   up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
   scheduler then sees a timeout, doesn't update last_run, and re-fires
   on the next 30s tick, queueing redundant work. Per-job timeout
   override added to the JOBS tuple; marketplaces gets 900s (15 min),
   data-refresh keeps 120s, health-check 30s.

* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127

require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.

Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.

Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392
also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR
and I agree; tracking separately.

* release(0.17.0): cut + clean up CHANGELOG duplicates

Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).

Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).

Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0] — the canonical short summaries (with GitHub-release
links) already exist below 0.16.0, the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
2026-04-29 11:44:00 +02:00
dependabot[bot]
7012966482
chore(deps): bump actions/checkout from 5 to 6 (#125)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: ZdenekSrotyr <139972147+ZdenekSrotyr@users.noreply.github.com>
2026-04-29 09:58:48 +02:00
dependabot[bot]
8d0edbf1c1
chore(deps): bump peter-evans/create-pull-request from 7 to 8 (#124)
Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 7 to 8.
- [Release notes](https://github.com/peter-evans/create-pull-request/releases)
- [Commits](https://github.com/peter-evans/create-pull-request/compare/v7...v8)

---
updated-dependencies:
- dependency-name: peter-evans/create-pull-request
  dependency-version: '8'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 09:46:09 +02:00
dependabot[bot]
62a5b8540a
chore(deps): bump actions/upload-artifact from 4 to 7 (#123)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 7.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v7)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-29 09:38:38 +02:00
ZdenekSrotyr
61f6b8d2d5
feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)
Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-29 09:18:55 +02:00
ZdenekSrotyr
6752c4a53e
fix(web): restore admin nav menu items (#122)
v13 RBAC migration nulled users.role and moved admin authority onto user_group_members. Header still gated on session.user.role == 'admin', so admin menu was hidden for everyone. Inject user['is_admin'] via is_user_admin in get_current_user; header reads session.user.is_admin.
2026-04-29 09:09:23 +02:00
ZdenekSrotyr
33b318e491
ci(release): build dev image on branch creation from main (#118)
Fixes the per-developer dev-VM workflow: paths-ignore on push skipped same-SHA branch creates, build-and-push.if was main+dispatch-only. Add create: trigger filtered to branch refs, broaden build-and-push.if, add concurrency group keyed on github.ref with cancel-in-progress to dedupe create+push collisions on new branches with code changes.
2026-04-29 08:15:30 +02:00
ZdenekSrotyr
1baadd172e
fix(ui): render shared header full-width on corporate memory pages (#117)
Move {% include '_app_header.html' %} out of .container-memory (max-width: 1000px)
in corporate_memory.html and corporate_memory_admin.html so the header spans the
viewport, matching dashboard.html. Page content stays constrained by the container.
2026-04-29 07:45:56 +02:00
PavelDo
e1108b6112
feat(memory): corporate memory v1+v1.5 + 0.15.0 (#72)
Adds corporate memory v1 (verification flywheel + contradiction detection + confidence scoring) and v1.5 (audience-based distribution + per-item privacy + admin curation). Server: GET /api/memory/bundle returns mandatory + ranked-approved items within a token budget; POST /api/memory/admin/mandate accepts an audience field gated against user_group_members; /api/memory/stats uses SQL aggregation. CLI: da sync writes received items to .claude/rules/km_*.md. Verification detector extracts knowledge candidates from session JSONL files. Auto-tagging via Haiku when ai: is configured. Adapted from the v9-era branch onto v13/v14 RBAC: _is_privileged_viewer + _effective_groups now query user_group_members JOIN user_groups; require_role(Role.KM_ADMIN) replaced with require_admin (km_admin collapsed into admin). Schema v15: knowledge_items context-engineering columns + knowledge_contradictions + session_extraction_state. Schema v16: verification_evidence. Cuts release v0.15.0 (also bundles #116 /me/debug page).
2026-04-29 07:16:22 +02:00
minasarustamyan
7a06f1a585
feat(auth): /me/debug self-only auth diagnostic page (#116)
Adds /me/debug HTML page rendering the logged-in user's own session state — decoded JWT claims (no raw token, sha256[:12] fingerprint for log correlation), group memberships with sources and bound external_id when present, resource grants effective via those memberships, and a Refetch from Google (dry-run) button that diffs a fresh fetch_user_groups call against the cached user_group_members snapshot. Gated by AGNES_DEBUG_AUTH env var (default off → 404, route existence undetectable in production). Self-only by construction: user_id is read from the validated session, never echoes raw JWT / password hash / full PAT. Tolerates v13 + v14 schemas via information_schema check on users.external_id.
2026-04-29 06:36:28 +02:00
ZdenekSrotyr
2e1dfb7553
feat(v2): claude-driven fetch primitives + 0.14.0 (#102)
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
2026-04-29 01:07:19 +02:00
ZdenekSrotyr
a222f92e70
feat(admin): server configuration editor + 0.13.0 (#107)
Adds /admin/server-config UI for editing instance.yaml from the web. Hardening: SSRF gate on data_source URLs, narrow-overlay write strategy, atomic writes, audit log with secret masking on shape changes, threading lock on read-modify-write, corrupt-overlay refusal on write side + louder log on read side, modal Promise resolution on backdrop dismiss, sentinel scrub on save (defense-in-depth client+server). Bundles Windows PowerShell wrapper from #80. Cuts release v0.13.0.
2026-04-29 00:47:23 +02:00
David Rybar
cfe5771856
feat(dev): add Windows PowerShell wrapper for local development (#80)
Adds `scripts/run-local-dev.ps1` as a sibling of the bash script for Windows operators. Same compose stack (`docker-compose.yml` + `.dev.yml` + `.local-dev.yml`), same up/down/logs subcommands, same LOCAL_DEV_GROUPS default seeding. Restores caller's working directory and LOCAL_DEV_GROUPS on every exit path (success, error, Ctrl+C). Avoids advanced-script promotion so `up -d` / `down -v` reach docker compose instead of being eaten by -Debug/-Verbose.
2026-04-28 23:59:11 +02:00
ZdenekSrotyr
5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() to
   prevent FD leaks until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00
Petr Simecek
3047f310b9
fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75)
Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been
bumped to schema_version=10 during local experimentation with a parallel
WIP branch (PR #72-style Context Engineering work). The lab v10 migration
laid down its own table set without including v9's role tables — so the
v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a
future-version-rollback and skipped its migration ladder, but ALSO
skipped the table-creation step. Every query against user_role_grants
(`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback,
every admin-gated request) then crashed with `_duckdb.CatalogException:
Table with name user_role_grants does not exist`. Symptom on agnes-dev:
HTTP 500 on /profile, admin nav vanished, /admin/* returned 403.

Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema,
unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so
existing tables stay untouched (columns + data preserved); missing
tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs
per process start). The migration block below still calls
_SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap
follow-up — left in place so the migration ladder reads chronologically.

Concrete coverage of the rebase scenario the user asked about — a
contributor switching FROM a lab future-schema branch BACK to a
released binary now boots cleanly:

- Forward rebase (older → current): unchanged, ladder runs as before.
- Same-version rebase: unchanged, _seed_core_roles tail call still
  drives doc-tweak refresh.
- Backward "lab" rebase (this fix): tables get re-materialized; if the
  DB is still on a future schema_version, _seed_core_roles tail call
  remains gated so we don't accidentally write data into a schema
  shape this binary doesn't understand. Operator can drop the v9
  schema_version manually to trigger a clean ladder re-run if they
  want the full v8→v9 backfill (what we did to recover agnes-dev).

Test: new test_split_brain_future_version_with_missing_tables_self_heals
in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose
only existing table is schema_version, runs _ensure_schema, asserts
both user_role_grants AND internal_roles AND group_mappings AND users
exist after the call, and that the schema_version row stays at 99
(future-version contract). test_future_version_is_noop docstring
updated to reflect the new self-heal pass — its only assertion (the
version-row contract) still holds unchanged.

pyproject.toml: 0.11.5 → 0.11.6.
CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton.
2026-04-28 15:51:33 +02:00
ZdenekSrotyr
d9c4a95d9e
Merge pull request #106 from keboola/ma/staging
ma/staging → main: RBAC v13 + marketplace + #81 security hardening + OSS neutralization
2026-04-28 14:56:34 +02:00
ZdenekSrotyr
0c963b55ef fix(rbac+auth): system-group PATCH accepts description; Google sync preserves memberships on empty
Two Devin-flagged regressions on the squashed PR #106 head:

1) PATCH /api/admin/groups/{id} blanket-rejected on system groups.

   The repository guard at src/repositories/user_groups.py was already
   narrowed to "rename only" by 7147bac (PR #110 follow-up), but the
   endpoint at app/api/access.py:331-343 still short-circuited with
   409 "System groups are immutable" for any mutation. A description-only
   payload like {"description": "..."} returned 409 instead of 200 even
   though the repo would have accepted it. CHANGELOG entry promised the
   fix but the code didn't match.

   Endpoint now mirrors the repo contract: 409 only when payload.name
   is set AND differs from existing name. Same-name no-op renames are
   dropped before the repo call. Description-only updates flow through.

2) Google OAuth callback wiped google_sync memberships on transient
   API failure.

   fetch_user_groups is fail-soft and returns [] for both "user has no
   groups" and "Cloud Identity API error". The callback fed that empty
   list into replace_google_sync_groups, which DELETEs all rows with
   source='google_sync' for the user then INSERTs zero — silently
   wiping every Workspace-synced membership on a hiccup.

   Callback now skips replace_google_sync_groups when group_names is
   empty and logs "preserving existing memberships". Trade-off: a user
   whose Workspace groups were genuinely cleared keeps stale memberships
   until the next non-empty sync. Admin-added rows (source='admin') were
   already protected by source-scope and are unaffected. The previous
   guard against this exact regression was test_callback_empty_groups_
   does_not_overwrite_existing in tests/test_auth_providers.py — that
   test class has been skipped since v12 (asserts users.groups JSON,
   needs rewrite for user_group_members).
2026-04-28 14:40:27 +02:00
ZdenekSrotyr
5c54320f75 chore(release): cut 0.12.0
Pre-1.0 SemVer convention: BREAKING changes land in MINOR. This release is
heavy on those — RBAC v13 schema rewrite, schema v14 FK constraints,
marketplace admin god-mode drop, internal_roles/group_mappings/user_role_grants
tables removed, exit code 2 from Keboola extractor (partial fail), Script API
admin-only, scripts/grpn/ → scripts/ops/.

CHANGELOG.md: rename Unreleased → [0.12.0] — 2026-04-28; append a fresh
empty Unreleased above so the next PR has somewhere to land.
pyproject.toml: 0.11.5 → 0.12.0.

Tag the merge commit as v0.12.0 + push the tag after merge to main.
2026-04-28 14:25:13 +02:00
Vojtech Rysanek
7147bac079 feat(rbac+marketplace): schema v14 FK + AGNES_ENABLE_TABLE_GRANTS + break-glass CLI
Follow-up to the RBAC v13 + marketplace work in the parent commit. Addresses
deferred Devin findings, gemini-flagged blockers, and adds three guard rails.

== Schema v14 — FK constraints on user_group_members + resource_grants ==
Adds DuckDB foreign-key constraints so cascade deletes can no longer leave
orphaned member / grant rows pointing at a deleted group_id (which were
relying on application-level cascades up to v13). Migration is RENAME →
CREATE-with-FK → INSERT → DROP, wrapped in BEGIN TRANSACTION so a partial
failure rolls back without leaving the DB at a half-applied schema.

== AGNES_ENABLE_TABLE_GRANTS feature flag (default off) ==
ResourceType.TABLE was shipped in the parent commit as listing-only — admins
can record grants but runtime enforcement still flows through legacy
dataset_permissions. To avoid the misleading-UX surface area, the chip is
hidden from /admin/access and POST /api/admin/grants returns 422 with the
env-var name in detail until the operator opts in. Existing TABLE rows in
resource_grants stay listable + deletable so cleanup is never blocked.

Helpers: is_resource_type_enabled(rt), enabled_resource_types().

== Break-glass admin CLI ==
`da admin break-glass <user>` adds the user to the Admin user_group with
source='system_seed' regardless of RBAC state. Bypasses authentication —
relies on filesystem access to ${DATA_DIR}/state/system.duckdb implying
host-level trust. Recovery path when the operator has locked themselves
out of /admin/access.

== Devin round-2 fixes (deferred on b4ec4c4) ==
- src/repositories/user_groups.py — narrow update() guard from blocking any
  mutation on system groups to blocking name change only. Description edits
  now pass through. Endpoint pre-check stays as defense-in-depth. Prior
  behavior surfaced as a misleading 409 'Cannot rename a system group' on
  description-only PATCH.
- app/api/access.py:delete_group — wrap cascade DELETEs + repo.delete in
  BEGIN TRANSACTION / COMMIT / ROLLBACK. Prevents orphan rows if any
  DELETE fails after the user_groups row is gone.
- app/marketplace_server/{packager,router}.py — split compute_etag_for_user()
  from build_zip(); router resolves etag first and 304-shorts before any
  file read or ZIP_DEFLATED. In-process cachetools.TTLCache (default 120s,
  env-tunable via AGNES_MARKETPLACE_ETAG_TTL, set 0 to disable).
  invalidate_etag_cache() called by sync to force re-hash on content drift.

== Tests ==
- TestTableGrantsFeatureFlag (4 cases) — endpoint exclude/include, grant
  rejection/acceptance under the flag.
- test_v12_to_v13_finalize_rollback_on_failure — destructive: monkeypatches
  _seed_system_groups to raise mid-transaction, asserts schema_version stays
  at 12, legacy tables intact, new tables empty (rollback fired). Then
  restores the real function and asserts the retry succeeds.
- test_update_system_group_description_allowed,
  test_update_system_group_same_name_no_op — repo-level coverage of the
  narrowed guard.
2026-04-28 14:25:13 +02:00
ZdenekSrotyr
e9d7af3cce feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening
This squashes 13 commits from ma/staging plus a small docstring translation
into a single coherent unit. Three workstreams.

== RBAC v13 redesign ==
- Drops core.viewer/analyst/km_admin/admin hierarchy and the
  internal_roles / group_mappings / user_role_grants / plugin_access tables.
- Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill
  wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry.
- Two authorization primitives in app.auth.access:
    require_admin                        — Admin-group god-mode
    require_resource_access(rt, "{path}") — entity-scoped grants
  Single DB lookup per request; no session cache; no implies BFS.
- /admin/access UI (single page) replaces /admin/role-mapping +
  /admin/plugin-access. CLI `da admin group/grant *` replaces
  `da admin role/mapping/grant-role/revoke-role/effective-roles`.
- ResourceType.TABLE listing-only — admins can record table grants,
  runtime enforcement still flows through legacy dataset_permissions
  (migration plan in docs/TODO-rbac-data-enforcement.md).

== Claude Code marketplace ==
- Aggregated /marketplace.zip + /marketplace.git/* (PAT-gated,
  RBAC-filtered, content-addressed cache via dulwich).
- Admin god-mode dropped on the marketplace surface — admins curate
  their own view via grants like everyone else.
- Bare-repo cache materializes per RBAC-filtered ETag; stale entries
  not pruned in this iteration (disclaimed in git_backend.py docstring).

== #81 #83 #44 security/ops hardening ==
- #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias).
- #81 Group B — Keboola extractor 3-state exit codes:
    0 success / 1 total fail / 2 PARTIAL fail
  Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary
  alerting must teach it the new partial signal.
- #81 Group C — schema v10 view_ownership; rejects silent overwrite
  of a prior connector's view name on collision.
- #81 Group D — extractor-side identifier validation.
- #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset
  + path-traversal fix.
- #44 — entire /api/scripts/* surface is admin-only (planted-script +
  sandbox-bypass risk closed).

== Web UI polish + deploy fix ==
- /admin/access: live grant-count badges (no stale snapshot revert),
  shared-header CSS link added to /catalog and /admin/{tables,permissions},
  per-resource-type colored stripes.
- docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't
  silently shadow sub-mounts and write state to the wrong disk.

== OSS vendor-neutralization (waves 1+2) ==
- scripts/grpn/ → scripts/ops/. Customer-specific identifiers
  (project IDs, internal hostnames, dev/prod VM IPs, brand names)
  replaced with placeholders across code, docs, Terraform, Caddyfile,
  OAuth probe, and planning docs. Downstream infra repos that copied
  scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must
  update the path.

== Translation ==
- src/repositories/user_groups.py::ensure_system docstring translated
  from Czech to English for codebase consistency.

Co-authored-by: Mina Rustamyan <mina@keboola.com>
2026-04-28 14:25:04 +02:00
ZdenekSrotyr
72230c3b51
fix: #81 Group C — view-name collision detection (schema v10, squashed) (#100)
Schema v10 + view_ownership table. Cross-connector view name
collisions are detected and refused with an actionable ERROR rather
than silently last-write-wins. Pre-scan reconcile releases stale
ownerships in the same rebuild as a rename — but only when ALL
sources' pre-scans succeed (transient-IO defense; partial pre-scan
skips reconcile to avoid silently stealing a name).

26/26 view collision + orchestrator tests pass.
Refs #81 Group C.
2026-04-27 22:09:49 +02:00
ZdenekSrotyr
ef74ec010c
fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99)
Closes M14 from issue #81. Keboola extractor exits 0/1/2
(success/full-fail/partial). sync.py interprets exit 2 as
PARTIAL FAILURE (data-quality alert, distinct from exit 1).

Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including
runtime mock subprocess (rc=0/1/2/124).

Refs #81 Group B.
2026-04-27 21:52:46 +02:00
ZdenekSrotyr
569cd90d75
fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97)
Closes M15 from issue #81 — SQL injection via attacker-controlled
identifiers in connectors/keboola/extractor.py and
connectors/bigquery/extractor.py.

Lifted _validate_identifier from src/orchestrator.py into a new
src/identifier_validation.py shared module (single source of truth for
both layers). Two validator policies:

- validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for
  table_name — matches the orchestrator's rebuild-time check, so dashed
  names fail fast at extraction rather than being silently dropped.
- validate_quoted_identifier (relaxed, accepts dashes/dots) for
  bucket/dataset/source_table — Keboola in.c-foo and BigQuery
  my-dataset are legitimate, just need to be safe inside `"..."`.

Both extractors skip-and-continue on unsafe rows (logged + counted in
failure stats); _extract_via_extension re-validates as defense-in-depth.

71/71 extractor + orchestrator tests pass.
Refs #81 Group D.
2026-04-27 21:46:17 +02:00
ZdenekSrotyr
23be8ad46f
fix(security): #81 Group A — orchestrator attach hardening (squashed) (#95)
Closes the C1 findings from issue #81 plus the round-3/4 follow-ups
on the read-only query path.

Both _attach_remote_extensions (rebuild path) and
_reattach_remote_extensions (query path) now apply the same hard
allowlists for extensions and token-env names, single-quote-escape
the URL, and split built-in vs community install. The CHANGELOG bullet
documents the full scope including the table_schema → table_catalog
fix that made the rebuild path a silent no-op for every connector.

New module src/orchestrator_security.py centralises the policy. Tests
in tests/test_orchestrator_remote_attach_security.py — 28/28 pass.

Refs #81.
2026-04-27 21:34:04 +02:00
ZdenekSrotyr
24e81fb671
fix(security): gate Script-API /run on admin role (#44) (#92)
* fix(security): gate Script-API /run on admin role (#44)

The AST + string-blocklist sandbox in `_execute_script` is defense-in-depth,
not a primary trust boundary. It does not block `vars()`, `type()`, or
`__class__.__bases__` introspection chains, and the string blocklist is
trivially evadable via concatenation/dunder encoding. Treat the role gate
as the actual barrier: only admin can run scripts.

- `POST /api/scripts/run` and `POST /api/scripts/{id}/run` now require admin.
- `POST /api/scripts/deploy` stays analyst-accessible (storing != executing).
- Existing /run tests retargeted to admin_token; added regression tests
  asserting analyst → 403 on both endpoints.
- CHANGELOG: BREAKING (security) bullet under Unreleased/Changed.

Closes #44.

* fix(security): admin-gate /deploy + harden sandbox blocklist (review #92)

Reviewer of PR #92 flagged three MUST-FIXes that #44 wasn't fully closed:

1. /api/scripts/deploy still accepted analyst → planted-script attack
   path (analyst plants malicious source, waits for admin to /run).
   Now: /deploy also requires admin; the entire Script API is admin-only.

2. The "Minimum (same-day)" blocklist mitigations from issue #44 weren't
   applied. Added the introspection-chain dunders that the issue PoC
   pivots through: __subclasses__, __globals__, __class__, __base__,
   __bases__, __mro__, __dict__, __code__, __builtins__. Plus `vars`
   in BLOCKED_FUNCTIONS. Deliberately NOT adding __init__ /
   __getattribute__ (substring match would flag every legit `def __init__`)
   nor `type`/`dir` (frequent in legitimate admin scripts). Documented
   the trade-off inline.

3. Tests didn't cover the actual PoC payload nor non-analyst non-admin
   roles. Added test_run_pwn_payload_blocked parametrized over the issue's
   own PoC + two equivalent variants (lambda+__globals__, __mro__
   traversal); these stay green only as long as the dunder list does.
   test_*_requires_admin tests now parametrize over (analyst, viewer,
   km_admin) so all three non-admin core roles are pinned at 403.

Conftest extension: seeded_app now exposes viewer_token and
km_admin_token as siblings to admin_token / analyst_token.

CHANGELOG bullet updated to reflect /deploy gate change and new
internal regression tests. 35/35 scripts tests pass locally.

Refs review of #92.

* fix(tests): test_security TestScriptSandbox needs admin token after #44 hardening

CI failure on PR #92 caught a missed test file. tests/test_security.py
seeded only an analyst user and used the analyst token to drive sandbox
tests. After the #44 admin-gate (deploy + run both admin-only), every
sandbox test got 403 from the role gate before the AST/string check
could run, so 'blocks os.system' / 'blocks eval' / etc. all failed.

Fix: extend the fixture to also seed an admin user and return the admin
token. Sandbox tests now reach the sandbox layer; access-control tests
further down in the module continue to use the analyst that was kept
around. 41/41 test_security.py tests pass locally.

* fix(security): #92 round-3 — gate GET /api/scripts on admin role

Devin Review caught: GET /api/scripts (app/api/scripts.py:44-51) was
left on Depends(get_current_user) when the rest of the API moved to
admin-only. ScriptRepository.list_all() does SELECT * FROM script_registry
which returns ALL columns including 'source' (the full script body).
So any authenticated user (viewer / analyst / km_admin) could read
admin-deployed scripts — leak of code that may contain credentials,
business logic, or admin-only operational details.

CHANGELOG already says 'The entire Script API is now admin-only',
which was true for /deploy, /run, /{id}/run, DELETE — just not for
GET. Now consistent: every Script endpoint requires admin.

Tests:
- New parametrized test_list_scripts_requires_admin over (analyst,
  viewer, km_admin) tokens — all assert 403.
- Updated test_list_scripts_empty in both test_scripts_api.py and
  test_api_scripts.py to use admin_token.

79 tests pass.

Refs Devin Review of #92.

* fix: cleanup unused imports, stale docstrings, and incomplete CHANGELOG

- Remove unused imports: Path, List, get_current_user (ruff F401)
- Trim docstrings to describe current behavior, not change history
- CHANGELOG now lists GET /api/scripts among admin-gated endpoints
- Remove diff-commenting inline comments from tests

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix: merge duplicate Changed sections into one per CLAUDE.md convention

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-27 21:13:56 +02:00
ZdenekSrotyr
4e4d2a39e6
chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94)
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module that is
   the customer-facing API surface in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.
2026-04-27 20:24:34 +02:00