Commit graph

21 commits

Author SHA1 Message Date
ZdenekSrotyr
5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() to
   prevent FD leaks until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00
Petr Simecek
3047f310b9
fix(db): self-heal missing tables on future-version DBs (agnes-dev incident) (#75)
Discovered when 0.11.5 deployed onto agnes-dev whose system DB had been
bumped to schema_version=10 during local experimentation with a parallel
WIP branch (PR #72-style Context Engineering work). The lab v10 migration
laid down its own table set without including v9's role tables — so the
v9 binary saw `current=10 > SCHEMA_VERSION=9`, correctly treated it as a
future-version-rollback and skipped its migration ladder, but ALSO
skipped the table-creation step. Every query against user_role_grants
(`_hydrate_legacy_role`, /profile, require_internal_role's DB fallback,
every admin-gated request) then crashed with `_duckdb.CatalogException:
Table with name user_role_grants does not exist`. Symptom on agnes-dev:
HTTP 500 on /profile, admin nav vanished, /admin/* returned 403.

Fix: hoist `conn.execute(_SYSTEM_SCHEMA)` to the TOP of _ensure_schema,
unconditional. _SYSTEM_SCHEMA is all `CREATE TABLE IF NOT EXISTS`, so
existing tables stay untouched (columns + data preserved); missing
tables get created. Idempotent, near-zero cost (a few dozen no-op DDLs
per process start). The migration block below still calls
_SYSTEM_SCHEMA when migrating; that's now the redundant-but-cheap
follow-up — left in place so the migration ladder reads chronologically.

Concrete coverage of the rebase scenario the user asked about — a
contributor switching FROM a lab future-schema branch BACK to a
released binary now boots cleanly:

- Forward rebase (older → current): unchanged, ladder runs as before.
- Same-version rebase: unchanged, _seed_core_roles tail call still
  drives doc-tweak refresh.
- Backward "lab" rebase (this fix): tables get re-materialized; if the
  DB is still on a future schema_version, _seed_core_roles tail call
  remains gated so we don't accidentally write data into a schema
  shape this binary doesn't understand. Operator can drop the v9
  schema_version manually to trigger a clean ladder re-run if they
  want the full v8→v9 backfill (what we did to recover agnes-dev).

Test: new test_split_brain_future_version_with_missing_tables_self_heals
in tests/test_db.py::TestMigrationSafety. Synthesizes a v99 DB whose
only existing table is schema_version, runs _ensure_schema, asserts
both user_role_grants AND internal_roles AND group_mappings AND users
exist after the call, and that the schema_version row stays at 99
(future-version contract). test_future_version_is_noop docstring
updated to reflect the new self-heal pass — its only assertion (the
version-row contract) still holds unchanged.

pyproject.toml: 0.11.5 → 0.11.6.
CHANGELOG.md: new [0.11.6] section under [Unreleased] skeleton.
2026-04-28 15:51:33 +02:00
ZdenekSrotyr
0c963b55ef fix(rbac+auth): system-group PATCH accepts description; Google sync preserves memberships on empty
Two Devin-flagged regressions on the squashed PR #106 head:

1) PATCH /api/admin/groups/{id} blanket-rejected on system groups.

   The repository guard at src/repositories/user_groups.py was already
   narrowed to "rename only" by 7147bac (PR #110 follow-up), but the
   endpoint at app/api/access.py:331-343 still short-circuited with
   409 "System groups are immutable" for any mutation. A description-only
   payload like {"description": "..."} returned 409 instead of 200 even
   though the repo would have accepted it. CHANGELOG entry promised the
   fix but the code didn't match.

   Endpoint now mirrors the repo contract: 409 only when payload.name
   is set AND differs from existing name. Same-name no-op renames are
   dropped before the repo call. Description-only updates flow through.

2) Google OAuth callback wiped google_sync memberships on transient
   API failure.

   fetch_user_groups is fail-soft and returns [] for both "user has no
   groups" and "Cloud Identity API error". The callback fed that empty
   list into replace_google_sync_groups, which DELETEs all rows with
   source='google_sync' for the user then INSERTs zero — silently
   wiping every Workspace-synced membership on a hiccup.

   Callback now skips replace_google_sync_groups when group_names is
   empty and logs "preserving existing memberships". Trade-off: a user
   whose Workspace groups were genuinely cleared keeps stale memberships
   until the next non-empty sync. Admin-added rows (source='admin') were
   already protected by source-scope and are unaffected. The previous
   guard against this exact regression was test_callback_empty_groups_
   does_not_overwrite_existing in tests/test_auth_providers.py — that
   test class has been skipped since v12 (asserts users.groups JSON,
   needs rewrite for user_group_members).
2026-04-28 14:40:27 +02:00
ZdenekSrotyr
5c54320f75 chore(release): cut 0.12.0
Pre-1.0 SemVer convention: BREAKING changes land in MINOR. This release is
heavy on those — RBAC v13 schema rewrite, schema v14 FK constraints,
marketplace admin god-mode drop, internal_roles/group_mappings/user_role_grants
tables removed, exit code 2 from Keboola extractor (partial fail), Script API
admin-only, scripts/grpn/ → scripts/ops/.

CHANGELOG.md: rename Unreleased → [0.12.0] — 2026-04-28; append a fresh
empty Unreleased above so the next PR has somewhere to land.
pyproject.toml: 0.11.5 → 0.12.0.

Tag the merge commit as v0.12.0 + push the tag after merge to main.
2026-04-28 14:25:13 +02:00
ZdenekSrotyr
72230c3b51
fix: #81 Group C — view-name collision detection (schema v10, squashed) (#100)
Schema v10 + view_ownership table. Cross-connector view name
collisions are detected and refused with an actionable ERROR rather
than silently last-write-wins. Pre-scan reconcile releases stale
ownerships in the same rebuild as a rename — but only when ALL
sources' pre-scans succeed (transient-IO defense; partial pre-scan
skips reconcile to avoid silently stealing a name).

26/26 view collision + orchestrator tests pass.
Refs #81 Group C.
2026-04-27 22:09:49 +02:00
ZdenekSrotyr
ef74ec010c
fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99)
Closes M14 from issue #81. Keboola extractor exits 0/1/2
(success/full-fail/partial). sync.py interprets exit 2 as
PARTIAL FAILURE (data-quality alert, distinct from exit 1).

Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including
runtime mock subprocess (rc=0/1/2/124).

Refs #81 Group B.
2026-04-27 21:52:46 +02:00
ZdenekSrotyr
569cd90d75
fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97)
Closes M15 from issue #81 — SQL injection via attacker-controlled
identifiers in connectors/keboola/extractor.py and
connectors/bigquery/extractor.py.

Lifted _validate_identifier from src/orchestrator.py into a new
src/identifier_validation.py shared module (single source of truth for
both layers). Two validator policies:

- validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for
  table_name — matches the orchestrator's rebuild-time check, so dashed
  names fail fast at extraction rather than being silently dropped.
- validate_quoted_identifier (relaxed, accepts dashes/dots) for
  bucket/dataset/source_table — Keboola in.c-foo and BigQuery
  my-dataset are legitimate, just need to be safe inside `"..."`.

Both extractors skip-and-continue on unsafe rows (logged + counted in
failure stats); _extract_via_extension re-validates as defense-in-depth.

71/71 extractor + orchestrator tests pass.
Refs #81 Group D.
2026-04-27 21:46:17 +02:00
ZdenekSrotyr
23be8ad46f
fix(security): #81 Group A — orchestrator attach hardening (squashed) (#95)
Closes the C1 findings from issue #81 plus the round-3/4 follow-ups
on the read-only query path.

Both _attach_remote_extensions (rebuild path) and
_reattach_remote_extensions (query path) now apply the same hard
allowlists for extensions and token-env names, single-quote-escape
the URL, and split built-in vs community install. The CHANGELOG bullet
documents the full scope including the table_schema → table_catalog
fix that made the rebuild path a silent no-op for every connector.

New module src/orchestrator_security.py centralises the policy. Tests
in tests/test_orchestrator_remote_attach_security.py — 28/28 pass.

Refs #81.
2026-04-27 21:34:04 +02:00
ZdenekSrotyr
24e81fb671
fix(security): gate Script-API /run on admin role (#44) (#92)
* fix(security): gate Script-API /run on admin role (#44)

The AST + string-blocklist sandbox in `_execute_script` is defense-in-depth,
not a primary trust boundary. It does not block `vars()`, `type()`, or
`__class__.__bases__` introspection chains, and the string blocklist is
trivially evadable via concatenation/dunder encoding. Treat the role gate
as the actual barrier: only admin can run scripts.

- `POST /api/scripts/run` and `POST /api/scripts/{id}/run` now require admin.
- `POST /api/scripts/deploy` stays analyst-accessible (storing != executing).
- Existing /run tests retargeted to admin_token; added regression tests
  asserting analyst → 403 on both endpoints.
- CHANGELOG: BREAKING (security) bullet under Unreleased/Changed.

Closes #44.

* fix(security): admin-gate /deploy + harden sandbox blocklist (review #92)

Reviewer of PR #92 flagged three MUST-FIXes that #44 wasn't fully closed:

1. /api/scripts/deploy still accepted analyst → planted-script attack
   path (analyst plants malicious source, waits for admin to /run).
   Now: /deploy also requires admin; the entire Script API is admin-only.

2. The "Minimum (same-day)" blocklist mitigations from issue #44 weren't
   applied. Added the introspection-chain dunders that the issue PoC
   pivots through: __subclasses__, __globals__, __class__, __base__,
   __bases__, __mro__, __dict__, __code__, __builtins__. Plus `vars`
   in BLOCKED_FUNCTIONS. Deliberately NOT adding __init__ /
   __getattribute__ (substring match would flag every legit `def __init__`)
   nor `type`/`dir` (frequent in legitimate admin scripts). Documented
   the trade-off inline.

3. Tests didn't cover the actual PoC payload nor non-analyst non-admin
   roles. Added test_run_pwn_payload_blocked parametrized over the issue's
   own PoC + two equivalent variants (lambda+__globals__, __mro__
   traversal); these stay green only as long as the dunder list does.
   test_*_requires_admin tests now parametrize over (analyst, viewer,
   km_admin) so all three non-admin core roles are pinned at 403.

Conftest extension: seeded_app now exposes viewer_token and
km_admin_token as siblings to admin_token / analyst_token.

CHANGELOG bullet updated to reflect /deploy gate change and new
internal regression tests. 35/35 scripts tests pass locally.

Refs review of #92.

* fix(tests): test_security TestScriptSandbox needs admin token after #44 hardening

CI failure on PR #92 caught a missed test file. tests/test_security.py
seeded only an analyst user and used the analyst token to drive sandbox
tests. After the #44 admin-gate (deploy + run both admin-only), every
sandbox test got 403 from the role gate before the AST/string check
could run, so 'blocks os.system' / 'blocks eval' / etc. all failed.

Fix: extend the fixture to also seed an admin user and return the admin
token. Sandbox tests now reach the sandbox layer; access-control tests
further down in the module continue to use the analyst that was kept
around. 41/41 test_security.py tests pass locally.

* fix(security): #92 round-3 — gate GET /api/scripts on admin role

Devin Review caught: GET /api/scripts (app/api/scripts.py:44-51) was
left on Depends(get_current_user) when the rest of the API moved to
admin-only. ScriptRepository.list_all() does SELECT * FROM script_registry
which returns ALL columns including 'source' (the full script body).
So any authenticated user (viewer / analyst / km_admin) could read
admin-deployed scripts — leak of code that may contain credentials,
business logic, or admin-only operational details.

CHANGELOG already says 'The entire Script API is now admin-only',
which was true for /deploy, /run, /{id}/run, DELETE — just not for
GET. Now consistent: every Script endpoint requires admin.

Tests:
- New parametrized test_list_scripts_requires_admin over (analyst,
  viewer, km_admin) tokens — all assert 403.
- Updated test_list_scripts_empty in both test_scripts_api.py and
  test_api_scripts.py to use admin_token.

79 tests pass.

Refs Devin Review of #92.

* fix: cleanup unused imports, stale docstrings, and incomplete CHANGELOG

- Remove unused imports: Path, List, get_current_user (ruff F401)
- Trim docstrings to describe current behavior, not change history
- CHANGELOG now lists GET /api/scripts among admin-gated endpoints
- Remove diff-commenting inline comments from tests

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix: merge duplicate Changed sections into one per CLAUDE.md convention

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-27 21:13:56 +02:00
ZdenekSrotyr
4e4d2a39e6
chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94)
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module that is
   the customer-facing API surface in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.
2026-04-27 20:24:34 +02:00
ZdenekSrotyr
2f783c5c0a
fix(security): close Jira webhook fail-open + path traversal (#83) (#93)
* fix(security): close Jira webhook fail-open + path traversal (#83)

Two related vulnerabilities:

1. Fail-open signature check: when JIRA_WEBHOOK_SECRET was unset,
   _verify_signature returned True and any unauthenticated POST to
   /webhooks/jira would run the full ingest pipeline. Now fail-closed —
   the handler short-circuits with 503 (operator-misconfiguration signal,
   distinct from 401 wrong-signature) when the secret is missing.

2. Path traversal via attacker-controlled issue_key: webhook payloads
   carry issue.key, which flowed unsanitized into save_issue (issues_dir /
   "{issue_key}.json"), download_attachment (attachments_dir / issue_key),
   and incremental_transform (raw_dir / "issues" / "{issue_key}.json"). A
   crafted webhook with issue.key="../../etc/passwd" could write outside
   the Jira data dir.

Defense-in-depth: new connectors/jira/validation.py exposes
is_valid_issue_key (whitelist regex ^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$) and
safe_join_under (Path.resolve() containment check). Both are enforced at
the webhook entry point AND at every filesystem boundary in the connector.

Tests:
- New tests/test_jira_validation.py — unit tests for both helpers
  (parametrized invalid keys, traversal/symlink/absolute-path cases).
- Webhook tests: test_unconfigured_secret_returns_503,
  test_path_traversal_in_issue_key_rejected (parametrized over 10 bad keys),
  test_valid_issue_key_accepted.

CHANGELOG: two CRITICAL Fixed bullets under Unreleased.

Closes #83.

* fix(security): close remaining #83 review findings — webhookEvent traversal, _handle_deletion guard, regex tightening

Reviewer of PR #93 flagged four MUST-FIXes:

1. _log_webhook_event used the attacker-controlled `webhookEvent` field
   as a filename component without sanitization. Payload with
   `webhookEvent: "../../tmp/pwn"` could escape WEBHOOK_LOG_DIR. Now:
   - non-`[A-Za-z0-9_-]` runs are replaced with `_` (dot excluded so
     `..` cannot survive sanitization as a directory component)
   - length capped at 64 chars
   - final path routed through safe_join_under
   New regression test `test_webhook_event_path_traversal_sanitized`.

2. _handle_deletion (connectors/jira/service.py:530) and
   process_webhook_event (line 487) still used raw issue_key in path
   builds. Even though the webhook handler validates upstream, the
   "defense-in-depth at every filesystem boundary" claim required these
   too. Both now run is_valid_issue_key and safe_join_under guards.

3. Regex `^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$` permitted underscores in
   project keys. Atlassian's project-key validator does not — `A_B-1`
   is rejected by Jira itself. Tightened to `[A-Z0-9]` and updated
   tests: `ABC_DEF-1` is now invalid, added Cyrillic А-1 (lookalike),
   CRLF, and oversize cases to the bad-key parametrization.

4. Existing test test_deletion_of_nonexistent_issue_returns_true used
   `PROJ-NOEXIST` which is not a real Jira key shape. Updated to
   `PROJ-99999`. The test still exercises the same intent (deletion of
   issue with no local file is idempotent).

73/73 jira tests pass locally (test_jira_webhooks + test_jira_validation
+ test_jira_service + test_jira_service_full + test_jira_incremental).

CHANGELOG updated to document the regex tightening and the new
webhookEvent sanitization.

Refs review of #93.

* fix(tests): test_journey_jira tests assumed fail-open before #83 fix

CI failure on PR #93 caught two journey tests that pinned the OLD
fail-open contract:

- test_webhook_with_no_secret_configured_accepted asserted 200 when
  JIRA_WEBHOOK_SECRET was unset. After the #83 fix that's a 503
  (operator misconfig). Renamed to _refused and flipped the assertion.

- test_webhook_empty_payload_rejected didn't set the secret, so the
  503 short-circuit fired before the empty-payload 400 could. Set
  JIRA_WEBHOOK_SECRET in the patched Config so the test exercises the
  intended path.

56/56 jira journey + webhook + validation tests now pass.

* fix(security): #93 round-3 — webhook fallback format + save_issue early validation

Devin Review caught two real findings:

1. Webhook handler regression: the round-2 fix extracted issue_key only
   from event_data['issue']['key'], but process_webhook_event has long
   supported a fallback 'issue_key' top-level field for certain Jira
   event formats (e.g. delete events historically). The handler now
   blocks those events with 400 before they reach the service layer.
   Fix: mirror process_webhook_event's fallback in the handler — try
   issue.key first, fall through to event_data.get('issue_key') when
   empty. is_valid_issue_key still validates whichever source provided
   the key.

2. save_issue defense-in-depth was incomplete: is_valid_issue_key ran
   AFTER fetch_remote_links and fetch_sla_fields had already used the
   unvalidated issue_key in HTTP URL construction
   ({base_url}/issue/{issue_key}/remotelink etc.). A future internal
   caller invoking save_issue directly with attacker-controlled input
   could trigger outbound requests with a malicious path component
   (limited SSRF / URL-path manipulation against the Jira API server).
   Fix: move the is_valid_issue_key check to immediately after the
   null guard, before any HTTP request or filesystem op. Webhook layer
   still validates upstream, this is the second layer.

66 jira tests pass.

Refs Devin Review of #93.

* fix(changelog): #93 round-4 — add BREAKING marker to fail-closed bullet

Devin Review caught: the JIRA_WEBHOOK_SECRET fail-closed change is a
behavior change for operators (response code 503 vs old 200) that
existing alerting may treat differently. Per CLAUDE.md changelog
discipline rule, operators grep for **BREAKING** before bumping the
pin. Added the marker + a short note on what action operators need
to take (set the env var if they haven't).

Refs Devin Review of #93.

* fix: #93 round-5 — null-issue crash + comment drift

Devin Review caught two findings on the round-4 commit:

1. Pre-existing crash on null issue field: a webhook payload with
   {"issue": null} (rather than omitting the key) caused
   event_data.get("issue", {}) to return None, then issue.get("key")
   raised AttributeError → unhandled 500. Pre-existing but reachable.
   Fix: 'event_data.get("issue") or {}' normalises None to {}, then
   the existing fallback / validation path returns 400 cleanly.
   New regression test test_null_issue_field_does_not_crash.

2. Inline comment drift: the comment at line 77 documented the allowed
   character class as [A-Za-z0-9._-] (with dot) but the regex at line 27
   excludes dot deliberately (so '..' cannot survive sanitization).
   Fixed the comment to match.

52 jira tests pass.

Refs Devin Review of #93 round 5.

* fix: #93 round-6 — process_webhook_event also normalises null issue field

Devin Review caught: the webhook handler at app/api/jira_webhooks.py
correctly handles {"issue": null} via 'event_data.get("issue") or {}',
but process_webhook_event at connectors/jira/service.py:509 still
used the bare 'event_data.get("issue", {})' which returns None on
explicit null. Internal callers (anything that invokes
process_webhook_event without going through the HTTP handler) would
hit the same AttributeError the round-5 fix closed at the handler
layer. Same one-line fix.

32 jira tests pass.

Refs Devin Review of #93 round 5.

* fix: #93 round-7 — issue-key regex uses [0-9] not \d

Devin Review caught: Python 3's \d matches any Unicode decimal digit
(Arabic-Indic ٣, Bengali ৩, Devanagari ३, …). A key like TEST-٣ would
pass the regex even though it's not a valid Jira input. Tightened to
[0-9] (ASCII only).

Added three Unicode-digit cases to the bad-key parametrization in
test_jira_validation.py to lock in the contract.

Refs Devin Review of #93 round 6.

* fix: #93 round-8 — use \\Z anchor not $ in issue-key regex

Devin Review caught: Python's $ anchor matches before a trailing \\n,
so re.match('…$', 'TEST-1\\n') returns a match. is_valid_issue_key
returned True for CRLF-injected keys. \\Z is hard end-of-string and
closes that bypass.

Manual verification:
  is_valid_issue_key('TEST-1\\n') → False (was True before fix)
  is_valid_issue_key('TEST-1\\r\\n') → False
  is_valid_issue_key('TEST-1') → True

Refs Devin Review of #93 round 7.

* docs: #93 round-9 — CHANGELOG regex matches implementation
2026-04-27 19:53:55 +02:00
Petr Simecek
2dfb246996
release(0.11.5): post-merge follow-up — Devin review fixes + authlib warning silenced (#74)
Cuts 0.11.5 with all the [Unreleased] bullets that landed on top of PR #73
between commit a899877 (the original "v0.11.4" tag in the chain) and the
final merge commit on main. No new public-API surface; the user-visible
payoff is that v8→v9-migrated installations work end-to-end (login flows,
GET /api/users, admin nav, the new role-management REST API and its
last-admin protection) and `make local-dev` startup is finally quiet.

Bullets covered (full text in CHANGELOG.md [0.11.5]):
- _hydrate_legacy_role re-resolves from grants on every request — fixes
  privilege-retention after grant revoke via the role-management API.
- Dev-bypass + OAuth callback now pass user_id to resolve_internal_roles
  so direct grants land in the session cache (not the DB-fallback path).
- GET /api/users hydrates user dicts before Pydantic validation
  (HTTP 500 on every migrated install) + same fix for update/delete
  paths so last-admin protection triggers on migrated admins.
- Scheduler stopped spamming POST /auth/token 401 — the auto-fetch
  fallback was always broken; SCHEDULER_API_TOKEN is now the only path.
- POST /auth/token / Google OAuth / password / email-magic-link all
  hydrate user["role"] before issuing the JWT (Pydantic 500 + wrong
  token payload). New TestAuthLoginFlowsPostMigration regression class.
- docs/RBAC.md no longer documents the non-existent implies= keyword
  on register_internal_role.
- _seed_core_roles now actually runs on every connect (the docstring
  was lying — only ran during fresh install + v8→v9). New
  TestSeedCoreRolesSafetyNet regression class.

This commit also adds:
- AuthlibDeprecationWarning suppression at app/main.py top — upstream-
  internal forward-compat note from authlib._joserfc_helpers, not
  actionable on our side. Filter is targeted by class (with a
  message-based fallback) so other DeprecationWarnings remain visible.
- pyproject.toml version: 0.11.4 → 0.11.5.
- CHANGELOG.md: [Unreleased] → [0.11.5] — 2026-04-27, new empty
  [Unreleased] skeleton appended for the next PR to land on.

Tag v0.11.5 follows; keboola-deploy-v0.11.5 tag triggers the
keboola-deploy.yml workflow for agnes-dev.keboola.com.
2026-04-27 02:32:18 +02:00
Petr Simecek
83ced81966
feat(auth): unified role management — UI + REST API + CLI + schema v9 (v0.11.4) (#73)
* feat(auth): v9 schema — unified role management foundation (WIP)

Tasks 1-5, 10 of the role-management-complete plan. Foundation only,
follow-up commits add REST API, CLI, UI, and tests.

Schema v9:
- user_role_grants table: direct user → internal_role mapping
  (complementary to group_mappings). Drives PAT/headless auth and
  persists across sessions. Source field tracks 'direct' vs auto-seed.
- internal_roles.implies (JSON): transitive role hierarchy. core.admin
  implies core.km_admin → core.analyst → core.viewer. Resolver does BFS
  expand at lookup time.
- internal_roles.is_core (BOOL): distinguishes seeded core.* hierarchy
  from module-registered roles. UI renders them differently.
- v8→v9 migration: ADD COLUMN, CREATE TABLE, _seed_core_roles +
  _backfill_users_role_to_grants, then NULL legacy users.role values.
  DuckDB FK constraint blocks DROP COLUMN — sloupec zůstává jako
  deprecated artifact (UserRepository ignoruje), fyzický drop deferred.

Resolver:
- Regex extended to allow dotted namespace (core.admin,
  context_engineering.admin), max 64 chars total.
- expand_implies(role_keys, conn): BFS over implies JSON column.
- resolve_internal_roles signature gains optional user_id parameter;
  unions group-mapping resolution with user_role_grants direct grants
  before implies expansion.

require_internal_role:
- Two-path resolution: session cache (OAuth) → DB grants (PAT/headless
  fallback). PAT clients now legitimately satisfy gates without the
  OAuth round-trip, fixing the v8 limitation where every PAT-callable
  admin endpoint needed require_role(Role.ADMIN) instead of
  require_internal_role(...).

Backward-compat:
- require_role(Role.X) and require_admin become thin wrappers over
  require_internal_role(f"core.{role}"). Implies hierarchy preserves the
  legacy "at least this level" semantics automatically — no per-level
  comparison code needed.
- src/rbac.py helpers (is_admin, has_role, get_user_role,
  set_user_role, can_access_table, get_accessible_tables) all read from
  the resolver via _get_internal_role_keys.
- UserRepository.create() and update() now mirror role changes into
  user_role_grants via _grant_core_role helper. Preserves API while
  making the new table the source of truth.
- UserRepository.delete() pre-deletes user_role_grants rows
  (FK cascade — DuckDB doesn't auto-cascade).
- count_admins() reads user_role_grants ⨝ internal_roles instead of the
  now-NULL users.role column.

First consumer:
- app/api/admin.py module-level docstring documents the v9 pattern for
  future module authors. Existing require_role(Role.ADMIN) callsites
  flow through the wrapper; no behavior change for OAuth callers, and
  PAT callers gain access via direct grants.

Tests: full suite green (1396 passed, 6 skipped). Existing tests
exercise the new pathway transparently because UserRepository.create
auto-grants. New test_pat_caller_with_direct_grant_passes pins the
PAT-aware contract.

Schema: v9 (was v8). pyproject.toml + CHANGELOG bump deferred to the
final PR-prep commit.

* feat(auth): role management complete — REST API + CLI + UI + docs (v0.11.4)

Sjednocuje legacy users.role enum s v8 internal-roles foundation pod jeden
model s implies hierarchií, dodává admin UI + REST API + CLI pro správu
group mappings i přímých user grants, a dělá require_internal_role
PAT-aware tak, aby admin endpointy fungovaly uniformly napříč OAuth
i headless callery.

REST API (app/api/role_management.py, +496 LOC):
- 8 endpointů pod /api/admin: internal-roles list, group-mappings CRUD,
  users/{id}/role-grants CRUD, users/{id}/effective-roles debug.
- Všechny gated require_internal_role("core.admin"). Audit-log na každé
  mutaci (role_mapping.created/deleted, role_grant.created/deleted).
- Last-admin protection: refuse to delete the final core.admin grant
  (mirrors users.py:count_admins protection).
- Nový UserRoleGrantsRepository v src/repositories/user_role_grants.py.

CLI (cli/commands/admin.py extension, +258 LOC):
- da admin role list / show <key>
- da admin mapping list / create <group-id> <role-key> / delete <id>
- da admin grant-role <email> <role-key>
- da admin revoke-role <email> <role-key>
- da admin effective-roles <email>
- Všechno přes typer + PAT auth, --json flag, response-shape tolerantní.

UI (admin_role_mapping.html + admin_user_detail.html + nav + user list):
- Nová stránka /admin/role-mapping: internal_roles read-only table +
  group_mappings table with create/delete forms.
- Nová stránka /admin/users/{id}: core role single-select + capabilities
  multi-checkbox + effective-roles debug (direct + group + expanded).
- Existing user list dostává "Detail" link na novou stránku.
- Nav link na /admin/role-mapping.

Tests: +85 nových testů přes 4 nové soubory:
- test_schema_v9_migration.py (8) — fresh install + v8→v9 backfill +
  legacy column NULL semantics + unknown-role fallback + invariants.
- test_api_role_management.py (33) — všech 8 endpointů, happy + error
  paths, audit-log assertions, last-admin protection.
- test_cli_admin_role.py (25 + 1 conditional) — typer subcommands,
  text + json output, PAT integration smoke.
- test_admin_role_mapping_ui.py (9) + test_admin_user_capabilities_ui.py (10)
  — page rendering, auth gating, form contracts, JS hooks.
Full suite: 1482 passed, 6 skipped (was 1396 → +86, žádné regrese).

Docs:
- docs/internal-roles.md kompletní rewrite — odstranil "no UI yet",
  přidal hierarchy diagram, dual-path resolution, dotted-namespace
  convention, admin workflow přes UI/CLI/REST, refresh semantics
  for group mappings vs direct grants, migration notes.
- CLAUDE.md schema v8 → v9.
- CHANGELOG.md [0.11.4] s BREAKING marker pro users.role NULL
  semantics + complete Added/Changed/Removed/Internal sekce.
- pyproject.toml: 0.11.3 → 0.11.4.

Sequencing: po mergi tohoto PR Pabu rebasuje pabu/local-dev (PR #72)
na main, jeho schema migrations se posouvají z v9/v10/v11 na v10/v11/v12.

Implementation breakdown:
- Sequential (já): foundation tasks — schema v9, resolver, PAT-aware
  require_internal_role, backward-compat wrappers, rbac refactor,
  UserRepository auto-grant.
- Parallel sub-agents (3 worktrees, ~10 min): REST API, CLI, UI.
- Sequential (já): integrace, docs/CHANGELOG/version, schema tests,
  fullsuite verification.

* fix(auth): address Devin review on PR #73 — three regressions

Three concrete bugs caught in Devin's PR review, all fixed in this commit.

1. **users.role hydration on read** (the big one):
   v8→v9 migration NULLs users.role for every existing user, but a long
   tail of read sites still inspect user["role"] directly:
   - app/web/templates/_app_header.html:15 — admin nav gate
   - app/web/templates/_app_header.html:36-37 — role badge in dropdown
   - app/web/router.py:319-321 — UserInfo.is_admin/is_analyst/is_privileged
   - app/web/router.py:489 — corporate memory is_km_admin
   - app/api/catalog.py:54 — admin "see all tables" bypass
   - app/api/sync.py:215 — admin "see all sync states" bypass

   Without a fix, every existing admin loses the entire admin nav (and
   API admin bypasses) immediately after upgrade — a serious regression.

   Fix: new helper _hydrate_legacy_role() in app/auth/dependencies.py
   maps the highest-level core.* grant back into user["role"] as the
   legacy enum string. Called from get_current_user() on both auth paths
   (LOCAL_DEV_MODE + JWT/PAT). Idempotent — skips when role is already
   populated. Net effect: every pre-v9 callsite keeps working transparently
   for both OAuth and PAT callers, with one extra DB round-trip per
   authenticated request (same cost as the existing PAT-aware
   require_internal_role fallback).

   3 regression tests in tests/test_schema_v9_migration.py:
   - test_hydration_recovers_role_from_user_role_grants
   - test_hydration_returns_highest_grant (multi-grant → highest wins)
   - test_hydration_falls_back_to_viewer_when_no_grants (safe fallback)

2. **CLI effective-roles TypeError**:
   API returns direct/group as List[Dict] (RoleGrantResponse-shaped),
   but the CLI did ', '.join(direct) which raises TypeError on dicts.
   Tests masked it because mocks used bare string lists. Replaced
   raw .join() with a _names() helper that extracts role_key from
   each item, falling back to str() for legacy mock shapes.

3. **UI template field-name mismatch**:
   admin_user_detail.html JS reads data.groups but the API serializes
   the field as group (singular, per EffectiveRolesResponse pydantic).
   Currently benign because the API always returns group:[], but the
   field would silently disappear once the group-derived view is wired
   up. Added data.group as the primary lookup, kept the legacy aliases
   for shape-drift tolerance.

Full suite: 1485 passed (was 1482, +3 hydration tests), 6 skipped, no
regressions.

* fix(auth): Devin review #2 + UX self-service + RBAC docs rename

Three threads landed in one commit because they share the same
auth/role surface and CHANGELOG entry.

Devin review #73 second round (2 actionable findings):

- _hydrate_legacy_role no longer short-circuits on truthy users.role.
  The role-management endpoints (POST/DELETE /api/admin/users/{id}/
  role-grants + the changeCoreRole UI flow) only mutate
  user_role_grants — they don't update the legacy column. The early
  return trusted that stale value, so a user downgraded via the new
  REST/UI kept role="admin" in their dict on subsequent requests,
  which fooled _is_admin_user_dict (src/rbac.py) and the catalog/sync
  admin-bypass short-circuits into retaining elevated table access
  even though require_internal_role correctly denied the API gates.
  Always re-resolves now, making user_role_grants the single source
  of truth on every authenticated request. Cost: one DB round-trip
  per request — same as the existing PAT-aware fallback. Pinned by
  test_hydration_ignores_stale_legacy_role_after_grant_revoke.

- Dev-bypass (app/auth/dependencies.py) and OAuth callback
  (app/auth/providers/google.py) now pass user_id to
  resolve_internal_roles so direct grants land in
  session["internal_roles"] alongside group-mapped roles. Pre-fix,
  every admin-gated request fell through to the per-request DB
  fallback inside require_internal_role and the dev-bypass log line
  read "resolved 0 internal role(s)" for an obviously-admin user.
  test_session_internal_roles_populated updated to assert union.

User-visible UX (also addresses local-test feedback):

- HTTP 500 on /admin/users post-v8→v9 migration — UserResponse.role
  is required str, but legacy users.role was NULL-ed by the
  migration. _to_response in app/api/users.py now routes every dict
  through _hydrate_legacy_role; same fix lifts the silent no-op of
  last-admin protection in update_user/delete_user (the role-equality
  short-circuits would skip the count_admins guard for migrated
  admins). Three regression tests under TestAPIUsersPostMigration.

- /profile is now a real self-service detail page for *every*
  signed-in user (not just admins). Three new server-side sections:
  Effective roles (resolver output as chip cloud), Direct grants
  (rows in user_role_grants with source label), Roles via groups
  (which Cloud Identity / dev group grants which role for the
  current user). Non-admins finally see *why* a feature is or isn't
  accessible. Admins additionally see a deep-link to
  /admin/users/{id} for editing their own grants.

- /admin/role-mapping group-id picker. New "Known groups" panel
  above the create form: clickable chips for the calling admin's
  own session.google_groups (tagged "your group") merged with
  external_group_ids already used in existing mappings (tagged
  "already mapped"). Click a chip → fills the form. Empty-state
  copy points operators at LOCAL_DEV_GROUPS / Google sign-in
  instead of leaving them to guess Cloud Identity opaque IDs from
  memory.

Operational fixes:

- Scheduler log-noise: every cron tick produced a
  POST /auth/token 401 because the auto-fetch fallback called the
  endpoint with just an email (no password) and silently fell
  through. Removed the broken path entirely. Operators set
  SCHEDULER_API_TOKEN (long-lived PAT) in production; in
  LOCAL_DEV_MODE the dev-bypass auto-authenticates the un-tokenized
  request, so jobs continue to work.

Docs:

- docs/internal-roles.md → docs/RBAC.md (git mv preserves history).
  Standard industry term, more discoverable for engineers grepping
  for RBAC in a new repo. Restructured: Quickstart-by-role
  (operator / end-user / module author), step-by-step
  Module-author workflow with code examples (register key, gate
  endpoint, declare implies, write contract test), naming pitfalls,
  refresh semantics. CLAUDE.md gets a new
  "Extensibility → RBAC" section pointing contributors at the doc
  before they add gated endpoints. Cross-refs in app/api/admin.py
  + tests/test_role_resolver.py updated.

Tests: 293 in the auth/role/scheduler/UI test set passed, 0 regressions.

* fix(auth): Devin review #3 — login flows + RBAC docs

Two new findings on commit 7d1c048, both real and addressed.

Finding 1 (BUG, HTTP 500): every auth login flow loaded users via
UserRepository.get_by_email and passed user["role"] straight to
create_access_token, Pydantic response models, and _set_login_cookie
without going through _hydrate_legacy_role. Post-v9 the legacy column
is NULL for migrated users, and TokenResponse.role is a required str —
so POST /auth/token raised ValidationError → HTTP 500 for any v8-admin
trying to log in via password. Same root cause produced non-crashing
but semantically wrong JWTs (role: null) from Google OAuth, password
web flows, and email magic-link verification.

Fix: hydrate inline in every login flow before reading user["role"]:
- app/auth/router.py — POST /auth/token (the crash site)
- app/auth/providers/google.py — OAuth callback (was just stale JWT)
- app/auth/providers/password.py — 5 flows: JSON login, web login,
  JSON setup, web reset confirm, web setup confirm
- app/auth/providers/email.py — centralized in _consume_token,
  covers both /verify endpoints

New regression class TestAuthLoginFlowsPostMigration pins both the
no-crash and the correct-role contracts for all four legacy levels
(viewer/analyst/km_admin/admin) on POST /auth/token.

Finding 2 (DOCS): docs/RBAC.md showed register_internal_role() being
called with implies=[...], but the function signature is (key, *,
display_name, description, owner_module). A module author copying the
example would TypeError at import time. The implies field on
internal_roles IS honored at runtime by expand_implies, but the
registry-side write path (register_internal_role + InternalRoleSpec +
sync_registered_roles_to_db) doesn't exist yet — implies is currently
seeded only for the core.* hierarchy via _seed_core_roles in src/db.py.

Rewrote the Implies hierarchy and Module-author workflow sections to
document what's actually supported in 0.11.4 and what a future change
would need to add. The "for cross-module hierarchies, register each
level + grant both" pattern works today.

Tests: 322 in the auth/role/scheduler/UI/password test set passed,
0 regressions.

* fix(db): _seed_core_roles actually runs on every connect (Devin review #4)

Devin flagged that the docstring on `_seed_core_roles` promised per-connect
execution as a safety net for accidental DELETEs and in-code seed changes,
but the only call sites lived inside `if current < SCHEMA_VERSION:` — so
once a DB was on v9 the function never ran again, and the docstring lied.

Picked option (b) from the review (actually call it on every startup) over
option (a) (fix the docstring) because the safety net is genuinely useful:
- recovery from accidental admin DELETE on internal_roles,
- in-code _CORE_ROLES_SEED tweaks (display_name/description/implies)
  ship without a manual SQL deploy,
- fresh installs and migrations stop needing their own seed call sites.

Tail call gated by `get_schema_version(conn) <= SCHEMA_VERSION` so the
future-version-is-noop rollback contract still holds — a v9 binary won't
touch a DB that's been upgraded past v9.

Test coverage: new TestSeedCoreRolesSafetyNet class (3 tests) pins the
three contracts — deleted row re-seeds, mutated display_name re-syncs
from in-code seed, applied_at on schema_version doesn't churn on
already-current DBs. Existing TestMigrationSafety::test_future_version_is_noop
still passes (verified against the gating logic).
2026-04-27 02:23:01 +02:00
Petr Simecek
6c36b26979
release(0.11.3): internal roles + external→internal group mapping (foundation) (#71)
* feat(auth): internal roles + external→internal group mapping (foundation)

Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.

Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
  — app-defined capabilities like 'context_admin'. Modules self-register at
  import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
  — admin-managed bindings, UNIQUE(external_group_id, internal_role_id).

app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
  Module-author entry point. lower_snake_case key, immutable, validated.
  Same key + same fields = no-op (re-import safe); same key + different
  fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
  keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
  Sorted, deduplicated role-key list. Plugged into google_callback +
  dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
  session.internal_roles; 403 with explicit message when missing.

Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.

21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)

* fix(auth): polish review feedback — first-request dev populate + PAT doc

Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:

- Dev-bypass populates session["internal_roles"] on the first request
  after sign-in, not just when external groups change. The previous
  guard only resolved when groups_changed=True, which left a hole for
  the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
  current=None, neither write branch fires, internal_roles stays
  unset, and require_internal_role then 403s with no roles to check
  against. The OAuth callback writes session["internal_roles"]
  unconditionally on sign-in (even []); dev-bypass now matches that
  semantics. Adds a single-pass populate gated on the key being
  absent from the session, so subsequent same-state requests still
  no-op (cheap session lookup, no resolver call).

- Document that internal roles are session-scoped and PAT/headless
  clients will get 403 from any require_internal_role(...) endpoint.
  Same constraint already applies to session.google_groups (PAT JWTs
  deliberately don't snapshot group memberships — they could change
  after issuance with no way to re-sign), but the doc didn't surface
  this — an operator pointing a CLI at a role-gated endpoint would
  see 403 with no clue why. New "PAT and headless requests" section
  spells out the constraint, the rationale, and the three escape
  valves (use users.role for the gate; route through OAuth; wait for
  the planned `da admin grant-role` CLI helper).

54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).

* release(0.11.3): cut release for the internal-roles foundation

Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.

The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.

* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening

Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.

User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
  from "signed in but missing role" and surfaces a PAT-specific 403 detail
  in the first case ("This endpoint needs an interactive (OAuth) session
  — Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
  "force re-resolve now" lever for users that can't be made to log out.

Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
  dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
  the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
  state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
  stray import path in production can't drop the registered capabilities.

Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
  contract after a mid-session mapping revoke (pin the documented
  limitation), PAT 403 detail wording, OAuth pipeline data flow from
  external groups to internal_roles, and the dev-bypass empty-list
  fallback when the resolver raises.

CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 23:49:10 +02:00
Petr Simecek
1c18cdf15f
release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)
* feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS

LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups
empty, so group-aware UI/code paths can't be exercised on localhost without
a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array
matching the production {id, name} shape) populates the session on every
dev-bypass request — same structure the OAuth callback writes, so mock and
prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise
on PAT/CLI requests; malformed input falls back to [] with a WARNING so
the dev mock never breaks the dev flow.

* refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate

Three small follow-ups on the same dev-mock vector before merge:

- Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs
  in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot
  instead of silently logging on the first authenticated request, where
  it's easy to miss.
- Cache the parsed result single-slot, keyed by the raw env-string. Avoids
  re-parsing JSON on every authenticated request without test-isolation
  surprises — when the env value changes, the key changes and the cache
  transparently rebuilds.
- Stop mutating the parsed-input dicts (item.setdefault → spread-merge)
  so the cached list stays a fresh value on every rebuild.
- Replace the try/except guard around request.session with hasattr —
  SessionMiddleware is always registered, the silent except was paranoid.

Tests grow by a direct session-cookie inspection (decoupled from the
profile template) and three startup-banner log assertions.

* fix(auth): drop fragile session-decoder test + actually skip empty-target write

Two follow-ups on the LOCAL_DEV_GROUPS feature before merge:

- Drop test_session_holds_mocked_groups_directly. It manually decoded the
  signed session cookie via TimestampSigner + base64, hardcoding both the
  Starlette session-cookie format and the 14-day max_age. Starlette has
  changed its session encoding before (URLSafeTimedSerializer pre-0.20)
  and would do so again silently — the test would fail with a cryptic
  BadSignature, not a clear "mock is broken" signal. The remaining
  test_dev_user_sees_mocked_groups_on_profile already covers the same
  observable signal (mocked groups in /profile body) without coupling to
  Starlette internals.

- Actually skip the session write when target_groups is empty. The previous
  comment claimed compare-then-write avoided spurious Set-Cookie noise on
  PAT/CLI requests, but on those requests session.get("google_groups") is
  None and target is [], so None != [] always evaluates True and the write
  fired anyway, marking the session dirty and re-issuing Set-Cookie on
  every request. Adding `target_groups and ...` to the guard makes the
  comment honest: empty mock now genuinely no-ops, stable browser sessions
  still skip via value-equality, and the only remaining write is the one
  that actually changes state.

33 auth tests still pass locally.

* fix(auth): match production's always-write semantics for stale dev groups

Devin code-review finding on PR #70: my earlier `target_groups and ...`
short-circuit silently diverged from the production OAuth callback. In
app/auth/providers/google.py:189-194 the callback always writes
session.google_groups on each login — including [] on failure or empty
token — so the session always reflects authoritative current state. The
mock should match.

Failure mode the previous guard left open: a developer sets
LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed
cookie, then the developer unsets the env var and reloads. target → [],
session.get → [{...}], `if target_groups and ...` is False, no write,
stale groups stay in the browser session indefinitely. Mock now lies
about state until logout.

Fix splits the guard:
- target_groups truthy + value-changed → write the new mock (existing path)
- target_groups falsy + non-empty stored → write [] to clear stale state
- otherwise no-op (target [] + stored None/[]: no transition to record)

PAT/CLI requests with no prior session still take the no-op path
(target=[], session.get → None which is falsy), so the original goal of
suppressing spurious Set-Cookie noise on token traffic is preserved.

Tests already cover the populated and unset paths; the new clear-stale
branch is correct by construction (production has the same shape) and
the rare manual reset workflow.

* release(0.11.2): default mocked groups in make local-dev + docs/local-development.md

Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience
follow-up: every `make local-dev` now boots with two sensible default
mocked groups (Local Dev Engineers + Local Dev Admins on example.com),
so /profile and group-aware code paths render something realistic
without the operator having to discover and set LOCAL_DEV_GROUPS.

Layered so the default lives in the workflow, not the contract:

- scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":="
  syntax — only sets the var when the operator hasn't already.
  Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable:
  LOCAL_DEV_GROUPS= make local-dev.
- docker-compose.local-dev.yml swaps the commented JSON example for
  a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the
  shell, the compose file just propagates it. Operators running
  `docker compose up` directly without the wrapper script get an
  empty mock (correct: they didn't opt into the make-driven defaults).
- Makefile help line mentions the mocked groups so the behavior is
  visible without grepping.

New docs/local-development.md consolidates dev-onboarding instructions
that were previously scattered across docker-compose.local-dev.yml
inline comments, docs/auth-groups.md "Local-dev mock" section, the
Makefile help text, and CLAUDE.md "First-Time Setup". Single page now
covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking
controls + verification, what is *not* mocked (Cloud Identity, real
OAuth, admin Workspace permissions), and the safety rails that keep
the dev shortcuts off production.

Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts
[Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased]
skeleton.

* fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion

Reported by an operator running `make local-dev` against the freshly
released 0.11.2 — the LOCAL_DEV_MODE banner showed:

    LOCAL_DEV_GROUPS is not valid JSON, ignoring:
    Expecting ',' delimiter: line 1 column 70 (char 69)
    LOCAL_DEV_GROUPS is set but produced no valid groups —
    check the WARNING above for the parse error.

Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter
expansion. Bash matches `}` to close the expansion at the *first* `}`
encountered in the body, regardless of context — even one inside a
nested JSON object literal. The two-element JSON array was therefore
truncated to the first group's closing brace, leaving an unparseable
fragment:

    [{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers"

There is no escaping syntax for `}` inside parameter expansion (the
backslash escapes I had only escaped the quotes — `}` reaches bash
literally). Fix: hold the default in a single-quoted variable and
reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`.
The variable's value is opaque to the expansion — no `}` matching
inside it — so the JSON survives intact. Verified with `python -m json`:

    parsed OK: 2 groups: ['local-dev-engineers@example.com',
                          'local-dev-admins@example.com']

Operators on a running 0.11.2 stack: `make local-dev-down && make
local-dev` to pick up the corrected default.

* fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link

Two follow-ups from a Devin code-review pass on PR #70:

- run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to
  ${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form
  substitutes the default when the variable is unset OR set-but-empty,
  silently overwriting the documented disable knob. Three places
  promise this works — docs/local-development.md, the CHANGELOG entry,
  and the script's own comment — so the bug was an operator-facing
  lie, not just an implementation detail. The bare - form only
  substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now
  reaches the Python parser as "" and short-circuits to []. Verified
  with both empty and unset shells.

- CHANGELOG.md: add the [0.11.2] link reference at the bottom.
  Keep-a-Changelog convention is to mirror every version heading
  with a release-tag link in the footer; the 0.11.2 heading was
  missing its counterpart, breaking the Markdown link rendering on
  GitHub.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 16:48:55 +02:00
Petr Simecek
78ed5b31fe
fix(tests): refresh nightly docker-e2e asserts after auth + health refactors (#69)
Two assertions in the docker-marker test files had drifted from the live API
and were only caught by the scheduled nightly CI job (run #24947963804,
2026-04-26 04:12 UTC):

- tests/test_docker_full.py::test_app_returns_html_on_root expected 200 on
  GET / — but app/web/router.py:189-193 always returns 302 (to /dashboard
  for authenticated users, /login otherwise) since the auth middleware
  landed. Updated to use follow_redirects=False and assert 302 + location.

- tests/test_e2e_docker.py::TestDockerHealth::test_health_has_duckdb read
  data["checks"]["duckdb"|"database"] — but the health payload shape is
  {"services": {"duckdb_state": ..., "data": ..., "users": ...}} and has
  been since app/api/health.py was last refactored. Updated to read
  services["duckdb_state"]["status"] — same pattern used by the (passing)
  tests/test_api.py::TestHealth suite, so the two test layers now agree.

Both fixes are test-only; no application behavior changes.
2026-04-26 16:12:20 +02:00
Petr Simecek
2099bb816e
release(0.11.1): hotfix the missed CADDY_TLS passthrough + changelog discipline (#67)
Patch release containing the two follow-up changes from 0.11.0:

- Caddy CADDY_TLS env passthrough in docker-compose.yml (#55) — should
  have shipped with #52 but the first PR got accidentally closed before
  merge. Without this fix Caddy ignores .env CADDY_TLS and crash-loops
  on any LE / internal-CA deployment.
- CLAUDE.md changelog discipline (#59) — every PR touching user-visible
  behavior must update CHANGELOG.md under [Unreleased] in the same PR.

The discipline rule itself caused this release to exist: writing the
[Unreleased] entry made the missed fix obvious, which is exactly the
feedback loop the rule is supposed to create.
2026-04-26 01:52:08 +02:00
Petr Simecek
864a245acf
fix(deploy): pass CADDY_TLS through to caddy container (#55)
* fix(deploy): pass CADDY_TLS through to caddy container

PR #52 added the {$CADDY_TLS:default} substitution to the Caddyfile but
forgot to expose CADDY_TLS to the caddy service in docker-compose.yml.
Result: Caddyfile substitution falls back to the default
(`tls /certs/fullchain.pem /certs/privkey.pem`) regardless of what the
operator wrote into .env, and Caddy crash-loops with "open
/certs/fullchain.pem: no such file or directory" on any LE / internal
deployment.

Compose `- CADDY_TLS` (no `=value`) is the bare-form passthrough — Compose
reads the value from .env (or the host shell) at up time. No-op when
CADDY_TLS is unset (Caddyfile default kicks in), exact behavior preserved
for cert-file deployments.

Caught by Keboola's first agnes-dev recreate (kids-ai-data-analysis project,
agnes-dev.keboola.com) — VM came up with .env containing
CADDY_TLS="tls petr@keboola.com" but Caddy ignored it and tried to load
the corp PKI cert file.

* docs(changelog): document the CADDY_TLS passthrough fix per discipline rule
2026-04-26 01:46:42 +02:00
Petr Simecek
d7bd710ca2
docs(claude): non-negotiable CHANGELOG.md update rule + [Unreleased] skeleton (#59)
CLAUDE.md gains a "Changelog discipline — non-negotiable" section above
"Git Commits & Pull Requests". Codifies the rule that every PR touching
user-visible behavior must update CHANGELOG.md under [Unreleased] in
the same PR — with concrete instructions for which sections to use,
how to mark breaking changes, and what counts as user-visible.

CHANGELOG.md gets an [Unreleased] skeleton above [0.11.0] so the next
PR has somewhere obvious to land its bullet, plus the inaugural
[Unreleased] entry documenting this very rule (eats its own dog food).

The rule is intentionally strict ("no exceptions, no follow-ups") —
soft "should" rules erode under pressure; binding rules survive PR
churn. Reviewers should bounce PRs that violate it, same as they'd
bounce a PR with no test changes for new logic.
2026-04-26 01:10:32 +02:00
Petr Simecek
598f186eb1
release(0.11.0): reset to pre-1.0 semver + first changelog (#58)
The version = "2.x" strings in earlier pyproject.toml snapshots were
arbitrary placeholders from the initial scaffold (cookiecutter default),
not a reflection of API maturity. Resetting to 0.11.0 to signal pre-1.0
status: public surface (CLI flags, REST endpoints, instance.yaml schema,
extract.duckdb contract) may still shift between minor versions.

CalVer image tags (stable-YYYY.MM.N, dev-YYYY.MM.N) continue from CI;
semver tags (v0.X.Y) are cut at release boundaries and reference the
same commit as a stable-* tag from the same day.

CHANGELOG.md replaces the old CalVer draft format with Keep a Changelog
+ semver. The 0.11.0 entry curates everything currently in main:
- Auth: Workspace groups, password reset, PAT, magic-link, seed admin pwd
- Deploy: keboola-deploy workflow, Caddy/LE/cert-file TLS, dev_instances
  TLS, optional Google OAuth from SM, LOCAL_DEV_MODE, /setup wizard
- CLI: wheel distribution, auto-update, --version, --dry-run, gzip
- Data: remote query (BQ+DuckDB), business metrics, OpenAPI snapshot test
- Security: padak-security.md audit batch + urllib3 + argon2-cffi
- Two BREAKING items called out (Caddy profile rename, Caddyfile default
  cert mode flipped to cert-file)
2026-04-26 01:05:55 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00