Commit graph

143 commits

Author SHA1 Message Date
ZdenekSrotyr
96281f884c docs: implementation plan for customizable welcome prompt 2026-05-03 16:10:48 +02:00
ZdenekSrotyr
b627de8344 feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
  - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
  - Non-BQ instances (Keboola-only, etc.) skip the check.
  - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
  - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
  - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
  - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
  - ai.base_url: provider list + UI hint
  - openmetadata + desktop: 'configurable via /admin/server-config UI' headers
  - corporate_memory: leading note that the schema is editable via UI

Other docs:
  - CHANGELOG.md: comprehensive Unreleased section
  - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
  - README.md: mode-first source table summary
  - docs/architecture.md: per-connector tab UI mention
  - cli/skills/connectors.md: bootstrap rails (parallel to #154)
  - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
  - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
  - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
2026-05-01 20:27:24 +02:00
ZdenekSrotyr
d0b7e122d6 feat(cli): smart local sync — Claude Code SessionStart/SessionEnd hooks + da sync --quiet
The analyst flow becomes a closed loop with the server-curated table catalog:

  - `da analyst setup` writes `<workspace>/.claude/settings.json` with two hooks:
      SessionStart → `da sync --quiet || true`        — pulls fresh RBAC-filtered parquets at session start
      SessionEnd   → `da sync --upload-only --quiet || true` — uploads session jsonl + CLAUDE.local.md
  - `|| true` keeps Claude Code unblocked when the server is down.
  - Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace.
  - `da sync --quiet` rewrites the CLI output for hook consumption — 0 stdout on success, single-line error on failure.
  - Existing settings.json is patched (deep-merged), not overwritten; malformed JSON is reported, not silently overwritten.

Tests cover: workspace bootstrap, hook insertion, malformed-json safety, quiet-mode output shape.
2026-05-01 20:25:27 +02:00
Vojtech
bd7b8c3233
fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template (#154)
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template

Closes #153.

The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.

Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.

## Changes

### config/claude_md_template.txt (+83)

Added a `## Remote Queries (BigQuery)` section covering:

- Discovery first — `da catalog --json | jq '...'` to see all tables
  with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
  - `da fetch` (preferred) — materialize a filtered subset locally,
    query the snapshot, drop when done.
  - `da query --remote` — one-shot server-side execution (cheap probes).
  - `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
  --select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
  DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
  use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
  ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.

### docs/setup/claude_md_template.txt (deleted)

Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
   shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.

### CHANGELOG.md

`### Fixed` and `### Removed` entries under [Unreleased].

## Tested

- Manually walked the diff against `da skills show agnes-data-querying`
  output on a live VM (foundryai-development) — patterns + flags
  match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
  is identical to existing template substitution path so render is
  not at risk.

## Out of scope

- The companion gap that data_description.md often only enumerates
  query_mode: "local" tables (no signal that other modes exist) —
  separate concern, fix likely belongs in the metadata generator
  on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
  `query_mode: "remote"` in the registry — workflow improvement, not
  a code bug.

* chore(release): cut 0.28.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-05-01 12:06:41 +02:00
minasarustamyan
d4ac84dd46
feat(rbac): drop dataset_permissions + users.role + is_public; v19 migration (#150)
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration

BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.

Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
  → drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
  s historic FK constraints. INSERT picks intersection sloupců, takže
  test fixtures s minimal pre-v19 schemou migrate cleanly.

Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
  (TABLE je unconditionally enabled)

API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
  je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
  gate přepnut na can_access(user_id, "table", dataset, conn)

Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
  z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
  (admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()

CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
  (back-compat se starými token.json soubory zachována — pole se ignoruje)

Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
  (can_access_table over resource_grants), test_journey_rbac.py
  (drop access-request flow), test_resource_types.py (drop env-gate
  tests, drop is_public from helpers), test_v2_*.py (drop role-based
  user dicts in favor of id-based + Admin group membership),
  test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
  a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
  test_can_access_table_no_implicit_public (test_rbac.py),
  test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated

Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)

Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.

Plan: ~/.claude/plans/floofy-coalescing-parnas.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): cut 0.27.0

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 22:02:16 +02:00
ZdenekSrotyr
83adf01bde
fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138)
* docs(spec): #134 unify BigQuery access behind BqAccess facade

Brainstorm output for issue #134. Captures:
- root cause (incl. correction of the issue's hypothesis about commit 33a9964)
- BqAccess facade API + project resolution rules
- error contract — typed BqAccessError mapped to HTTP 502 for upstream
  BQ failures, 500 for deployment/config bugs
- migration plan for v2_scan, v2_sample, RemoteQueryEngine
- test rewrite eliminating _bq_client_factory injection point
- E2E verification protocol on agnes-development as success criterion

* docs(spec): #134 revise after first review

Incorporates code-reviewer findings:

Must-fix:
- Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope.
- Reframe v2_scan headline: missing try/except around BQ calls is the
  actual cause of bare 500s, not project resolution (which 33a9964 fixed).
- List two more deferred call sites (extractor.py, register_bq_table)
  with explicit rationale.

Important:
- Drop billing != data clause from cross_project_forbidden heuristic;
  rely only on 'serviceusage' substring. billing != data is normal
  for cross-project setup, was over-classifying.
- Split bq_bad_request into _user (400) and _server (502) variants;
  add sql_origin parameter to translate_bq_error so call sites declare
  whether SQL contains user input.
- Add @functools.cache to BqAccess.from_config; document tests bypass
  via dependency_overrides.
- Replace monkey-patched-classmethod test pattern with
  BqAccess(client_factory=...) injection at construction time. Cleaner
  than today's _bq_client_factory and 1:1 migration shape.
- Keep BqProjects.data (reviewer assumed registry has source_project;
  it doesn't). Multi-project explicitly listed as non-goal with note.

Nice-to-have:
- Add 'Implementation strategy' section: 2 staged commits (bug fix
  alone is revertable; refactor follows).
- Extend E2E protocol to cover all three endpoints, not just /sample.
- Note removal of stale docstring at src/remote_query.py:204.

* docs(spec): #134 revision 3 — incorporates second-round review

Must-fix from second review:
- v2_schema split into two migration cases: _fetch_bq_schema translates
  errors via translate_bq_error; _fetch_bq_table_options preserves its
  swallow-all 'except Exception → return {}' so /schema doesn't 502 on
  partition-info failures.
- RemoteQueryEngine.__init__ now resolves BqAccess lazily (in
  _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests
  in test_remote_query.py would suddenly fail with not_configured.
- translate_bq_error pass-through for BqAccessError is now load-bearing
  (clause 1, before any Google-API branch). bq.client() raises BqAccessError
  for bq_lib_missing/auth_failed; without explicit pass-through those
  fall to 'unknown' and re-raise as bare 500.
- Commit 1 now emits the SAME structured response shape as commit 2 to
  avoid contract churn between commits.
- BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments
  — flagged in CHANGELOG ### Changed.

Editorial:
- sql_origin renamed to bad_request_status with values 'client_error' /
  'upstream_error' (clearer about what the parameter actually decides).
  bq_bad_request_user/_server kinds collapsed to bq_bad_request (400)
  and bq_upstream_error (502).
- CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller;
  unaffected because new bq_access kwarg has default None.
- Added unit/integration tests for the new contracts:
  test_translate_passes_through_BqAccessError,
  test_v2_scan_returns_500_on_bq_lib_missing,
  test_v2_schema_returns_200_with_empty_partition_on_bq_failure,
  test_resolve_succeeds_after_config_set.
- E2E protocol now covers /schema as the fourth endpoint.
- Documented functools.cache-doesn't-cache-exceptions semantics and
  fixture nullcontext-doesn't-close caveat for nested sessions.

* docs(spec): #134 revision 4 — incorporates third-round review

Third reviewer verdict: 'implementation-ready with two trivial edits';
explicitly noted prior rounds did the heavy lifting.

Edits:
1. get_bq_access() module-level function instead of @classmethod
   @functools.cache from_config. Removes the classmethod-cache stacking
   footgun (different Python versions wrap differently) and gives FastAPI's
   dependency introspection a clean function signature. Drops the
   'Do not subclass BqAccess' caveat that no longer applies.

2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample),
   _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema
   (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all
   in commit 1 — preserved as-is, then migrated (still preserved) in commit 2.
   All three endpoints emit the same structured body shape so client parsers
   see one consistent contract throughout the staged rollout. No more
   half-rolled-out window where /sample is bare 500 while /scan is
   structured 502.

* docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification)

Bite-sized TDD tasks. 3 phases, 16 tasks total:

Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints:
  Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan,
  _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options
  preserved untouched. CHANGELOG Fixed entries.

Phase 2 (Commit 2) — BqAccess facade extraction + migration:
  Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up
  (BqProjects, BqAccessError, translate_bq_error, default factories,
  BqAccess class, get_bq_access module-level cached). Task 2.6 adds
  conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema
  to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests
  (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG
  Changed BREAKING + Internal.

Phase 3 — Verification:
  3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual
  E2E on agnes-development per spec protocol → close #134.

Self-review table maps spec sections to implementing tasks; no gaps.

* fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema

Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict
block only) with try/except for google.api_core exceptions, translating to
HTTPException with a structured body shape: {error, message, details}.

Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500
with no body when the SA on agnes-development hit cross-project Forbidden
on serviceusage.services.use.

Also fixes /sample's missing billing_project fallback (the bug 33a9964
fixed for /scan never landed here).

Status code split:
  - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is
    user-derived from req.select/where/order_by.
  - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is
    server-constructed from validated identifiers.
  - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in
    error message (with hint pointing at data_source.bigquery.billing_project),
    else bq_forbidden.

Body shape matches what the upcoming BqAccess refactor (next commit) will
produce, so client-side parsers see one consistent contract throughout
the staged rollout.

_fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty
contract is intentional and survives into the refactor; /schema continues to
return 200 with empty partition info when partition queries fail.

Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema
endpoints exist only to make the test pattern (monkeypatching whole
_fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal
once BqAccess centralizes translation.

* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine

Extracts the duplicated BigQuery-access pattern (project resolution +
client construction + DuckDB-extension session + Google-API error
translation) into connectors/bigquery/access.py. Migrates four
call sites to use it:

- app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan
- app/api/v2_sample.py — _fetch_bq_sample
- app/api/v2_schema.py — _fetch_bq_schema (strict translation),
  _fetch_bq_table_options (preserves swallow-all best-effort contract)
- src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg

The new module exposes:
- BqProjects (frozen dataclass: billing + data project IDs)
- BqAccessError (typed exception with HTTP_STATUS class mapping)
- BqAccess (facade with injectable client_factory/duckdb_session_factory
  for tests; defaults call the real google-cloud-bigquery + DuckDB extension)
- get_bq_access (module-level @functools.cache; FastAPI Depends target)
- translate_bq_error (Google API exception → BqAccessError mapper, with
  BqAccessError pass-through, 'serviceusage'-substring heuristic for
  cross_project_forbidden, and bad_request_status param distinguishing
  user-derived (400) from server-constructed (502) SQL)
- _default_client_factory, _default_duckdb_session_factory

RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests
migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only
RemoteQueryEngine tests need no changes — bq_access defaults to None and
get_bq_access() is only invoked on first BQ call (lazy resolution).
BqAccessError raised internally is translated to RemoteQueryError(
error_type="bq_error") in _get_bq_client to preserve the engine's
existing public contract — CLI and /api/query/hybrid callers see no change.

Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample,
test_v2_schema) migrate from monkey-patching whole _fetch_* functions
to using the new bq_access fixture in tests/conftest.py — which
exercises the REAL translation path through BqAccess + translate_bq_error,
closing the test gap flagged in Task 1.1's review.

Side-effect behavior change: v2_sample's FROM clause now uses the data
project (instance.yaml data_source.bigquery.project), not the conflated
billing_project from Phase 1. Documented in CHANGELOG ### Internal.

BREAKING for deployments combining BIGQUERY_PROJECT env var with
data_source.bigquery.project in instance.yaml — env var now overrides
data project too. See CHANGELOG ### Changed.

Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py,
scripts/duckdb_manager.register_bq_table) explicitly out of scope;
tracked as follow-up.

Removed stale docstring at the previous src/remote_query.py:204
that referenced scripts.duckdb_manager._create_bq_client as the default
BQ client factory (RemoteQueryEngine never actually used that function).

Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py +
tests/test_remote_query.py migrated to bq_access fixture (counts unchanged
or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures
(DB migration tests with unrelated internal_roles DependencyException —
not introduced by this PR).

* fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed)

CI on PR #138 caught: bigquery.Client(...) resolves Application Default
Credentials at construction time; without ADC (CI without SA key, dev
laptop without 'gcloud auth application-default login') it raises
google.auth.exceptions.DefaultCredentialsError synchronously.

Pre-fix _default_client_factory only caught ImportError, so DefaultCredentialsError
propagated as raw exception — and from production endpoints would surface
as bare 500 (the exact failure mode #134 sets out to fix).

Now translates to BqAccessError(kind='auth_failed', details.hint='Run
gcloud auth application-default login...'). Endpoint catch chain returns
HTTP 502 with structured body. Adds unit test
test_raises_auth_failed_on_default_credentials_error.

Third-round spec review flagged this case in passing; the fix didn't land.
CI's auth-less environment surfaced it.

* fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured

Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends
breaks all v2 endpoints for non-BigQuery instances'.

Pre-fix: get_bq_access() raised BqAccessError(not_configured) when
neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set.
Because FastAPI resolves Depends() BEFORE the endpoint body runs, this
exception fires during dep-injection — the endpoint's try/except
BqAccessError clause never gets a chance to catch it. Result: every
v2 request on Keboola-only or CSV-only instances returned bare HTTP
500, even for local-source tables that never touch BigQuery.

Fix: get_bq_access() now returns a sentinel BqAccess with empty
BqProjects and factories that raise BqAccessError(not_configured)
on actual use. Construction succeeds, FastAPI's dep-injection cleanly
yields the sentinel, the endpoint runs. The local-source code path
in build_sample / build_schema / etc. never calls bq.client() or
bq.duckdb_session() (it reads parquet directly), so non-BQ tables
return 200 as before. Only when an endpoint actually tries to query
BQ (source_type == 'bigquery') does the sentinel raise — and the
endpoint's existing except BqAccessError catches it normally,
returning structured 502 with hint.

Test get_bq_access::test_raises_not_configured_when_neither_set
renamed and rewritten to test_returns_sentinel_when_neither_set:
asserts BqAccess is returned, then asserts client() and
duckdb_session() each raise BqAccessError(not_configured) on call.

Test test_does_not_cache_exceptions removed (no longer applicable)
and replaced with test_sentinel_is_cached_per_process documenting
the operator-restart-on-config-change contract.

* docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule)

Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained
customer-specific deployment hostnames, deployment names, and a GCP
project ID that violated CLAUDE.md's vendor-agnostic OSS rule
('Nothing customer-specific belongs in code, configuration defaults,
comments, docs, commit messages, PR titles, or PR bodies').

Replacements:
  agnes-development.groupondev.com -> <your-agnes-host>
  agnes-development                -> <your-dev-instance>
  prj-grp-dataview-prod-1ff9       -> <your-data-project>
  s1_session_landings              -> <bq_table_id>

E2E verification semantics unchanged — operators still run the same
four curls + config flip + retry, just substituting their own host /
deployment name / project / table.

* fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache

Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at
process level, so it captures BigQuery project IDs at first call and
ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints
re-read get_value() on every request, so admin /api/admin/server-config
saves (which call instance_config.reset_cache()) hot-reloaded the BQ
project. Without this fix, my refactor silently regresses that contract
— operators editing instance.yaml via the admin UI would see no effect
on v2 endpoints until container restart.

instance_config.reset_cache() now also calls
connectors.bigquery.access.get_bq_access.cache_clear() (lazy import,
swallowed if connectors module isn't loaded — keeps instance_config
usable in isolated unit tests).

Adds test_instance_config_reset_cache_invalidates_get_bq_access as
regression guard. Updates CHANGELOG Internal entry to mention the
hot-reload contract + the not-configured sentinel behavior (round-3
fix from Devin BUG_0001 was previously only in commit message).

* fix(bq_access): surface not_configured before identifier validation + plan path genericize

Devin BUG_0001 + BUG_0002 round 5 on PR #138.

BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md
vendor-agnostic rule. Replaced with '<worktree-root>' placeholder.

BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel
BqAccess (BQ not configured), the empty bq.projects.data was reaching
validate_quoted_identifier first and raising ValueError -> endpoint
mapped to HTTP 400 'unsafe_identifier' instead of structured 500
'not_configured' with hint.

Each fetch helper now checks 'if not bq.projects.data: bq.client()' as
the first step, which triggers the sentinel's BqAccessError(not_configured).
Endpoint catches the typed error and returns HTTP 500 with hint pointing
at data_source.bigquery.project. Best-effort _fetch_bq_table_options
returns {} silently in this case (preserves the swallow-all contract).

* fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match

Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery
extension is a C++ plugin making its own HTTP calls — when BQ returns
403, it throws duckdb.IOException with the BQ error embedded as text,
not gax.Forbidden. translate_bq_error's isinstance checks would miss
these, falling to case 7 → bare 500 in production for v2_scan, v2_sample,
and v2_schema (the bigquery_query() paths).

Fix: last-resort string-match heuristic before the re-raise. 'Forbidden'
/ '403' / 'Bad Request' / '400' in the lowercased message classifies via
the same kind hierarchy. The 'serviceusage' substring still distinguishes
cross_project_forbidden from bq_forbidden. Specific enough that random
exceptions without HTTP-error keywords still re-raise.

Adds 4 unit tests covering the new heuristic + the 'don't swallow random
exceptions' invariant.

* chore(release): cut 0.22.0

PR #138 contains issue #134 user-visible behavior changes:
- BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml
  data_source.bigquery.project for v2 endpoints (previously
  RemoteQueryEngine billing only).
- Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate,
  /schema when BigQuery raises Forbidden/BadRequest (was bare 500).
- Internal: BqAccess facade refactor unifying four duplicate BQ-access
  call sites; instance_config.reset_cache() now invalidates BqAccess
  cache too so admin server-config saves hot-reload BQ project IDs.

Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
2026-04-30 10:11:20 +02:00
Vojtech
38f6b639d2
feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136)
Cuts release 0.20.0.

## Highlights
- X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation)
- Error pages (HTML + JSON 500) surface request_id for support tickets
- Dev debug toolbar gated by DEBUG=1 — fastapi-debug-toolbar with custom DuckDBPanel
- Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls
- Telegram bot drops bot.log file — stdout only (BREAKING)

## Devin findings addressed
- BUG_0001: .env.template no longer claims FastAPI debug=True
- BUG_0002: subprocess extractor logs INFO to stderr again
- ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before)
- BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production
- BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor — follow-up to update CLAUDE.md if needed

See CHANGELOG [0.20.0] for full notes.
2026-04-29 22:54:21 +02:00
ZdenekSrotyr
b7a1795834
feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135)
Bundles 4 issues:
- #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators)
- #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation)
- #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS)
- #89 — OpenMetadataClient verify=True default (BREAKING for self-signed)

Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.
2026-04-29 22:06:30 +02:00
minasarustamyan
c940593a90
feat(auth): Google Workspace group prefix filter + system mapping (#131)
Three new env vars wire the Google OAuth callback to a configurable Workspace prefix and route admin/everyone Workspace groups onto the seeded system rows: AGNES_GOOGLE_GROUP_PREFIX, AGNES_GROUP_ADMIN_EMAIL, AGNES_GROUP_EVERYONE_EMAIL. Login gate redirects users with no prefix-matching group to /login?error=not_in_allowed_group. BREAKING: auto-Everyone membership for new users removed. Admin UI/API are read-only on Google-managed groups. See docs/auth-groups.md.
2026-04-29 14:08:04 +02:00
ZdenekSrotyr
1824b9dd9c
feat(admin): #108 M1 — BigQuery table registration in UI + CLI (#119)
Issue #108 Milestone 1. Adds BigQuery table registration via /admin/tables UI and `da admin register-table` CLI without hand-editing table_registry. POST /api/admin/register-table/precheck for round-trip validation. --dry-run flag on CLI. Audit-log entries on register/update/unregister. PUT /api/admin/registry/{id} now preserves registered_at (closes #130).
2026-04-29 13:18:31 +02:00
ZdenekSrotyr
61f6b8d2d5
feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)
Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-29 09:18:55 +02:00
PavelDo
e1108b6112
feat(memory): corporate memory v1+v1.5 + 0.15.0 (#72)
Adds corporate memory v1 (verification flywheel + contradiction detection + confidence scoring) and v1.5 (audience-based distribution + per-item privacy + admin curation). Server: GET /api/memory/bundle returns mandatory + ranked-approved items within a token budget; POST /api/memory/admin/mandate accepts an audience field gated against user_group_members; /api/memory/stats uses SQL aggregation. CLI: da sync writes received items to .claude/rules/km_*.md. Verification detector extracts knowledge candidates from session JSONL files. Auto-tagging via Haiku when ai: is configured. Adapted from the v9-era branch onto v13/v14 RBAC: _is_privileged_viewer + _effective_groups now query user_group_members JOIN user_groups; require_role(Role.KM_ADMIN) replaced with require_admin (km_admin collapsed into admin). Schema v15: knowledge_items context-engineering columns + knowledge_contradictions + session_extraction_state. Schema v16: verification_evidence. Cuts release v0.15.0 (also bundles #116 /me/debug page).
2026-04-29 07:16:22 +02:00
ZdenekSrotyr
2e1dfb7553
feat(v2): claude-driven fetch primitives + 0.14.0 (#102)
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
2026-04-29 01:07:19 +02:00
David Rybar
cfe5771856
feat(dev): add Windows PowerShell wrapper for local development (#80)
Adds `scripts/run-local-dev.ps1` as a sibling of the bash script for Windows operators. Same compose stack (`docker-compose.yml` + `.dev.yml` + `.local-dev.yml`), same up/down/logs subcommands, same LOCAL_DEV_GROUPS default seeding. Restores caller's working directory and LOCAL_DEV_GROUPS on every exit path (success, error, Ctrl+C). Avoids advanced-script promotion so `up -d` / `down -v` reach docker compose instead of being eaten by -Debug/-Verbose.
2026-04-28 23:59:11 +02:00
ZdenekSrotyr
5f6bb7a4b2
fix(security+ops) + release(0.12.1): #82 #85 #87 hardening + cut 0.12.1 (#104)
* fix(security+ops): #82 #85 #87 — auth hardening, API validation, deploy posture

Security and operational hardening across three issue groups:

- M23: docker-compose.override.yml → docker-compose.dev.yml (BREAKING, prod foot-gun)
- C13: Container runs as non-root user 'agnes' (USER directive in Dockerfile)
- M21: Docker resource limits (mem_limit, cpus) on app + scheduler
- M22: Caddyfile security headers (X-Frame-Options, X-Content-Type-Options, Referrer-Policy, -Server)
- M17: /api/health split into minimal (unauth) + /api/health/detailed (auth) (BREAKING)
- M26: release.yml restricts build-and-push to main + workflow_dispatch; paths-ignore for docs

- C2: table_id traversal validation on /api/data/{table_id}/download
- M4: Upload streaming (chunk-read + temp file) instead of full-buffer; /local-md hashed filename

- C5: reset_token removed from POST /api/users/{id}/reset-password response
- C8: Startup WARNING when no user has password_hash (bootstrap window visible)
- M9: Audit log on failed web form login (mirrors /auth/token endpoint)
- M10: Atomic magic-link consume via compare-and-swap (CONSUMED: marker + DuckDB conflict catch)

Also: SSRF protection on /api/admin/configure (#46), memory stats SQL aggregation (#90)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF 169.254.x.x + IPv6 multicast; M10 marker cleanup safety

Review fixes:
- Add 169.254.0.0/16 (link-local, cloud metadata) to SSRF regex — was
  missing, allowing requests to AWS/GCP/Azure metadata endpoints
- Add ff[0-9a-f]{2}: (IPv6 multicast) to SSRF regex
- M10: wrap Step 3 (CONSUMED marker cleanup) in try-except with
  warning log — prevents unhandled exception if DB write fails after
  successful token consumption
- Add test for 169.254.169.254 SSRF rejection

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): SSRF IPv6 bypass, CLI health endpoint, upload FD leak

Address Devin Review findings on PR #104:

1. SSRF IPv6 bypass: Replace hostname regex with DNS resolution +
   ipaddress module checks. The old regex patterns like `fe80:` only
   matched up to the first colon, missing real IPv6 addresses like
   `fe80::1`, `fc00::1`, `ff02::1`. The new approach resolves the
   hostname via getaddrinfo and checks each resulting IP against
   ipaddress.is_private/is_loopback/is_link_local/is_reserved/is_multicast.

2. CLI commands broken: `da setup test-connection`, `da setup verify`,
   `da diagnose`, `da status` all called /api/health expecting the old
   format (status=="healthy", services dict). Now they call
   /api/health/detailed for service-level checks (with graceful fallback
   to the minimal endpoint when auth is not configured).

3. Temp file handle leak: _stream_to_temp returns an open
   NamedTemporaryFile; callers now close it before shutil.move() to
   prevent FD leaks until GC.

Also adds IPv6 SSRF test cases (loopback, link-local, unique-local,
multicast) with mocked DNS resolution for test environment independence.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* fix(review): download regex blocks hyphenated IDs; document health split

Address Devin Review round-3 findings on PR #104:

1. _SAFE_IDENTIFIER regex blocked hyphenated table IDs: The download
   endpoint used the strict SQL-identifier regex which does not allow
   dots or hyphens, but Keboola table IDs like in.c-crm.orders
   contain both. Switched to _SAFE_QUOTED_IDENTIFIER which allows dots
   and hyphens while still blocking path-traversal chars (/, .., \)
   and quote/control characters. Added test for hyphenated/dotted IDs.

2. Documented health endpoint split in DEPLOYMENT.md: Added Health
   checks & external monitoring section explaining both endpoints
   (minimal unauth /api/health vs authenticated /api/health/detailed)
   and how to wire external monitoring tools to the detailed endpoint
   with a PAT.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

* release(0.12.1): cut hotfix for snapshot integrity + #82/#85/#87 hardening

* fix(security): apply CAS pattern to password reset confirm (#82/M10 follow-up)

Devin review on the rebased PR flagged the asymmetry: magic-link verify
got the atomic compare-and-swap pattern in the original M10 fix, but
password reset confirm at /auth/password/reset/confirm was still using
read-validate-clear. Two concurrent POSTs with the same valid reset
token could both succeed in setting different new passwords (last-write-
wins). Lower severity than the magic-link race because the attacker
would need the reset token AND to race the legitimate user, but the
asymmetry was a polish gap.

Mirrors app/auth/providers/email.py::_consume_token CAS exactly: write
unique CONSUMED:<random> marker via UPDATE...WHERE token=old_token, then
SELECT to verify our marker won, then proceed. Only the winner clears
the marker and applies the password change.

New regression test_concurrent_reset_only_one_wins in
tests/test_password_flows.py::TestResetConfirm pins the contract: two
ThreadPoolExecutor workers + Barrier hit /reset/confirm with the same
token; exactly one gets 302 (password applied), the other gets 200 with
'Invalid or expired'. Sanity-checked against the pre-CAS code — both
POSTs got 302 (race confirmed).

---------

Co-authored-by: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-28 19:57:30 +02:00
ZdenekSrotyr
e9d7af3cce feat(rbac+marketplace): RBAC v13 + Claude Code marketplace + #81/#83/#44 hardening
This squashes 13 commits from ma/staging plus a small docstring translation
into a single coherent unit. Three workstreams.

== RBAC v13 redesign ==
- Drops core.viewer/analyst/km_admin/admin hierarchy and the
  internal_roles / group_mappings / user_role_grants / plugin_access tables.
- Replaced by user_group_members + resource_grants. Atomic v12→v13 backfill
  wrapped in BEGIN/COMMIT; ROLLBACK leaves schema_version at 12 for retry.
- Two authorization primitives in app.auth.access:
    require_admin                        — Admin-group god-mode
    require_resource_access(rt, "{path}") — entity-scoped grants
  Single DB lookup per request; no session cache; no implies BFS.
- /admin/access UI (single page) replaces /admin/role-mapping +
  /admin/plugin-access. CLI `da admin group/grant *` replaces
  `da admin role/mapping/grant-role/revoke-role/effective-roles`.
- ResourceType.TABLE listing-only — admins can record table grants,
  runtime enforcement still flows through legacy dataset_permissions
  (migration plan in docs/TODO-rbac-data-enforcement.md).

== Claude Code marketplace ==
- Aggregated /marketplace.zip + /marketplace.git/* (PAT-gated,
  RBAC-filtered, content-addressed cache via dulwich).
- Admin god-mode dropped on the marketplace surface — admins curate
  their own view via grants like everyone else.
- Bare-repo cache materializes per RBAC-filtered ETag; stale entries
  not pruned in this iteration (disclaimed in git_backend.py docstring).

== #81 #83 #44 security/ops hardening ==
- #81 Group A — orchestrator ATTACH allow-listing (extension/url/alias).
- #81 Group B — Keboola extractor 3-state exit codes:
    0 success / 1 total fail / 2 PARTIAL fail
  Sync API logs PARTIAL FAILURE alert on exit 2. Operators with binary
  alerting must teach it the new partial signal.
- #81 Group C — schema v10 view_ownership; rejects silent overwrite
  of a prior connector's view name on collision.
- #81 Group D — extractor-side identifier validation.
- #83 — Jira webhook fail-closed when JIRA_WEBHOOK_SECRET unset
  + path-traversal fix.
- #44 — entire /api/scripts/* surface is admin-only (planted-script +
  sandbox-bypass risk closed).

== Web UI polish + deploy fix ==
- /admin/access: live grant-count badges (no stale snapshot revert),
  shared-header CSS link added to /catalog and /admin/{tables,permissions},
  per-resource-type colored stripes.
- docker-compose.host-mount.yml: bind,rbind so dual-disk hosts don't
  silently shadow sub-mounts and write state to the wrong disk.

== OSS vendor-neutralization (waves 1+2) ==
- scripts/grpn/ → scripts/ops/. Customer-specific identifiers
  (project IDs, internal hostnames, dev/prod VM IPs, brand names)
  replaced with placeholders across code, docs, Terraform, Caddyfile,
  OAuth probe, and planning docs. Downstream infra repos that copied
  scripts/grpn/agnes-tls-rotate.sh or agnes-auto-upgrade.sh must
  update the path.

== Translation ==
- src/repositories/user_groups.py::ensure_system docstring translated
  from Czech to English for codebase consistency.

Co-authored-by: Mina Rustamyan <mina@keboola.com>
2026-04-28 14:25:04 +02:00
ZdenekSrotyr
4e4d2a39e6
chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94)
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 4 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module that is
   the customer-facing API surface in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.
2026-04-27 20:24:34 +02:00
Petr Simecek
83ced81966
feat(auth): unified role management — UI + REST API + CLI + schema v9 (v0.11.4) (#73)
* feat(auth): v9 schema — unified role management foundation (WIP)

Tasks 1-5, 10 of the role-management-complete plan. Foundation only,
follow-up commits add REST API, CLI, UI, and tests.

Schema v9:
- user_role_grants table: direct user → internal_role mapping
  (complementary to group_mappings). Drives PAT/headless auth and
  persists across sessions. Source field tracks 'direct' vs auto-seed.
- internal_roles.implies (JSON): transitive role hierarchy. core.admin
  implies core.km_admin → core.analyst → core.viewer. Resolver does BFS
  expand at lookup time.
- internal_roles.is_core (BOOL): distinguishes seeded core.* hierarchy
  from module-registered roles. UI renders them differently.
- v8→v9 migration: ADD COLUMN, CREATE TABLE, _seed_core_roles +
  _backfill_users_role_to_grants, then NULL legacy users.role values.
  DuckDB FK constraint blocks DROP COLUMN — sloupec zůstává jako
  deprecated artifact (UserRepository ignoruje), fyzický drop deferred.

Resolver:
- Regex extended to allow dotted namespace (core.admin,
  context_engineering.admin), max 64 chars total.
- expand_implies(role_keys, conn): BFS over implies JSON column.
- resolve_internal_roles signature gains optional user_id parameter;
  unions group-mapping resolution with user_role_grants direct grants
  before implies expansion.

require_internal_role:
- Two-path resolution: session cache (OAuth) → DB grants (PAT/headless
  fallback). PAT clients now legitimately satisfy gates without the
  OAuth round-trip, fixing the v8 limitation where every PAT-callable
  admin endpoint needed require_role(Role.ADMIN) instead of
  require_internal_role(...).

Backward-compat:
- require_role(Role.X) and require_admin become thin wrappers over
  require_internal_role(f"core.{role}"). Implies hierarchy preserves the
  legacy "at least this level" semantics automatically — no per-level
  comparison code needed.
- src/rbac.py helpers (is_admin, has_role, get_user_role,
  set_user_role, can_access_table, get_accessible_tables) all read from
  the resolver via _get_internal_role_keys.
- UserRepository.create() and update() now mirror role changes into
  user_role_grants via _grant_core_role helper. Preserves API while
  making the new table the source of truth.
- UserRepository.delete() pre-deletes user_role_grants rows
  (FK cascade — DuckDB doesn't auto-cascade).
- count_admins() reads user_role_grants ⨝ internal_roles instead of the
  now-NULL users.role column.

First consumer:
- app/api/admin.py module-level docstring documents the v9 pattern for
  future module authors. Existing require_role(Role.ADMIN) callsites
  flow through the wrapper; no behavior change for OAuth callers, and
  PAT callers gain access via direct grants.

Tests: full suite green (1396 passed, 6 skipped). Existing tests
exercise the new pathway transparently because UserRepository.create
auto-grants. New test_pat_caller_with_direct_grant_passes pins the
PAT-aware contract.

Schema: v9 (was v8). pyproject.toml + CHANGELOG bump deferred to the
final PR-prep commit.

* feat(auth): role management complete — REST API + CLI + UI + docs (v0.11.4)

Sjednocuje legacy users.role enum s v8 internal-roles foundation pod jeden
model s implies hierarchií, dodává admin UI + REST API + CLI pro správu
group mappings i přímých user grants, a dělá require_internal_role
PAT-aware tak, aby admin endpointy fungovaly uniformly napříč OAuth
i headless callery.

REST API (app/api/role_management.py, +496 LOC):
- 8 endpointů pod /api/admin: internal-roles list, group-mappings CRUD,
  users/{id}/role-grants CRUD, users/{id}/effective-roles debug.
- Všechny gated require_internal_role("core.admin"). Audit-log na každé
  mutaci (role_mapping.created/deleted, role_grant.created/deleted).
- Last-admin protection: refuse to delete the final core.admin grant
  (mirrors users.py:count_admins protection).
- Nový UserRoleGrantsRepository v src/repositories/user_role_grants.py.

CLI (cli/commands/admin.py extension, +258 LOC):
- da admin role list / show <key>
- da admin mapping list / create <group-id> <role-key> / delete <id>
- da admin grant-role <email> <role-key>
- da admin revoke-role <email> <role-key>
- da admin effective-roles <email>
- Všechno přes typer + PAT auth, --json flag, response-shape tolerantní.

UI (admin_role_mapping.html + admin_user_detail.html + nav + user list):
- Nová stránka /admin/role-mapping: internal_roles read-only table +
  group_mappings table with create/delete forms.
- Nová stránka /admin/users/{id}: core role single-select + capabilities
  multi-checkbox + effective-roles debug (direct + group + expanded).
- Existing user list dostává "Detail" link na novou stránku.
- Nav link na /admin/role-mapping.

Tests: +85 nových testů přes 4 nové soubory:
- test_schema_v9_migration.py (8) — fresh install + v8→v9 backfill +
  legacy column NULL semantics + unknown-role fallback + invariants.
- test_api_role_management.py (33) — všech 8 endpointů, happy + error
  paths, audit-log assertions, last-admin protection.
- test_cli_admin_role.py (25 + 1 conditional) — typer subcommands,
  text + json output, PAT integration smoke.
- test_admin_role_mapping_ui.py (9) + test_admin_user_capabilities_ui.py (10)
  — page rendering, auth gating, form contracts, JS hooks.
Full suite: 1482 passed, 6 skipped (was 1396 → +86, žádné regrese).

Docs:
- docs/internal-roles.md kompletní rewrite — odstranil "no UI yet",
  přidal hierarchy diagram, dual-path resolution, dotted-namespace
  convention, admin workflow přes UI/CLI/REST, refresh semantics
  for group mappings vs direct grants, migration notes.
- CLAUDE.md schema v8 → v9.
- CHANGELOG.md [0.11.4] s BREAKING marker pro users.role NULL
  semantics + complete Added/Changed/Removed/Internal sekce.
- pyproject.toml: 0.11.3 → 0.11.4.

Sequencing: po mergi tohoto PR Pabu rebasuje pabu/local-dev (PR #72)
na main, jeho schema migrations se posouvají z v9/v10/v11 na v10/v11/v12.

Implementation breakdown:
- Sequential (já): foundation tasks — schema v9, resolver, PAT-aware
  require_internal_role, backward-compat wrappers, rbac refactor,
  UserRepository auto-grant.
- Parallel sub-agents (3 worktrees, ~10 min): REST API, CLI, UI.
- Sequential (já): integrace, docs/CHANGELOG/version, schema tests,
  fullsuite verification.

* fix(auth): address Devin review on PR #73 — three regressions

Three concrete bugs caught in Devin's PR review, all fixed in this commit.

1. **users.role hydration on read** (the big one):
   v8→v9 migration NULLs users.role for every existing user, but a long
   tail of read sites still inspect user["role"] directly:
   - app/web/templates/_app_header.html:15 — admin nav gate
   - app/web/templates/_app_header.html:36-37 — role badge in dropdown
   - app/web/router.py:319-321 — UserInfo.is_admin/is_analyst/is_privileged
   - app/web/router.py:489 — corporate memory is_km_admin
   - app/api/catalog.py:54 — admin "see all tables" bypass
   - app/api/sync.py:215 — admin "see all sync states" bypass

   Without a fix, every existing admin loses the entire admin nav (and
   API admin bypasses) immediately after upgrade — a serious regression.

   Fix: new helper _hydrate_legacy_role() in app/auth/dependencies.py
   maps the highest-level core.* grant back into user["role"] as the
   legacy enum string. Called from get_current_user() on both auth paths
   (LOCAL_DEV_MODE + JWT/PAT). Idempotent — skips when role is already
   populated. Net effect: every pre-v9 callsite keeps working transparently
   for both OAuth and PAT callers, with one extra DB round-trip per
   authenticated request (same cost as the existing PAT-aware
   require_internal_role fallback).

   3 regression tests in tests/test_schema_v9_migration.py:
   - test_hydration_recovers_role_from_user_role_grants
   - test_hydration_returns_highest_grant (multi-grant → highest wins)
   - test_hydration_falls_back_to_viewer_when_no_grants (safe fallback)

2. **CLI effective-roles TypeError**:
   API returns direct/group as List[Dict] (RoleGrantResponse-shaped),
   but the CLI did ', '.join(direct) which raises TypeError on dicts.
   Tests masked it because mocks used bare string lists. Replaced
   raw .join() with a _names() helper that extracts role_key from
   each item, falling back to str() for legacy mock shapes.

3. **UI template field-name mismatch**:
   admin_user_detail.html JS reads data.groups but the API serializes
   the field as group (singular, per EffectiveRolesResponse pydantic).
   Currently benign because the API always returns group:[], but the
   field would silently disappear once the group-derived view is wired
   up. Added data.group as the primary lookup, kept the legacy aliases
   for shape-drift tolerance.

Full suite: 1485 passed (was 1482, +3 hydration tests), 6 skipped, no
regressions.

* fix(auth): Devin review #2 + UX self-service + RBAC docs rename

Three threads landed in one commit because they share the same
auth/role surface and CHANGELOG entry.

Devin review #73 second round (2 actionable findings):

- _hydrate_legacy_role no longer short-circuits on truthy users.role.
  The role-management endpoints (POST/DELETE /api/admin/users/{id}/
  role-grants + the changeCoreRole UI flow) only mutate
  user_role_grants — they don't update the legacy column. The early
  return trusted that stale value, so a user downgraded via the new
  REST/UI kept role="admin" in their dict on subsequent requests,
  which fooled _is_admin_user_dict (src/rbac.py) and the catalog/sync
  admin-bypass short-circuits into retaining elevated table access
  even though require_internal_role correctly denied the API gates.
  Always re-resolves now, making user_role_grants the single source
  of truth on every authenticated request. Cost: one DB round-trip
  per request — same as the existing PAT-aware fallback. Pinned by
  test_hydration_ignores_stale_legacy_role_after_grant_revoke.

- Dev-bypass (app/auth/dependencies.py) and OAuth callback
  (app/auth/providers/google.py) now pass user_id to
  resolve_internal_roles so direct grants land in
  session["internal_roles"] alongside group-mapped roles. Pre-fix,
  every admin-gated request fell through to the per-request DB
  fallback inside require_internal_role and the dev-bypass log line
  read "resolved 0 internal role(s)" for an obviously-admin user.
  test_session_internal_roles_populated updated to assert union.

User-visible UX (also addresses local-test feedback):

- HTTP 500 on /admin/users post-v8→v9 migration — UserResponse.role
  is required str, but legacy users.role was NULL-ed by the
  migration. _to_response in app/api/users.py now routes every dict
  through _hydrate_legacy_role; same fix lifts the silent no-op of
  last-admin protection in update_user/delete_user (the role-equality
  short-circuits would skip the count_admins guard for migrated
  admins). Three regression tests under TestAPIUsersPostMigration.

- /profile is now a real self-service detail page for *every*
  signed-in user (not just admins). Three new server-side sections:
  Effective roles (resolver output as chip cloud), Direct grants
  (rows in user_role_grants with source label), Roles via groups
  (which Cloud Identity / dev group grants which role for the
  current user). Non-admins finally see *why* a feature is or isn't
  accessible. Admins additionally see a deep-link to
  /admin/users/{id} for editing their own grants.

- /admin/role-mapping group-id picker. New "Known groups" panel
  above the create form: clickable chips for the calling admin's
  own session.google_groups (tagged "your group") merged with
  external_group_ids already used in existing mappings (tagged
  "already mapped"). Click a chip → fills the form. Empty-state
  copy points operators at LOCAL_DEV_GROUPS / Google sign-in
  instead of leaving them to guess Cloud Identity opaque IDs from
  memory.

Operational fixes:

- Scheduler log-noise: every cron tick produced a
  POST /auth/token 401 because the auto-fetch fallback called the
  endpoint with just an email (no password) and silently fell
  through. Removed the broken path entirely. Operators set
  SCHEDULER_API_TOKEN (long-lived PAT) in production; in
  LOCAL_DEV_MODE the dev-bypass auto-authenticates the un-tokenized
  request, so jobs continue to work.

Docs:

- docs/internal-roles.md → docs/RBAC.md (git mv preserves history).
  Standard industry term, more discoverable for engineers grepping
  for RBAC in a new repo. Restructured: Quickstart-by-role
  (operator / end-user / module author), step-by-step
  Module-author workflow with code examples (register key, gate
  endpoint, declare implies, write contract test), naming pitfalls,
  refresh semantics. CLAUDE.md gets a new
  "Extensibility → RBAC" section pointing contributors at the doc
  before they add gated endpoints. Cross-refs in app/api/admin.py
  + tests/test_role_resolver.py updated.

Tests: 293 in the auth/role/scheduler/UI test set passed, 0 regressions.

* fix(auth): Devin review #3 — login flows + RBAC docs

Two new findings on commit 7d1c048, both real and addressed.

Finding 1 (BUG, HTTP 500): every auth login flow loaded users via
UserRepository.get_by_email and passed user["role"] straight to
create_access_token, Pydantic response models, and _set_login_cookie
without going through _hydrate_legacy_role. Post-v9 the legacy column
is NULL for migrated users, and TokenResponse.role is a required str —
so POST /auth/token raised ValidationError → HTTP 500 for any v8-admin
trying to log in via password. Same root cause produced non-crashing
but semantically wrong JWTs (role: null) from Google OAuth, password
web flows, and email magic-link verification.

Fix: hydrate inline in every login flow before reading user["role"]:
- app/auth/router.py — POST /auth/token (the crash site)
- app/auth/providers/google.py — OAuth callback (was just stale JWT)
- app/auth/providers/password.py — 5 flows: JSON login, web login,
  JSON setup, web reset confirm, web setup confirm
- app/auth/providers/email.py — centralized in _consume_token,
  covers both /verify endpoints

New regression class TestAuthLoginFlowsPostMigration pins both the
no-crash and the correct-role contracts for all four legacy levels
(viewer/analyst/km_admin/admin) on POST /auth/token.

Finding 2 (DOCS): docs/RBAC.md showed register_internal_role() being
called with implies=[...], but the function signature is (key, *,
display_name, description, owner_module). A module author copying the
example would TypeError at import time. The implies field on
internal_roles IS honored at runtime by expand_implies, but the
registry-side write path (register_internal_role + InternalRoleSpec +
sync_registered_roles_to_db) doesn't exist yet — implies is currently
seeded only for the core.* hierarchy via _seed_core_roles in src/db.py.

Rewrote the Implies hierarchy and Module-author workflow sections to
document what's actually supported in 0.11.4 and what a future change
would need to add. The "for cross-module hierarchies, register each
level + grant both" pattern works today.

Tests: 322 in the auth/role/scheduler/UI/password test set passed,
0 regressions.

* fix(db): _seed_core_roles actually runs on every connect (Devin review #4)

Devin flagged that the docstring on `_seed_core_roles` promised per-connect
execution as a safety net for accidental DELETEs and in-code seed changes,
but the only call sites lived inside `if current < SCHEMA_VERSION:` — so
once a DB was on v9 the function never ran again, and the docstring lied.

Picked option (b) from the review (actually call it on every startup) over
option (a) (fix the docstring) because the safety net is genuinely useful:
- recovery from accidental admin DELETE on internal_roles,
- in-code _CORE_ROLES_SEED tweaks (display_name/description/implies)
  ship without a manual SQL deploy,
- fresh installs and migrations stop needing their own seed call sites.

Tail call gated by `get_schema_version(conn) <= SCHEMA_VERSION` so the
future-version-is-noop rollback contract still holds — a v9 binary won't
touch a DB that's been upgraded past v9.

Test coverage: new TestSeedCoreRolesSafetyNet class (3 tests) pins the
three contracts — deleted row re-seeds, mutated display_name re-syncs
from in-code seed, applied_at on schema_version doesn't churn on
already-current DBs. Existing TestMigrationSafety::test_future_version_is_noop
still passes (verified against the gating logic).
2026-04-27 02:23:01 +02:00
Petr Simecek
6c36b26979
release(0.11.3): internal roles + external→internal group mapping (foundation) (#71)
* feat(auth): internal roles + external→internal group mapping (foundation)

Two-layer authorization model: external Cloud Identity groups (org-managed)
get mapped onto internal Agnes-defined capabilities (app-managed) via an
admin-curated many-to-many table. Per-request permission checks read off
the session — no DB hit. Refresh requires re-login.

Schema v8 — new tables:
- internal_roles (id, key UNIQUE, display_name, description, owner_module, …)
  — app-defined capabilities like 'context_admin'. Modules self-register at
  import; the startup hook syncs the registry into this table (idempotent).
- group_mappings (id, external_group_id, internal_role_id FK, …)
  — admin-managed bindings, UNIQUE(external_group_id, internal_role_id).

app/auth/role_resolver.py — new module:
- register_internal_role(key, display_name, description, owner_module)
  Module-author entry point. lower_snake_case key, immutable, validated.
  Same key + same fields = no-op (re-import safe); same key + different
  fields = ValueError so two modules can't silently overwrite each other.
- sync_registered_roles_to_db(conn) — startup reconciliation. Inserts new
  keys, updates drifted metadata, never deletes (preserves mappings).
- resolve_internal_roles(external_groups, conn) — joins group_mappings.
  Sorted, deduplicated role-key list. Plugged into google_callback +
  dev-bypass branch in get_current_user.
- require_internal_role('key') — FastAPI dependency factory; reads
  session.internal_roles; 403 with explicit message when missing.

Resolution runs at sign-in only (Google callback + LOCAL_DEV_GROUPS change
in dev-bypass) — same semantics as session.google_groups. No admin UI yet;
mappings created via repository directly until follow-up PR ships UI.

21 new tests in tests/test_role_resolver.py: register/list, idempotency,
collision detection, key-format validation; sync insert/update/no-delete;
resolve empty/single/many-to-many/malformed-input; e2e via
LOCAL_DEV_GROUPS — gated endpoint allowed/denied + direct session-cookie
inspection. Full sweep: 178/178 passed across auth + db + repo tests.
(Two pre-existing test_catalog_export.py failures verified unrelated.)

* fix(auth): polish review feedback — first-request dev populate + PAT doc

Two follow-ups from a code-reviewer pass on the foundation commit before
opening the PR:

- Dev-bypass populates session["internal_roles"] on the first request
  after sign-in, not just when external groups change. The previous
  guard only resolved when groups_changed=True, which left a hole for
  the LOCAL_DEV_GROUPS=`""` (explicit empty) flow: target=[],
  current=None, neither write branch fires, internal_roles stays
  unset, and require_internal_role then 403s with no roles to check
  against. The OAuth callback writes session["internal_roles"]
  unconditionally on sign-in (even []); dev-bypass now matches that
  semantics. Adds a single-pass populate gated on the key being
  absent from the session, so subsequent same-state requests still
  no-op (cheap session lookup, no resolver call).

- Document that internal roles are session-scoped and PAT/headless
  clients will get 403 from any require_internal_role(...) endpoint.
  Same constraint already applies to session.google_groups (PAT JWTs
  deliberately don't snapshot group memberships — they could change
  after issuance with no way to re-sign), but the doc didn't surface
  this — an operator pointing a CLI at a role-gated endpoint would
  see 403 with no clue why. New "PAT and headless requests" section
  spells out the constraint, the rationale, and the three escape
  valves (use users.role for the gate; route through OAuth; wait for
  the planned `da admin grant-role` CLI helper).

54 auth tests still pass locally (21 role-resolver + 33 existing
auth-provider).

* release(0.11.3): cut release for the internal-roles foundation

Bumps pyproject.toml 0.11.2 → 0.11.3 and renames CHANGELOG's
[Unreleased] section to [0.11.3] — 2026-04-26 (with a fresh
empty [Unreleased] skeleton appended). Adds the matching
[0.11.3] link reference at the bottom of CHANGELOG so the
section heading renders as a hyperlink to the GitHub release
page once the tag lands.

The bullet itself is unchanged content; the rephrasing of
"dev-bypass when external groups change" → "dev-bypass —
populates on first request and whenever external groups
change, mirroring the OAuth callback's always-write
semantics" reflects the polish committed in d590579, plus
the appended PAT/headless caveat pointing at the doc
section that landed in the same polish pass.

* fix(auth): address review feedback from Pavel — PAT-specific 403, audit logs, hardening

Round-2 polish over the internal-roles foundation, addressing Pavel's review
on PR #71. No behavior change for the happy path; tightens the safety rails
and makes the failure modes self-explanatory.

User-visible:
- require_internal_role now distinguishes "no session" (Bearer/PAT caller)
  from "signed in but missing role" and surfaces a PAT-specific 403 detail
  in the first case ("This endpoint needs an interactive (OAuth) session
  — Bearer/PAT tokens do not carry session-resolved roles by design").
- docs/internal-roles.md documents deactivate+reactivate as the supported
  "force re-resolve now" lever for users that can't be made to log out.

Internal hardening:
- INFO-level audit log on every successful resolve (OAuth callback +
  dev-bypass) so a wrong-role complaint is debuggable from the log alone.
- Startup warning when SESSION_SECRET is shorter than 32 chars, matching
  the existing JWT_SECRET_KEY gate — both HMAC surfaces sign trust-laden
  state (session.internal_roles, session.google_groups, JWTs).
- _clear_registry_for_tests() now refuses to run unless TESTING=1 so a
  stray import path in production can't drop the registered capabilities.

Tests:
- 4 new tests in tests/test_role_resolver.py covering: stale-session
  contract after a mid-session mapping revoke (pin the documented
  limitation), PAT 403 detail wording, OAuth pipeline data flow from
  external groups to internal_roles, and the dev-bypass empty-list
  fallback when the resolver raises.

CHANGELOG.md updated under [0.11.3] (### Changed + ### Internal).
CLAUDE.md schema doc bumped from v7 to v8.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 23:49:10 +02:00
Petr Simecek
1c18cdf15f
release(0.11.2): LOCAL_DEV_GROUPS dev mock + Makefile defaults + docs/local-development.md (#70)
* feat(auth): mock session.google_groups in LOCAL_DEV_MODE via LOCAL_DEV_GROUPS

LOCAL_DEV_MODE auto-logged-in the dev user but left session.google_groups
empty, so group-aware UI/code paths can't be exercised on localhost without
a real Google OAuth round-trip. New LOCAL_DEV_GROUPS env var (JSON array
matching the production {id, name} shape) populates the session on every
dev-bypass request — same structure the OAuth callback writes, so mock and
prod stay in lockstep. Compare-then-write avoids spurious Set-Cookie noise
on PAT/CLI requests; malformed input falls back to [] with a WARNING so
the dev mock never breaks the dev flow.

* refactor(auth): fail-fast LOCAL_DEV_GROUPS at startup + cache + no-mutate

Three small follow-ups on the same dev-mock vector before merge:

- Validate LOCAL_DEV_GROUPS at app startup and report the parsed group IDs
  in the LOCAL_DEV_MODE banner. A malformed value now warns loudly at boot
  instead of silently logging on the first authenticated request, where
  it's easy to miss.
- Cache the parsed result single-slot, keyed by the raw env-string. Avoids
  re-parsing JSON on every authenticated request without test-isolation
  surprises — when the env value changes, the key changes and the cache
  transparently rebuilds.
- Stop mutating the parsed-input dicts (item.setdefault → spread-merge)
  so the cached list stays a fresh value on every rebuild.
- Replace the try/except guard around request.session with hasattr —
  SessionMiddleware is always registered, the silent except was paranoid.

Tests grow by a direct session-cookie inspection (decoupled from the
profile template) and three startup-banner log assertions.

* fix(auth): drop fragile session-decoder test + actually skip empty-target write

Two follow-ups on the LOCAL_DEV_GROUPS feature before merge:

- Drop test_session_holds_mocked_groups_directly. It manually decoded the
  signed session cookie via TimestampSigner + base64, hardcoding both the
  Starlette session-cookie format and the 14-day max_age. Starlette has
  changed its session encoding before (URLSafeTimedSerializer pre-0.20)
  and would do so again silently — the test would fail with a cryptic
  BadSignature, not a clear "mock is broken" signal. The remaining
  test_dev_user_sees_mocked_groups_on_profile already covers the same
  observable signal (mocked groups in /profile body) without coupling to
  Starlette internals.

- Actually skip the session write when target_groups is empty. The previous
  comment claimed compare-then-write avoided spurious Set-Cookie noise on
  PAT/CLI requests, but on those requests session.get("google_groups") is
  None and target is [], so None != [] always evaluates True and the write
  fired anyway, marking the session dirty and re-issuing Set-Cookie on
  every request. Adding `target_groups and ...` to the guard makes the
  comment honest: empty mock now genuinely no-ops, stable browser sessions
  still skip via value-equality, and the only remaining write is the one
  that actually changes state.

33 auth tests still pass locally.

* fix(auth): match production's always-write semantics for stale dev groups

Devin code-review finding on PR #70: my earlier `target_groups and ...`
short-circuit silently diverged from the production OAuth callback. In
app/auth/providers/google.py:189-194 the callback always writes
session.google_groups on each login — including [] on failure or empty
token — so the session always reflects authoritative current state. The
mock should match.

Failure mode the previous guard left open: a developer sets
LOCAL_DEV_GROUPS=[{...}] for a session, the groups land in the signed
cookie, then the developer unsets the env var and reloads. target → [],
session.get → [{...}], `if target_groups and ...` is False, no write,
stale groups stay in the browser session indefinitely. Mock now lies
about state until logout.

Fix splits the guard:
- target_groups truthy + value-changed → write the new mock (existing path)
- target_groups falsy + non-empty stored → write [] to clear stale state
- otherwise no-op (target [] + stored None/[]: no transition to record)

PAT/CLI requests with no prior session still take the no-op path
(target=[], session.get → None which is falsy), so the original goal of
suppressing spurious Set-Cookie noise on token traffic is preserved.

Tests already cover the populated and unset paths; the new clear-stale
branch is correct by construction (production has the same shape) and
the rare manual reset workflow.

* release(0.11.2): default mocked groups in make local-dev + docs/local-development.md

Cuts 0.11.2 around the LOCAL_DEV_GROUPS work plus a small dev-experience
follow-up: every `make local-dev` now boots with two sensible default
mocked groups (Local Dev Engineers + Local Dev Admins on example.com),
so /profile and group-aware code paths render something realistic
without the operator having to discover and set LOCAL_DEV_GROUPS.

Layered so the default lives in the workflow, not the contract:

- scripts/run-local-dev.sh seeds LOCAL_DEV_GROUPS via shell ":="
  syntax — only sets the var when the operator hasn't already.
  Override: LOCAL_DEV_GROUPS='[...]' make local-dev. Disable:
  LOCAL_DEV_GROUPS= make local-dev.
- docker-compose.local-dev.yml swaps the commented JSON example for
  a bare `- LOCAL_DEV_GROUPS` passthrough — the value comes from the
  shell, the compose file just propagates it. Operators running
  `docker compose up` directly without the wrapper script get an
  empty mock (correct: they didn't opt into the make-driven defaults).
- Makefile help line mentions the mocked groups so the behavior is
  visible without grepping.

New docs/local-development.md consolidates dev-onboarding instructions
that were previously scattered across docker-compose.local-dev.yml
inline comments, docs/auth-groups.md "Local-dev mock" section, the
Makefile help text, and CLAUDE.md "First-Time Setup". Single page now
covers TL;DR, what LOCAL_DEV_MODE actually bypasses, group mocking
controls + verification, what is *not* mocked (Cloud Identity, real
OAuth, admin Workspace permissions), and the safety rails that keep
the dev shortcuts off production.

Version bump 0.11.1 → 0.11.2 in pyproject.toml, CHANGELOG cuts
[Unreleased] → [0.11.2] — 2026-04-26 with a fresh empty [Unreleased]
skeleton.

* fix(local-dev): default LOCAL_DEV_GROUPS truncated by shell parameter expansion

Reported by an operator running `make local-dev` against the freshly
released 0.11.2 — the LOCAL_DEV_MODE banner showed:

    LOCAL_DEV_GROUPS is not valid JSON, ignoring:
    Expecting ',' delimiter: line 1 column 70 (char 69)
    LOCAL_DEV_GROUPS is set but produced no valid groups —
    check the WARNING above for the parse error.

Cause: the default value lived inside `${LOCAL_DEV_GROUPS:=…}` parameter
expansion. Bash matches `}` to close the expansion at the *first* `}`
encountered in the body, regardless of context — even one inside a
nested JSON object literal. The two-element JSON array was therefore
truncated to the first group's closing brace, leaving an unparseable
fragment:

    [{"id":"local-dev-engineers@example.com","name":"Local Dev Engineers"

There is no escaping syntax for `}` inside parameter expansion (the
backslash escapes I had only escaped the quotes — `}` reaches bash
literally). Fix: hold the default in a single-quoted variable and
reference it through `${LOCAL_DEV_GROUPS:-$DEFAULT_LOCAL_DEV_GROUPS}`.
The variable's value is opaque to the expansion — no `}` matching
inside it — so the JSON survives intact. Verified with `python -m json`:

    parsed OK: 2 groups: ['local-dev-engineers@example.com',
                          'local-dev-admins@example.com']

Operators on a running 0.11.2 stack: `make local-dev-down && make
local-dev` to pick up the corrected default.

* fix(local-dev): respect LOCAL_DEV_GROUPS= disable path + add 0.11.2 changelog link

Two follow-ups from a Devin code-review pass on PR #70:

- run-local-dev.sh: switch ${LOCAL_DEV_GROUPS:-$DEFAULT} to
  ${LOCAL_DEV_GROUPS-$DEFAULT} (no leading colon). The :- form
  substitutes the default when the variable is unset OR set-but-empty,
  silently overwriting the documented disable knob. Three places
  promise this works — docs/local-development.md, the CHANGELOG entry,
  and the script's own comment — so the bug was an operator-facing
  lie, not just an implementation detail. The bare - form only
  substitutes on unset, so `LOCAL_DEV_GROUPS= make local-dev` now
  reaches the Python parser as "" and short-circuits to []. Verified
  with both empty and unset shells.

- CHANGELOG.md: add the [0.11.2] link reference at the bottom.
  Keep-a-Changelog convention is to mirror every version heading
  with a release-tag link in the footer; the 0.11.2 heading was
  missing its counterpart, breaking the Markdown link rendering on
  GitHub.

---------

Co-authored-by: Claude <noreply@anthropic.com>
2026-04-26 16:48:55 +02:00
Petr Simecek
c25fd41bf7
feat(auth): Google Workspace groups on /profile + tag-triggered Keboola deploy workflow (#56)
* feat(auth): display Google Workspace groups on /profile

- Request cloud-identity.groups.readonly scope in Google OAuth
- Fetch groups via Cloud Identity API after callback; tolerate 4xx
  (non-Workspace tenants) and network errors — never break login
- Store result in Starlette session as google_groups
- Replace /profile redirect with a real profile page rendering
  account details (email, name, role) and the group list; show a
  friendly empty state when no groups are available
- Tests: helper parsing + 403 + exception paths; profile page
  smoke test; updated the old redirect test

* test: remove stale /profile redirect tests

Cherry-pick of Zdeněk's 4f7e4cd ("display Google Workspace groups on
/profile") replaces the /profile redirect with a real profile page —
but only updated one of three tests that expected the old behaviour.

These two tests in test_admin_tokens_ui.py and test_pat.py were left
asserting `/profile → 302 /tokens`, which now returns
`/profile → 302 /login?next=%2Fprofile` for unauth users (the standard
auth guard) or `/profile → 200 HTML` for authenticated users.

Removed both rather than patched — coverage for the new behaviour
already exists in tests/test_auth_providers.py (added by the same
commit). The /tokens render assertions in the deleted test_pat.py case
are redundant with test_admin_tokens_ui.py's own /tokens UI tests.

* fix(auth): Google groups search query needs parent + labels predicates

Cloud Identity Groups Search API returns 400 INVALID_ARGUMENT when the
CEL query lacks the required `parent == 'customers/<id>'` predicate AND
a `'<label>' in labels` membership predicate. Zdeněk's original 4f7e4cd
query had only `member_key_id == '<email>'` — every fetch silently
returned [] and the /profile groups list was always empty.

Fix: build the query with all three required pieces:
  parent == 'customers/my_customer'   (alias = caller's own Workspace
                                       org; no need to look up customer ID)
  member_key_id == '<email>'           (filter to this user's memberships)
  'cloudidentity.googleapis.com/groups.discussion_forum' in labels
                                       (Workspace mailing-list groups —
                                       the common case; security-group
                                       coverage is a follow-up)

Also: log the full error body (not truncated to 200 chars) and the
query string so the next time Google rejects something we can diagnose
in one log line instead of a re-deploy.

Caught when first agnes-dev login completed normally (HTTP 302) but app
log showed `Google groups fetch returned 400 for petr@keboola.com:
{"error":{"code":400,"message":"Request contains an invalid argument."}}`
on the same VM (kids-ai-data-analysis / agnes-dev.keboola.com).

Reference: https://cloud.google.com/identity/docs/reference/rest/v1/groups/search

* feat(web): add Profile link to user dropdown menu

The /profile page (Zdeněk's 4f7e4cd cherry-pick) renders a real profile
view including Google Workspace groups, but had no entry point in the
UI — users could only reach it by typing the URL manually. Add a
"Profile" menu item between the user header (email + role) and
"My tokens" so the page is discoverable.

Side effect: cleaned up the leftover `or _path.startswith('/profile')`
condition on the "My tokens" active class, which dated from the old
/profile → /tokens redirect (removed in c789617). Now each menu item
owns its own active state.

* fix: profile-link tests + .env quoting for CADDY_TLS

Two issues caught by Keboola's first agnes-dev deploy + agnes-auto-upgrade
cron run:

1. tests/test_web_ui.py — two negative assertions ("href=/profile" NOT in
   body) date from when /profile was a redirect-only stub. Now /profile
   is a real page (groups display) AND has a dropdown menu link, so the
   negative assertions flip to positive. Same for ">Profile<" text in
   the non-admin nav test.

2. startup-script.sh.tpl — CADDY_TLS line must be QUOTED in .env, because
   agnes-auto-upgrade.sh sources .env via `set -a; . .env; set +a` and
   bash treats `KEY=value with spaces` as `KEY=value` followed by `with`
   and `spaces` exec attempts. Symptom: cron log spam
   `/opt/agnes/.env: line 14: petr@keboola.com: command not found`,
   the cron exits non-zero, and no auto-upgrade ever happens. Caddy
   itself reads the value fine because docker-compose env_file=.env
   parses key=value properly without shell-evaluating the rest.

   Fix: emit `CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`.
   Both the cron source and docker-compose env_file accept the quoted
   form; cron stops failing.

* fix(auth): use searchTransitiveGroups + security label for non-admin user

Three bugs in the original cherry-pick + my prior fix attempt, all caught
by a stdlib probe script (scripts/debug/probe_google_groups.py) run
locally with a Playground-issued OAuth token:

1. Wrong endpoint. `groups:search` is the admin "find groups in org"
   endpoint and 400s for non-admin users regardless of query. Switched
   to `groups/-/memberships:searchTransitiveGroups` which is the
   user-perspective "what groups am I in" endpoint.

2. Wrong label. Querying with `cloudidentity.googleapis.com/groups.discussion_forum`
   returns 403 "Insufficient permissions to retrieve memberships" even
   on the new endpoint — Workspace policy denies non-admin reads of
   discussion-forum groups. Switching to `groups.security` returns 200
   with the actual membership list. Empirically every Workspace group
   at Keboola carries BOTH labels, so the security filter sees the full
   set anyway. Confirmed with the probe script.

3. Wrong response shape. `searchTransitiveGroups` returns
   {"memberships": [...]}, not {"groups": [...]}. Parser updated
   accordingly.

Also adds scripts/debug/probe_google_groups.py — stdlib-only standalone
probe that hits 6 candidate endpoints with a user OAuth token. Saved a
deploy cycle (~10 min) per query iteration; future API-syntax debugging
should start there.

Verified end-to-end: petr@keboola.com login on agnes-dev returns 5
groups (LIC-1PASSWORD, ROLE_ATLASSIAN_*, etc.) via the probe; once
deployed, the same will populate session["google_groups"] and render
on /profile.

* test(auth): update Google groups parser fixture to match searchTransitiveGroups shape

Mock payload was `{"groups": [...]}` (the shape `groups:search` returns).
After switching to `groups/-/memberships:searchTransitiveGroups` in the
prior commit, the actual response is `{"memberships": [...]}` and the
parser iterates that key. Test now mirrors the real shape.

The per-item structure (groupKey.id + displayName) is unchanged, so the
expected output dict stays the same: [{"id": "...", "name": "..."}].

* docs(auth): add docs/auth-groups.md — Google Workspace groups runbook

Captures the non-obvious bits: the GCP-side setup checklist (Cloud
Identity API + scope on consent screen + Internal user type), the
`security` vs `discussion_forum` label trap (the latter 403s for
non-admins, the former 200s — one of those is a 4-iteration debug
session and shouldn't have to be repeated), where groups are stored
(session, not DB) and how to refresh (re-login), plus how to use the
probe script for future API-syntax issues.

Deliberately stops short of explaining "what is Cloud Identity" or
"what is OAuth scope" — those belong in Google's own docs, not ours.

* docs(claude): document release workflows + module versioning + recreate trick

New "Release & deploy workflows" section in CLAUDE.md covers what didn't
exist anywhere in the repo before:

- Distinction between release.yml (auto-build per push) vs the new
  keboola-deploy.yml (tag-triggered, explicit deploy only) — plus when
  to use which (per-developer convenience vs shared dev VM safety)
- Module versioning (infra-vX.Y.Z) and the bump-after-merge dance
- The lifecycle.ignore_changes [metadata_startup_script] gotcha and how
  to force a recreate via workflow_dispatch's recreate_targets input

All generic — no customer hostnames, project IDs, IPs. Customer-specific
deploy steps belong in the consuming infra repo's README.

Also: cross-reference docs/auth-groups.md from the Authentication
section so future Claude sessions find the Workspace-groups runbook
without grepping.

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-26 00:56:44 +02:00
Vojtech
0bbbf3e40b
feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)
Replaces the implicit Let's Encrypt flow with a general corporate-CA HTTPS path:

- Caddy switches to cert-file mode (`tls /certs/fullchain.pem /certs/privkey.pem`) with HSTS + TLS 1.2/1.3 floor
- New `docker-compose.tls.yml` overlay closes host `:8000` when Caddy fronts (no TLS bypass)
- New `scripts/tls-fetch.sh` — generic URL fetcher for `sm://`, `gs://`, `https://`, `file://` with redirect refusal + PEM validation
- New `scripts/grpn/agnes-tls-rotate.sh` — daily rotation, self-signed fallback against same key (zero key churn), on-VM RSA-2048 + CSR auto-gen, atomic swap, SIGUSR1 reload
- `scripts/grpn/agnes-auto-upgrade.sh` becomes cert-aware (auto-enables tls overlay when certs present)
- Compose profile `production` renamed to `tls` (aligns with DEPLOYMENT.md and infra startup)

Pairs with FoundryAI/agnes-the-ai-analyst-infra#27 (merged) which wires per-VM `local.vm_tls`, writes `TLS_*` env vars into `.env`, auto-creates Secret Manager containers for `sm://` privkey URLs, and installs `agnes-tls-rotate.{service,timer}` for daily polling.

Includes hardening + docs follow-ups from code review:
- `TLS_CSR_SUBJECT` env-var parametrisation applied to both CSR and self-signed cert paths
- curl `--max-redirs 0 --proto '=https'` + post-fetch PEM validation in `tls-fetch.sh`
- `ulimit -c 0` + array-form `COMPOSE_FILES` (POSIX-safe, bash 3.2 compatible)
- TLS section added to `config/.env.template`
- Historical-note headers in `docs/superpowers/{plans,specs}/2026-04-09-*.md` flagging the profile rename
2026-04-25 19:51:25 +00:00
ZdenekSrotyr
1381770057
fix(auth): uvicorn --proxy-headers + Google OAuth doc + vendor-agnostic OSS rule in CLAUDE.md (#39)
* fix(compose): pass --proxy-headers to uvicorn so OAuth callbacks resolve to https

When the app runs behind a reverse proxy (Caddy, nginx, Cloudflare Tunnel),
uvicorn's default policy of trusting X-Forwarded-* only from 127.0.0.1 means
the request the container sees still looks like http://localhost:8000/...,
even when the user is on https://. The OAuth provider then sends Google a
callback URL Google has never seen — Error 400: redirect_uri_mismatch.

--proxy-headers + --forwarded-allow-ips '*' tell uvicorn to honor those
headers from any source. The container only ever sees its own docker network
anyway; trusting it everywhere is safe in this deployment shape.

Adds docs/auth-google-oauth.md with the full operator gotcha list — env
vars that have to be set, instance.yaml fields that silently fall back to
defaults, and the DB workaround for ad-hoc role promotion when
SEED_ADMIN_EMAIL was missed on first boot.

* docs(claude): codify vendor-agnostic OSS rule for AI agents and humans

Adds a "Vendor-agnostic OSS" section to CLAUDE.md spelling out what cannot
land in this repo (specific deployments, internal hostnames/projects, cross-
references to private repos, customer-specific paths) and how to phrase
abstractions instead. Plus a pre-PR grep checklist in the existing "Git
Commits & Pull Requests" section.

This trips up agents and humans alike — the previous version of #39 had
private-deployment references in the body and a customer domain in a doc
example. Surfacing the rule once in the file every Claude/Cursor/Aider
session reads should prevent that on the next PR.

* docs(oauth): cover DOMAIN + SERVER_URL env vars introduced by PR #48

PR #48 (merged) added DOMAIN-gated Secure cookie in google.py and
documented SERVER_URL in .env.template, but this operator doc was
drafted before that merge and didn't reference either variable.
Adding both to the env table and extending the common-failure-modes
table with a sticky-cookie / redirect-URI-mismatch entry that
references SERVER_URL as the host-header-independent fix. Also
aligns the compose command snippet with the `='*'` syntax that
actually ships on main post-PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Vojtech Rysanek <vrysanek@groupon.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 07:07:33 +00:00
Petr Simecek
f593a151fc
docs(security): add padak-security.md audit report (#35)
* docs(security): add padak-security.md — full audit report from 2026-04-22

Four-agent audit (secrets/SQLi/authz/SSRF, auth flows, UI wiring, data layer)
deduped into one document. Top 5 to fix first, second/third/fourth tier by
real exploitability, verified non-issues so we don't re-open them, and
coverage gaps where automated scanners / pytest / Jira connector / infra
were not touched.

Missing /auth/password/reset is already tracked in
padak/keboola_agent_cli#206; other top items (script sandbox RCE,
rate-limit, backslash open-redirect, SSRF) still need their own issues.

* docs(security): rephrase methodology description

Replace "four parallel agents" with "parallel review passes over four scope
areas" — same meaning, removes the overlap with agentic-AI terminology.
2026-04-22 16:31:13 +02:00
ZdenekSrotyr
d2c76cb221
User management + PAT + CLI distribution + HTML auth redirect (#9 #10 #11 #12) (#28)
* fix: redirect unauthenticated HTML routes to /login (#10)

* docs(plan): user mgmt + PAT + CLI distribution implementation plan (#9 #10 #11 #12)

* build(docker): produce wheel artifact for /cli/download (#9)

* feat(db): schema v5 — users.active + deactivated_at/by (#11)

* feat(api): /cli/download wheel + /cli/install.sh with baked server URL (#9)

* feat(users): repository supports active flag + count_admins (#11)

* feat(ui): /install page with per-deployment install instructions (#9)

* feat(api): user PATCH/reset-password/set-password/activate/deactivate (#11)

* fix(cli): da login prompts for password and sends it in body (#9)

* test(api): safeguard tests for self-deactivate and last admin (#11)

* feat(auth): reject requests from deactivated users (#11)

* fixup(#10): propagate next through /login buttons + lock down sanitizer tests

* feat(cli): da admin set-role/activate/deactivate/reset-password/set-password (#11)

* feat(ui): /admin/users management page (#11)

* feat(db): schema v6 — personal_access_tokens (#12)

* feat(users): access_tokens repository (#12)

* feat(auth): JWT carries typ (session|pat) and explicit jti (#12)

* feat(auth): reject revoked/expired PATs; update last_used_at (#12)

* feat(api): /auth/tokens CRUD + admin revoke; session-only guard (#12)

* feat(cli): da auth token create/list/revoke (#12)

* feat(ui): /profile page with PAT create/list/revoke (#12)

* docs: PAT usage and session/PAT TTL clarification (#12)

* feat(auth): PAT first-use-from-new-IP audit + last_used_ip (schema v7) (#12)

Closes remaining acceptance gap from issue #12: audit_log entry on first use
of a PAT from an IP that differs from the recorded last_used_ip.

- schema v7: personal_access_tokens.last_used_ip column
- AccessTokenRepository.mark_used now stores the client IP
- get_current_user extracts client IP (X-Forwarded-For first hop, fallback
  to request.client.host) and emits a token.first_use_new_ip audit when the
  IP changes on a subsequent use (not the very first use)
- tests: new-ip audit, same-ip no-op, first-ever-use no-op, schema v7 column

* fix: address Devin review findings on PR #28

- app/main.py: exclude /auth/* from HTML redirect handler so JSON
  endpoints under /auth/ (PAT CRUD used by `da auth token` CLI) keep
  their 401 JSON contract (Devin #1, bug)
- app/api/tokens.py: reject expires_in_days <= 0 explicitly; use
  `is not None` so 0 no longer silently creates a non-expiring token
  (Devin #2)
- app/api/users.py: validate role against Role enum in create_user
  to match update_user and prevent 500 on role-protected requests
  later (Devin #3)
- app/web/templates/admin_users.html: escape user-supplied strings
  before innerHTML; move onclick handlers to addEventListener via
  data attributes so emails with quotes / HTML no longer break the UI
  or enable stored XSS (Devin #4)
- app/auth/router.py, app/auth/providers/{password,google}.py:
  reject deactivated users at login instead of issuing a JWT that
  would then fail on the next request — removes the confusing
  redirect loop (Devin #5)
- CLAUDE.md: document schema v7 instead of stale v4 (Devin #6)
- tests/test_web_ui.py: regression test for the /auth/* JSON 401

* feat(web): add /profile and /admin/users links to dashboard nav

* feat(web): point setup banner at /install page

* chore(web): drop unused setup_instructions context

* fix: address Devin review round 2 on PR #28

- app/api/tokens.py: when expires_in_days is None (the "never" option),
  use a ~100-year JWT expiry so the token doesn't silently die in 24h
  via the session-default fallback in create_access_token. The real
  expiry enforcement stays in verify_token's DB-level check (Devin 🔴)
- app/web/templates/profile.html: escape t.name and other user-supplied
  strings via esc() helper before innerHTML, same pattern as
  admin_users.html. Move revoke onclick to data-attribute +
  addEventListener (Devin 🟡)
- app/api/cli_artifacts.py: use `mktemp -d` with X's at end of template
  for GNU/BSD portability, place wheel inside the temp dir and
  clean up with rm -rf (Devin 🚩)

* feat(web): redesign /install page; make curl one-liner primary, collapse manual

Rebuild the public /install page using the dashboard visual language
(shared header, card layout, gradient hero, design tokens from
style-custom.css). The page is now anchored on the one-liner install
path: curl -fsSL <server>/cli/install.sh | bash is rendered as the
primary, prominent step 1, while the old manual wheel-download flow
is tucked behind a closed-by-default <details> block for users in
restricted/offline environments.

Information architecture:
  hero (server URL + version)
  -> step 1: quick install (one-liner, big Copy button)
  -> step 2: create PAT on /profile + export DA_TOKEN / da auth whoami
  -> step 3: Claude Code / MCP via ~/.config/da/token.json
  -> collapsed "Manual install" details for download-wheel flow
  -> footer link to docs/HEADLESS_USAGE.md

Every shell snippet has a vanilla-JS "Copy" button that confirms
visually ("Copied!" for 1.5s) and falls back to textarea+execCommand
on non-secure contexts. No new dependencies, no bundler.

The route now also pulls an optional user so the header shows the
same nav (Dashboard / Profile / Logout) as dashboard.html when a
session exists, while staying fully public when signed out.

* fix(cli): use real wheel filename in install.sh (broken pip/uv install)

The installer wrote the downloaded wheel as agnes_cli.whl, which lacks a
PEP-427 version component — both pip and uv tool install reject it and
abort the one-liner.

Use curl -OJ so Content-Disposition determines the on-disk filename, then
resolve it via glob. Install an EXIT trap to remove the tmpdir even when
install fails.

* fix(web): correct manual install wheel glob and add PEP 668 / PATH hints

- Wheel glob is agnes_the_ai_analyst-*.whl (not agnes-*.whl) — the old
  pattern never matched the real artefact name from the build.
- Add — or — separator between uv tool install and pip install.
- Warn that pip install --user is blocked on macOS Homebrew / modern
  Debian (PEP 668) and recommend uv tool install as the default path.
- Both flows now show the ~/.local/bin PATH hint so a fresh shell can
  find the da binary after install.

* fix(web): consistent session.user reference in install header

The avatar-letter fallback inside {% if session.user %} was reading
user.name / user.email directly, but the route dependency can pass
user=None — those references resolved to an empty FlexDict and produced
an empty avatar circle. Read everything through session.user to match
the guard and the dashboard pattern.

* fix(web): point headless usage link at GitHub source

/docs/HEADLESS_USAGE.md 404s — no static route serves repo docs. Point
the footer link at the rendered markdown on GitHub instead of adding a
dedicated docs serving route just for one file.

* feat(web): /install hero size, anon sign-in banner, step 2 copy polish

- Bump hero h1 from 26px to 30px to match dashboard primary scale.
- Anonymous visitors see a small sign-in banner above Step 2 (creating
  a token requires auth; without the banner the flow appears stuck).
- Add an 'After generating your token' section label inside Step 2 so
  the /profile CTA button no longer looks wedged mid-sentence between
  adjacent paragraphs.

* chore(web): /install a11y + version pill polish

- aria-live='polite' on copy buttons so screen readers announce the
  'Copied!' state change.
- Replace redundant INSTANCE_NAME eyebrow (already in the header logo)
  with 'Getting started'.
- Hide the version pill when AGNES_VERSION is unset/'dev' — avoids the
  misleading 'vdev' label in local/unbuilt runs.
- Manual summary focus-visible outline-offset +2px (was -2px which
  clipped inside the card), and mark the chevron as decorative.

* fix(web): use session.user in dashboard avatar fallback

Inside {% if session.user %} guard, the avatar fallback referenced
(user.name or user.email). If user is None the block crashes when
the profile picture is absent. Align with the guard variable.

* fix: address Devin review round 3 on PR #28

- app/api/users.py: stop auto-sending email from reset_password. The
  magic-link sender would deliver a "Login Link" that — when clicked —
  consumes the reset_token via verify_magic_link and logs the user in
  WITHOUT prompting for a new password. Admins now share the raw
  reset_token from the API response manually, or use set-password
  directly. email_sent is always False. Documented inline. (Devin 🟡)
- app/api/cli_artifacts.py: harden /cli/install.sh generation against
  shell injection via Host header or AGNES_VERSION. base_url is
  validated against a strict scheme+host+port regex; version against
  an alnum + dot/dash/underscore allowlist. Both values are also
  piped through shlex.quote() as defense in depth. (Devin 🟡)

The shared users.reset_token column between magic-link and password-
reset flows (Devin 🚩) remains an architectural gap; splitting into
separate columns needs schema v8 and is tracked for a follow-up PR.

* docs, chore(grpn): manual-deploy helpers + hackathon deploy learnings

Adds scripts/grpn/ — Makefile + agnes-auto-upgrade.sh + README for
operating Agnes on GRPN's existing foundryai-development VM when the
full Terraform flow is blocked by org policies:

- iam.disableServiceAccountKeyCreation (org constraint) forbids SA
  JSON keys, so GCP_SA_KEY-based CI is unavailable
- No projectIamAdmin delegation → bootstrap-gcp.sh can't grant roles
- Secret Manager IAM bindings require setIamPolicy which editor lacks

Helper targets: deploy, deploy-tag, recreate, restart, stop, start,
status, version, logs, ps, env, ssh, tunnel, open, bootstrap-admin,
set-data-source, install-cron, uninstall-cron.

docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md — running
log of all org-policy constraints hit during the hackathon deploy,
with workarounds and derived follow-ups (WIF support, external_ip
variable, customer onboarding IAM checklist).

Not a replacement for the TF flow — stopgap until WIF lands.

* fix(web): make header logos clickable links to home

* feat(web): one-click "Setup a new Claude Code" button

Adds a single-button flow on the dashboard and /install page that
generates a fresh personal access token via POST /auth/tokens and
copies a complete, paste-ready setup script (server URL, token,
install/verify commands) to the clipboard. Falls back to a modal
textarea when the clipboard is blocked; redirects to /login on 401;
surfaces backend errors inline.

- dashboard.html: replaces the top "Set up your local environment"
  anchor with a real button wired to setupNewClaude(). Removes the
  duplicate bottom setup banner to keep a single entry point.
- install.html: for signed-in users, Step 1 leads with the one-click
  button and demotes the curl one-liner into a collapsible "Or run
  manually" aside. Anonymous visitors still see the curl flow plus a
  sign-in hint.
- No new deps. Vanilla JS. Token lives in memory/clipboard only —
  never rendered into persistent DOM.

* feat(cli): add "da auth import-token" for non-interactive PAT login

Writes a provided JWT into ~/.config/da/token.json using the canonical
{access_token, email, role} shape expected by save_token(). Decodes the
token locally to pull email/role claims, verifies it against the server
via GET /api/catalog/tables, and refuses to overwrite an existing token
file if the server returns 401. --email / --role overrides exist for
tokens missing those claims; --skip-verify bypasses the server round-trip
for offline / CI scenarios.

* test(cli): cover da auth import-token success + 401 + claim-fallback paths

Three new tests in TestAuthImportToken:
- valid JWT + 200 -> canonical token.json written
- 401 from /api/catalog/tables -> exit 1, existing token file untouched
- JWT without email/role claims -> refused without overrides, accepted
  with --email / --role flags

* feat(web): update one-click Claude setup instructions — explicit uv install, import-token, skills question

Replaces the fragile `cat > token.json <<EOF` clipboard payload with an
explicit, auditable sequence:

  1. `curl -fsSL /cli/download` + `uv tool install --force` (no opaque
     `curl | bash`).
  2. `da auth import-token --token ...` instead of hand-written JSON.
  3. Explicit PATH persistence for zsh/bash.
  4. A required question to the user about whether to copy the bundled
     skills into ~/.claude/skills/agnes/ or pull them on-demand via
     `da skills show`.
  5. A final confirmation step with whoami + version output.

Factored both pages to include a shared partial
(app/web/templates/_claude_setup_instructions.jinja) so dashboard.html
and install.html can never drift apart again. {server_url} and {token}
stay as runtime placeholders substituted by renderSetupInstructions().

* feat(ui): modernize /admin/users + unify header nav across pages

- New shared partial app/web/templates/_app_header.html — single source
  of truth for the top navigation. Used by base.html and dashboard.html
  (which doesn't extend base.html). Active page highlighted via
  request.url.path. Admin "Users" link gated by session.user.role.
- style-custom.css: add .app-header / .app-nav-link / .app-btn-logout /
  .app-avatar styles (mirrors dashboard's previous inline copy under
  app-* prefix). Mobile-friendly fallback at <720px.
- base.html: include the new partial so every page extending base
  (admin_users, profile, login_email, error, …) gets the same chrome
  the dashboard has.
- dashboard.html: replace its inline <header class="header"> markup
  with the shared partial. Inline .header CSS left in place as
  harmless dead code (separate cleanup PR).
- admin_users.html: rewritten with avatars, role pills (color-coded
  per role), toggle switch for active, search/filter input, toast
  notifications, modal dialogs replacing alert/confirm/prompt,
  one-click copy for the reset token, empty / loading states.
  All XSS-safe via the existing esc() helper + data-attribute
  event delegation.
- tests/test_web_ui.py: smoke test that /admin/users renders the new
  shared header chrome and the modernized markup.

* feat(api): serve CLI wheel at /cli/agnes.whl for direct uv install

uv tool install inspects the URL path suffix to recognise a wheel, so
/cli/download (which has no .whl suffix) cannot be installed directly.
Expose a stable /cli/agnes.whl alias over the same wheel lookup so users
can run: uv tool install --force https://<server>/cli/agnes.whl

* test(cli): cover da auth import-token --server persisting to config.yaml

The server persistence was already implemented in the import-token command
(save_config({server}) call) but not covered by tests. Add an explicit test
so the one-step setup contract — single import-token call writes both token
and server — cannot regress.

* feat(web): simpler Claude setup — single uv install URL, single import-token call

User feedback: the prior clipboard payload repeated the server URL and
token across multiple steps (curl + tmpfile + install + rm + separate
seed-config + import-token). Collapse to:

 1. uv tool install --force {server_url}/cli/agnes.whl  (single URL, direct)
 2. da auth import-token --token ... --server ...        (one call, persists both)
 3. da auth whoami
 4. skills (ask user first)
 5. confirm

uv accepts HTTPS URLs that end in .whl and installs them directly, so
the tmpfile dance is unnecessary. import-token --server already persists
the server to config.yaml, so no separate printf > config.yaml step.

* fix(tests): update admin users heading assertion after template rename

The admin_users.html template now uses <h2 class="users-title">Users</h2>
instead of <h2>User management</h2>. Update the assertion to match.

* feat(ui): unify header across remaining 7 standalone pages

These 7 pages render their own full <html> and don't extend base.html,
so the previous unification commit only covered base + dashboard. Each
had its own ad-hoc <header> markup with inconsistent classes
(.top-header / .header / .page-header), inconsistent nav-link sets,
and inconsistent avatar/email styling.

Replace each inline <header>...</header> block with the shared
{% include '_app_header.html' %} so /activity-center, /admin/permissions,
/admin/tables, /catalog, /corporate-memory, /corporate-memory/admin,
and /install all show the same chrome (Dashboard / Install CLI /
Profile / Users / email + avatar / Logout) with the active page
highlighted via request.url.path.

Old inline header CSS (.header, .top-header, .page-header, .nav-link,
etc.) is left in place as harmless dead code; it can be cleaned up in
a follow-up sweep.

* feat(web): add readable preview of Claude setup payload on dashboard + /install

Move the line-by-line setup instructions into app/web/setup_instructions.py
as the single source of truth, then render them in two modes from the
existing _claude_setup_instructions.jinja partial:

- preview_mode=True  → visible, read-only <pre><code> block with the real
  server URL and a clearly-styled placeholder token (never a real one).
- preview_mode=False → the JS SETUP_INSTRUCTIONS_TEMPLATE used by the
  one-click flow (unchanged behaviour).

Both /dashboard (env-setup-cta card) and /install (Step 1 card) now show
the preview directly under the 'Setup a new Claude Code' button so users
can see exactly what will land in their clipboard before they click.

* feat(web): update setup instructions — `da diagnose` step, explicit section titles

Rework the Claude Code setup payload to:

- Give every numbered step an unambiguous verb header ("1) Install the CLI",
  "2) Log in", "3) Verify the login", "4) Run diagnostics", "5) Skills (ask
  the user first)", "6) Confirm").
- Add step 4 `da diagnose` as the post-login health check. The CLI already
  ships this command (cli/commands/diagnose.py); it prints "Overall:
  healthy" and a list of green checks that map cleanly to next actions.
- Ask the skills copy-vs-on-demand question verbatim so Claude Code always
  prompts the user the same way.
- Replace the terse "Confirm" line with a 4-bullet summary (version,
  whoami, skills choice, diagnose status) so the return message is
  structured and comparable across setups.

* chore(web): remove stale MCP card from /install (no MCP server today)

The 'Use with Claude Code / MCP' card (Step 3 on /install) referenced an
MCP integration Agnes does not ship. Remove the whole card. The one-click
'Setup a new Claude Code' flow in Step 1 already covers the long-lived
client use case and is less confusing than dangling persistence tips for
a non-existent integration.

* feat(api): include user_email + last_used_ip + user_id in admin tokens list response

Adds AdminTokenItem response model (superset of TokenListItem) and
AccessTokenRepository.list_all_with_user() joining personal_access_tokens
with users to denormalize user_email. Needed for /admin/tokens UI where
admins triage tokens across all users.

* feat(web): /admin/tokens page — list, filter, search, revoke across all users

Adds a new admin-only page with client-side filtering (status, user email,
last-used window), column sorting, counts bar (active/revoked/expired),
and an inline revoke action. Mirrors the /admin/users visual language.

* feat(web): add Tokens nav link for admins + deep-link from admin/users row

Admin-only nav entry to /admin/tokens, and a per-row Tokens button on
/admin/users that prefills the token page's user filter via ?user=<email>.

* test(admin): cover /admin/tokens rendering, filter state, non-admin denial, revoke

Verifies admin can render the page (title + JS hooks present), a non-admin
is blocked, unauthenticated users are redirected, the admin list response
includes user_email / user_id / last_used_ip, and admin can revoke another
user's token.

* feat(web): modern redesign of /admin/tokens — hero, stat strip, refined table, responsive cards, a11y

* feat(web): ditch the table — /admin/tokens as a card stack, modern GitHub-style list

Replaces the table-based layout with a stack of self-contained token cards
inside a <ul role=list>. Each card is a flex row: avatar + name/meta on the
left, last-used block in the middle, status pill + outlined 'Revoke' button
on the right. Status and sort controls are pill-shaped toggle chips; user
email search has an inline search icon. No <table>/<tr>/<th>/<td> anywhere.
Responsive below 720px (card stacks vertically) and 480px (stat chips 2x2).
Preserves filter IDs (flt-status, flt-user, flt-last-used) and data-revoke
for existing tests.

* feat(web): add /tokens (role-aware) — single page for both user PAT CRUD and admin overview

- Rename admin_tokens.html -> tokens.html with a new is_admin context flag.
- New route GET /tokens: renders the same card-stack UI for everyone.
  * Admins: loads /auth/admin/tokens, shows owner column + stat strip, keeps
    the owner-email search box and sort-by-owner chip.
  * Non-admins: loads /auth/tokens (own tokens only), hides owner column +
    stat chips, adds a 'New token' CTA in the hero that opens a modal
    (name + expires_in_days) calling POST /auth/tokens. The raw token is
    revealed once in a dismissable banner and cleared from the DOM on Hide.
- GET /admin/tokens now 302-redirects to /tokens, preserving query string
  (so the /admin/users deep-link ?user=foo still works).

* feat(web): /tokens full-bleed layout to match dashboard width

The hero, toolbar, and card list used to sit inside base.html's .container
(max-width 800px). Break out with negative horizontal margins so the page
spans the viewport like /dashboard does, capped at 1440px for readability
on very wide screens with a 24px gutter on each side.

- No change to base.html itself. The override is scoped to .tokens-page.
- body { overflow-x: hidden; } guards against rare horizontal scrollbars.
- < 808px viewport: reset to natural flow (mobile already narrower).
- ≥ 1488px viewport: cap to 1440px and re-center.

* chore(web): remove /profile template + nav link (redirect /profile -> /tokens)

The old /profile PAT CRUD page is now redundant — the modern /tokens page
covers both user and admin flows. Delete the template; the router's
/profile handler already 302-redirects to /tokens.

Nav cleanup:
- Remove the 'Profile' link.
- Show a single 'Tokens' link to every signed-in user (previously only
  admins saw it).
- Active-state matches /tokens, /admin/tokens, and /profile so the
  highlight survives the redirect chain.

/install CTA now points at /tokens instead of /profile.

* test: cover /tokens for admin + non-admin flows, /profile redirect, nav update

tests/test_admin_tokens_ui.py
- Point admin rendering test at /tokens directly and tighten assertions
  (admin-only stat strip + owner search, non-admin CTA absent).
- Add test_non_admin_can_render_tokens_page: personal body, New-token CTA,
  create-modal, reveal banner; stat strip + owner search absent.
- Add test_admin_tokens_redirects_to_tokens: 302 to /tokens, query string
  (?user=...) preserved for the /admin/users deep-link.
- Add test_profile_redirects_to_tokens: 302 to /tokens.
- Add test_non_admin_can_create_pat_via_tokens_page_api: exercises the
  POST /auth/tokens call that the non-admin create-modal submits.

tests/test_pat.py
- test_profile_page_renders -> test_profile_page_redirects_to_tokens:
  assert the 302 + that /tokens lands on the unified non-admin body.

tests/test_web_ui.py
- admin_users nav assertion: 'Tokens' link present, 'Profile' link absent.
- Add test_nav_shows_tokens_link_for_non_admin: non-admins see the same
  'Tokens' link (previously only admins did).
- Add test_profile_redirects_to_tokens back-compat check.

* feat(web): collapse 'What Claude Code will receive' by default

The preview block on /dashboard and /install now uses <details>/<summary>
so it is hidden by default. Click the chevron/title to expand and review
the clipboard payload. Markup stays in the DOM so existing tests that
assert on content continue to pass.

* fix(web): /tokens width — override .container to 1280px like dashboard

The negative-margin full-bleed trick was fragile and pushed content past
the right edge on deployed viewports. Replace with a simple max-width
override of base.html's .container on this page only, matching
/dashboard's 1280px center-column layout.

* feat(web): split role-aware /tokens into my_tokens.html + admin_tokens.html

* feat(web): router — separate handlers for /tokens (own) and /admin/tokens (all)

* feat(web): nav — show Tokens for all, add All tokens for admins

* test: cover split token pages (own vs all) + admin access gating

* feat(web): move 'My tokens' into a user dropdown menu

Replaces the separate Tokens/email/Logout nav trio with a rounded
avatar trigger that opens a dropdown containing the user's email,
role, a 'My tokens' link, and Logout. Admin-only 'All tokens' stays
as a top-level nav item since it's an admin function, not a personal
one. Click-outside and Escape close the panel; chevron rotates on
open.

* fix(api): allow PATs to list/get/revoke their own tokens (CLI flow)

The documented 'da auth token list/revoke' CLI flow in
docs/HEADLESS_USAGE.md uses a PAT, but the previous dependency
(require_session_token) returned 403. Only create_token must be
session-only to prevent PAT-spawning-PAT chains; listing and
revoking your own tokens is safe with a PAT.

* fix(api): cap expires_in_days at 3650 to avoid datetime overflow (500 to 400)

Values above ~11 million days overflowed datetime.max in
datetime.now(utc) + timedelta(days=...) and surfaced as an
unhandled OverflowError → 500. Cap at 10 years with a clear
400 instead; the no-expiry code path is unaffected.

* fix(api): relax _SAFE_URL_RE to allow path prefixes, underscores, and IPv6

The previous regex rejected legitimate reverse-proxy base_url values
(https://host/agnes/), underscores in Docker Compose hostnames, and
IPv6 literals (http://[::1]:8000). Widen the charset and allow an
optional trailing path. shlex.quote continues to provide
defense-in-depth against any metacharacter that slips through.

* fix(web): /login/email and Google OAuth propagate next_path

Previously, /login/email silently dropped the ?next=<path> query
param so the hidden form field rendered empty and login always
landed on /dashboard. Google's button was hard-coded to
/auth/google/login, ignoring next entirely.

- /login page now appends ?next to the Google button URL
- /login/email reads + sanitizes next, passes as template context
- google_login stashes sanitized next_path in session['login_next']
- google_callback pops + re-sanitizes and redirects there

Sanitization factored into app/auth/_common.safe_next_path.

* fix(auth): differentiate argon2 VerifyMismatchError from internal errors in web login

The previous except (VerifyMismatchError, Exception) collapsed both
cases into the generic 'invalid credentials' redirect, silently
hiding corrupted-hash / library errors from ops. Split the two:
bad password still gets ?error=invalid; anything else logs via
logger.exception and redirects with ?err=auth_internal so ops have
a visible signal and users don't retry forever against a broken
password_hash column.

* docs: correct CLAUDE.md table name (personal_access_tokens)

v7 note referenced 'access_tokens.last_used_ip' but the real table
is personal_access_tokens (as mentioned two tokens earlier in the
same bullet). Same-file consistency fix.

* chore(web): clarify admin user-reset UI — encourage Set password over the unused reset_token

POST /api/users/{id}/reset-password stores and returns a token
but no endpoint consumes it — the magic-link sender would log the
user in without prompting for a new password, defeating the reset.
- Drop the 'Reset' row action from admin_users so admins aren't
  pointed at a dead end.
- Rewrite the reveal-modal copy to tell admins to use Set password
  and explicitly note that the magic-link flow isn't available
  for reset tokens in this build.
The API endpoint stays for API-level future use.

* test: cover PAT CLI flow, expires_in_days overflow, proxy base_url, next propagation

- tests/test_pat.py: PAT can list own tokens (200, was 403);
  PAT can revoke own tokens (204); create_token returns 400 for
  expires_in_days > 3650 (was 500 via datetime overflow).
- tests/test_cli_artifacts.py: _SAFE_URL_RE accepts reverse-proxy
  path prefixes, underscores, and IPv6 literals; end-to-end check
  of cli_install_script with a stubbed base_url that includes
  a path prefix (Agnes behind /agnes/).
- tests/test_web_ui.py: /login propagates ?next to the Google
  button URL; /login/email renders next in the hidden form field
  and strips hostile values; unit coverage of safe_next_path.

* fix(security): use \Z instead of $ in URL/version allowlists (trailing-\n bypass)

Python regex `$` also matches just before a trailing newline, so a Host
header or AGNES_VERSION value like "good.example.com\n$(rm -rf /)"
would slip past the allowlist. `\Z` anchors to strict end-of-string.

shlex.quote downstream remains as defense-in-depth, but the allowlist
is now the tight gate it claims to be.

* fix(auth): PAT with null expiry omits JWT exp claim (DB is the source of truth)

Previously a PAT created with `expires_in_days=null` (user-requested
"never expires") set the DB `expires_at` to NULL (correct) but still
baked a ~100y `exp` claim into the JWT. That is misleading: the PAT
silently did expire eventually, despite the UI and API promising
"no expiry".

`create_access_token` now accepts `omit_exp=True` to skip the `exp`
claim entirely. `app/api/tokens.py` passes that when `expires_in_days
is None`. The authoritative expiry check lives in
`app/auth/dependencies.py`, which reads `expires_at` from the DB row —
unchanged. PyJWT accepts claim-less JWTs indefinitely.

* test: cover trailing-newline regex bypass + no-exp JWT for unbounded PAT

- test_safe_url_re_rejects_trailing_newline_bypass: asserts both
  `_SAFE_URL_RE` and `_SAFE_VERSION_RE` reject values with a trailing
  `\n` (previously accepted because Python `$` matches before `\n`).
- test_pat_null_expiry_jwt_has_no_exp_claim: POST /auth/tokens with
  `expires_in_days=null`, decode the returned JWT, assert `exp` is
  absent while `typ=pat`, `sub`, and `jti` are still present.
- test_pat_with_null_expiry_is_accepted_by_verify_token: verify_token
  round-trips a claim-less JWT without ExpiredSignatureError.
- test_pat_null_expiry_end_to_end_allows_authenticated_request: use
  the null-expiry PAT against /auth/tokens and confirm it authenticates.

* docs(auth): document X-Forwarded-For trust model in _client_ip

Deployment runs behind Caddy which strips incoming X-Forwarded-For
and sets its own, so the leftmost hop is trustworthy. Clarify that
the stored last_used_ip is audit-only and never used for access
control — if the app is ever exposed directly, this value becomes
client-settable.

* docs: /profile → /tokens in install.sh next-steps, CLI error, HEADLESS_USAGE, security skill

After splitting PAT management to /tokens (with /profile as a back-compat
302), stale references remained in user-facing text. Update them to the
canonical /tokens URL so shell scripts, CLI error hints, docs, and the
bundled security skill are all consistent.
2026-04-22 14:24:28 +02:00
ZdenekSrotyr
060335deba
docs(quickstart): add Hackathon section pointing to switch-dev-vm.sh and HACKATHON.md (#14) (#23) 2026-04-21 21:59:23 +02:00
ZdenekSrotyr
1ca5295d54
docs: add HACKATHON.md — condensed deploy + dev playbooks (#21)
Written for both humans and AI agents — explicit commands, expected
outputs, troubleshooting tables, 'safe to run anytime' vs 'requires
thought' sections, pitfalls checklist.

Three parts:
1. Deploy for a new customer (45 min target, 7 steps)
2. Develop against Agnes (branch → image → dev VM loop, common tasks)
3. AI agent checklist (guardrails, verification, common pitfalls)

Complements the deep docs (ONBOARDING.md, DEPLOYMENT.md, architecture.md)
with a practical quick-reference for hackathon-style deploys.
2026-04-21 21:33:06 +02:00
ZdenekSrotyr
2cbffce85f
ci: propagate infra-v* tags to template repo + auto-merge rules (#17)
* dryrun: verify per-branch GHCR tag

* ci: propagate infra-v* tag bumps to template repo

On push of any infra-v* tag, opens a PR in keboola/agnes-infra-template
that bumps the module ref in terraform/main.tf. Auto-merge rules in the
template (Renovate + CI validate + GitHub native auto-merge) land it
without manual work on patch/minor bumps.

Requires repo secret TEMPLATE_REPO_TOKEN (fine-grained PAT with
Contents:write + Pull requests:write on keboola/agnes-infra-template).

Fail-soft: if secret is missing the job is skipped and Renovate on the
template repo picks up the new tag on its next cycle as a fallback.

* docs(onboarding): 'Keeping the template up-to-date' maintainer section

Documents the two mechanisms (upstream release hook + Renovate), the
required repo settings (allow_auto_merge, validate.yml gate), the TOKEN
secret setup, and the one-time setup checklist. Notes the difference
between template repo (auto-merge on) and customer infra repos
(human approval).
2026-04-21 21:32:58 +02:00
ZdenekSrotyr
1a55167234 docs: workflow-driven VM recreate for startup-script propagation
- ONBOARDING.md: replace 'propagating module changes' section with two
  explicit options — workflow_dispatch with recreate_targets (recommended,
  CI audit trail), or local terraform apply -replace (emergency). Adds a
  'do not' section banning manual .env edits on VMs.
- deployment-log.md: iteration 4 summary (version badge + module v1.5.0 +
  workflow_dispatch).
2026-04-21 20:24:31 +02:00
ZdenekSrotyr
cdd959b19f docs(log): add iteration 3 — review, bootstrap fix, docs sweep, infra-v1.4.0 2026-04-21 20:09:13 +02:00
ZdenekSrotyr
0121354596 docs: refresh DEPLOYMENT.md and ONBOARDING.md for infra-v1.4.0
- docs/DEPLOYMENT.md: rewritten to pick between Terraform (managed) and
  Docker Compose (OSS self-host). Old manual SSH-key-and-git-clone flow
  replaced with compose-based instructions pointing at the persistent-disk
  overlay and bootstrap endpoint.
- docs/ONBOARDING.md: section 4 now documents the new v1.4.0 variables
  (runtime_secrets, firewall_ssh_source_ranges, notification_channel_ids,
  compose_ref). Section 6 explains the /auth/bootstrap seed-user fix and
  warns that destroy+apply reopens the bootstrap window until run again.
- README.md: Documentation list expanded — ONBOARDING.md first (recommended
  path), DEPLOYMENT.md as the branching point, plus links to CONFIGURATION,
  architecture, and QUICKSTART.
2026-04-21 20:07:43 +02:00
ZdenekSrotyr
6470e23df3 docs: finalize deployment log — iteration 2 summary 2026-04-21 19:11:07 +02:00
ZdenekSrotyr
0b4807a836 docs(onboarding): use 'gh repo create --clone' to avoid template-copy race
Separate 'gh repo create --clone=false' + 'git clone' races with GitHub's
template content propagation. '--clone' waits for it in one step.
2026-04-21 19:10:04 +02:00
ZdenekSrotyr
3e9213bfc4 docs(onboarding): add module propagation, backup restore, monitoring setup
- 'Propagating module changes' — explains ignore_changes + -replace workflow
- 'Restoring from backup' — step-by-step disk swap from daily snapshot
- 'Monitoring alerts' — wiring notification channels
2026-04-21 19:06:20 +02:00
ZdenekSrotyr
03dd81c825 docs: update deployment log with final state and onboarding workflow
- Volume fix documented (Docker named volume → bind mount /data)
- Watchtower → cron-based auto-upgrade
- Final state snapshot of VMs, repos, tags, secrets
- Onboarding flow summary for 2nd customer
2026-04-21 16:51:20 +02:00
ZdenekSrotyr
a44e11a5e2 docs: add ONBOARDING.md — end-to-end per-customer deployment guide 2026-04-21 16:49:45 +02:00
ZdenekSrotyr
e53de59a42 docs: multi-customer deployment spec + implementation plan
- Spec: pure self-deploy model with per-customer GCP project
- Public upstream repo with TF module; private template + per-customer repos
- Branch-aware dev VMs via dev_instances list
- Caddy TLS, Secret Manager for tokens, SA JSON key for CI (WIF follow-up)
- 6-phase implementation plan with bite-sized tasks
2026-04-21 15:25:17 +02:00
ZdenekSrotyr
bd6921c4d5 docs,tests: anonymize customer references
Replace identifying customer names and infrastructure URLs in
documentation and test fixtures with generic placeholders.
Test semantics preserved.
2026-04-21 11:56:19 +02:00
ZdenekSrotyr
51f60bbf91 docs: add comprehensive test suite implementation plan (8 tasks, 6 parallel blocks)
Covers shared infrastructure, API gaps, CLI gaps, services, connectors,
E2E journeys, Docker and live tests. Tasks 2-7 are independent for
parallel sub-agent dispatch.
2026-04-12 10:44:08 +02:00
ZdenekSrotyr
55d11920ef docs: add comprehensive test strategy spec (6 parallel blocks, 4 layers)
Covers gap analysis, 8 critical E2E journeys, shared test infrastructure,
Docker E2E and live test design for full project coverage.
2026-04-12 10:33:26 +02:00
ZdenekSrotyr
816168f96b docs: add remote query implementation plan (5 tasks)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 11:02:04 +02:00
ZdenekSrotyr
eb68e6292d docs: fix remote query spec after code review
- Address read-only LOAD uncertainty with verification step + workaround
- Clarify register_bq wraps BQ logic (not delegates to register_bq_table)
- Use existing max_bq_registration_rows config key name
- Apply SQL blocklist to both register_bq and final sql
- Define connection lifecycle (caller owns, try/finally)
- Fix CLI argument handling (optional positional + --sql flag)
- Document concurrency safety (Unix inode semantics)
- Handle missing google-cloud-bigquery gracefully

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 10:58:25 +02:00
ZdenekSrotyr
017cf07674 docs: add design spec for remote query (extension re-attach + two-phase BQ)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 10:52:39 +02:00
ZdenekSrotyr
344d744089 feat: add 10 starter pack metrics (revenue, usage, sales, operations) 2026-04-10 19:35:28 +02:00
ZdenekSrotyr
06ac937f8b docs: add implementation plans for porting internal features
Three independent plans following TDD approach:
1. Business metrics (10 tasks) — schema v4, repository, CLI, API, starter pack, profiler integration
2. Analyst bootstrap (4 tasks) — da analyst setup, CLAUDE.md template, freshness check
3. Metadata writer (4 tasks) — column metadata repo, CLI, API, Keboola push

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:08:55 +02:00
ZdenekSrotyr
c57e195932 docs: fix design spec after code review
Addresses all Critical and Important issues found by reviewer:
- Fix schema migration details (_V3_TO_V4_MIGRATIONS, _ensure_schema chain)
- Add YAML-to-DuckDB field mapping table (table→table_name)
- Remove unexplained src/metrics.py from new files
- Fix API endpoint URLs (table/{id} → {table_id}, /api/data/tables → /api/catalog/tables)
- Commit to da analyst as top-level command (not sub-sub-command)
- Fix CLAUDE.local.md path to .claude/CLAUDE.local.md
- Remove duplicate --upload-local flag (--upload-only already exists)
- Detail profiler refactor call sites
- Add metrics API deprecation plan for catalog endpoint
- Use {metric_id:path} for slash-containing IDs
- Add --force flag and resume behavior for bootstrap
- Specify proposals directory path
- Simplify da metrics add to --file import

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:58:39 +02:00
ZdenekSrotyr
1ce632bc0b docs: add design spec for porting internal features to OSS
Covers business metrics in DuckDB, analyst bootstrap flow,
and metadata writer — based on comparison with internal repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 18:49:34 +02:00
ZdenekSrotyr
6c53082295 feat: multi-instance deployment — all 14 must-have items from spec
CalVer CI (release.yml) with stable/dev channels, health endpoint
with version/channel/schema_version, JWT secret auto-generation with
file persistence, smoke test script + Docker-in-CI, pre-migration
snapshot, /api/admin/configure for headless setup, /api/admin/
discover-and-register, /setup wizard, OpenAPI snapshot test, custom
connector mount support, CHANGELOG, migration safety tests, startup
banner.

663 tests pass (6 new migration safety + 3 OpenAPI snapshot + 1
updated JWT test).
2026-04-10 11:57:42 +02:00
ZdenekSrotyr
cce179f114 docs: add versioned tags per channel (dev-YYYY.MM.N, stable-YYYY.MM.N) 2026-04-10 06:44:25 +02:00
ZdenekSrotyr
4ea22232ef docs: multi-instance deployment and versioning design spec 2026-04-09 21:14:21 +02:00
ZdenekSrotyr
c8e232e43e docs: update stale v1 docs to v2 Docker/FastAPI/DuckDB architecture
- CONFIGURATION.md: remove Flask/SendGrid/WEBAPP_SECRET_KEY references,
  update env vars to JWT_SECRET_KEY and SESSION_SECRET, point to
  config/.env.template and config/instance.yaml.example
- disaster-recovery.md: rewrite for Docker volumes; cover GCP disk
  snapshot backup/restore and full VM rebuild; drop systemd/nginx/SSH
- server.md: strip rsync, systemd, nginx, Linux group, and sudo
  sections; keep Docker Compose operations, log viewing, health checks,
  sync/admin CLI, and Jira webhook procedures
2026-04-09 18:44:25 +02:00
ZdenekSrotyr
22cfbfe5fb docs: update references to deleted files
- QUICKSTART.md: replace data_description.md.example copy step with
  note that tables are registered via the admin API or web UI
- NOTIFICATIONS.md: replace examples/ section with planned-feature note
- telegram_bot.md: remove examples/notifications/ rows from deployment
  table and example scripts section; note feature is planned
- dev_docs/README.md: remove plan-corporate-memory.md entry
- duckdb_manager.py: update comment from remote_query.py to query API endpoint
2026-04-09 17:15:19 +02:00
ZdenekSrotyr
988cdb4320 docs: add production deployment sections to DEPLOYMENT.md
Add GHCR pre-built images, HTTPS/Caddy, multi-instance (Terraform + manual), and instance update procedures.
2026-04-09 16:41:26 +02:00
ZdenekSrotyr
53f39bb38d chore: clean stale docs — rewrite architecture.md, remove old plans
- architecture.md rewritten for v2 (FastAPI, DuckDB, Docker) — removed
  all Flask/rsync/SSH/systemd references
- Deleted PLAN.md and REFACTORING_PLAN.md (completed, superseded)
- auto-install.md replaced with redirect to DEPLOYMENT.md
- Fixed absolute paths in superpowers plan doc
2026-04-09 09:06:13 +02:00
ZdenekSrotyr
1b219cabe9 fix: remove dead PRAGMA enable_wal code
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
2026-04-09 06:59:57 +02:00
ZdenekSrotyr
89154d043b chore: clean repo for public release — fix references, remove drafts
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch
2026-04-08 19:27:25 +02:00
ZdenekSrotyr
79443e0df4 fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type
  inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience:
  deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
2026-04-08 19:09:55 +02:00
ZdenekSrotyr
92fbb88c15 chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs 2026-04-08 12:10:47 +02:00
ZdenekSrotyr
1074d5ec49 feat: implement data access control — table-level permissions
Schema v3: add is_public column to table_registry (default true).

src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.

API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list

New admin permissions API (/api/admin/permissions) for grant/revoke.

28 access control tests + 733 total tests passing.
2026-03-31 12:33:31 +02:00
ZdenekSrotyr
18e5f0b6e8 feat: implement extract.duckdb contract — orchestrator + extractors
Phase 0: extend table_registry schema (v1→v2 migration), add
source_type/bucket/source_table/query_mode columns.

Phase 1: SyncOrchestrator ATTACHes extract.duckdb files into master
analytics.duckdb. Keboola extractor uses DuckDB extension with
legacy client fallback. BigQuery extractor is remote-only via
DuckDB BQ extension (no data download).

62 tests passing.
2026-03-30 20:12:56 +02:00
ZdenekSrotyr
0b9720d090 docs: rewrite core refactoring spec v2 — simplified extract.duckdb contract 2026-03-30 19:24:19 +02:00
ZdenekSrotyr
9ee7b3bd09 docs: add core refactoring design spec — DuckDB-centric extract architecture 2026-03-30 18:15:52 +02:00
ZdenekSrotyr
1287e63ed9 feat: complete system — web UI, all API endpoints, governance, admin, CLI commands
Major additions:
- Web UI: Jinja2 templates in FastAPI (login, dashboard, catalog, corporate memory, admin)
- API: catalog profiles/metrics, telegram verify/unlink/status, admin table registry CRUD
- Corporate memory governance: approve/reject/mandate/revoke/edit/batch + audit log
- Sync: real DataSyncManager trigger, sync-settings, table-subscriptions
- CLI: setup (init/test/deploy/verify), server (logs/restart/deploy/backup), explore
- Instance config integration (instance.yaml loaded at startup)
- 140 tests passing (25 new)
2026-03-27 16:52:22 +01:00
ZdenekSrotyr
07b396bfe2 docs: add refactoring plan, design spec, and gitignore updates 2026-03-27 15:42:57 +01:00
Petr
1318b74ff1 Add Corporate Memory governance — Phase 1 (data model + admin API)
Add admin curation layer between AI extraction and knowledge distribution.
Admins (km_admin flag in instance.yaml) can approve, reject, mandate, and
revoke knowledge items. Mandatory items distribute to all targeted users
automatically.

Three governance modes (configurable per instance):
- mandatory_only: admin controls everything, no user voting
- admin_curated: admin controls, users vote as feedback signal
- hybrid: mandatory from admin + optional from user voting

Three approval workflows:
- review_queue: nothing published without admin approval
- auto_publish: items go live immediately, admin intervenes retroactively
- threshold: confidence-based auto-publish (Phase 5)

Includes:
- 9 admin action functions (approve/reject/mandate/revoke/edit/batch/...)
- 11 new admin API endpoints under /api/corporate-memory/admin/
- Immutable audit log (audit.jsonl)
- Audience targeting via groups
- Automatic migration of existing items to "approved" status
- km_admin_required auth decorator
- 69 tests covering all governance logic
- Backward compatible: no config = legacy wiki behavior
2026-03-23 19:15:33 +01:00
Petr
95358448e6 Add modular LLM connector for Corporate Memory
Replace hardwired Anthropic API calls with a pluggable provider system.
Each deployment configures its AI provider in instance.yaml — switching
between Anthropic, LiteLLM, OpenRouter, or any OpenAI-compatible proxy
is a config change, not a code change.

New connectors/llm/ module:
- StructuredExtractor Protocol with extract_json() interface
- AnthropicExtractor: direct Anthropic SDK with retry + backoff
- OpenAICompatExtractor: any OpenAI-compatible proxy with three-layer
  structured output fallback (json_schema -> json_object -> prompt)
- Configurable structured_output policy (strict/json/auto)
- Custom exception hierarchy (auth/rate_limit/timeout/format/refusal)
- Zero secrets in logs: no API keys, prompts, or responses logged

Reviewed by: Google Gemini, Claude Sonnet, OpenAI GPT-5.4.
Security audit passed with all critical findings resolved.
2026-03-23 12:08:33 +01:00
Petr
84d14da611 Fix remote query UX: file-based stdin, ssh permissions, deprecation
Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
   claude_settings.json had `cat` in deny list, causing 2-3 failed
   attempts per query. Replaced with file-based approach: Write tool
   creates JSON file, then `ssh ... < file` avoids the cat deny.

2. ssh/scp commands were not in the allow list, requiring manual
   approval for every remote query. Added both to allow list.

3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
   parquet export. Replaced with .arrow().read_all().

Also added instruction for proactive hybrid analysis when remote
tables are available (agent was only using local data until asked).
2026-03-21 18:41:43 +01:00
Petr
67df4acd73 Add --stdin JSON mode to avoid shell escaping nightmare
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).

New --stdin mode reads query spec as JSON from stdin via heredoc:
  cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
  {"register_bq": {"alias": "SELECT ... FROM \`table\` ..."}, "sql": "..."}
  QUERY

Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.

Updated claude_md_template.txt to use --stdin as the primary method.
2026-03-21 12:15:50 +01:00
Petr
dce8454894 Add remote_query.sh wrapper, fix analyst SSH permissions
Analyst user (foundry_e_psimecek) couldn't access /opt/data-analyst/.
Added to data-ops group on server.

New scripts/remote_query.sh wrapper handles env setup (PYTHONPATH,
CONFIG_DIR, .env) so agents use simple:
  ssh alias 'bash ~/server/scripts/remote_query.sh --sql "..." --format table'

Updated claude_md_template.txt to use wrapper instead of raw commands.
2026-03-21 11:58:04 +01:00
Petr
d180b2014e Step 28: Remote query architecture for local+remote table JOINs
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.

Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).

Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
2026-03-21 11:39:15 +01:00
Petr
49adbe26ec Move server venv setup to bootstrap, remove cross-platform pip freeze sync
Server venv is created during bootstrap via SSH (same package list,
installed natively on Linux). Removes sync_data.sh section that copied
pip freeze output across platforms (Windows/macOS freeze is incompatible
with Linux).
2026-03-15 00:59:45 +01:00
Petr
508d92771f Generate setup instructions from bootstrap.yaml (single source of truth)
- Rewrite bootstrap.yaml as clean structured YAML with steps, commands,
  descriptions, conditions, and notes
- Add _generate_setup_instructions() in app.py that reads YAML, substitutes
  placeholders, and produces clipboard-ready plain text
- Replace 50-line hardcoded JS string builder with single tojson variable
- All setup instructions now editable in one YAML file
2026-03-15 00:37:19 +01:00
Petr
021c453ea6 Auto-create .sync_connection via printf command in bootstrap
Replace 'Save this to .sync_connection' prose with actual printf command
that Claude/user executes. Fix heredoc indentation in bootstrap.yaml.
2026-03-15 00:05:42 +01:00
Petr
2237334b05 Make CLAUDE.md template generic and instance-aware
- Remove all Keboola-specific content (metric categories, MRR/ARR refs,
  corporate memory, hardcoded server IP)
- Add {ssh_alias}, {server_host}, {webapp_url} placeholders
- Bootstrap saves .sync_connection file with instance details
- sync_data.sh reads .sync_connection to substitute all placeholders
- Text instructions in dashboard include .sync_connection step
2026-03-14 23:57:58 +01:00
Petr
140cbb3cee Make bootstrap.yaml instance-agnostic with configurable SSH alias
Add {ssh_alias} and {ssh_key} placeholders so each instance can use
its own SSH config name (avoids conflicts when user has multiple instances).
Remove Keboola-specific sync_settings and dataset references.
Simplify to single download_server_data step (rsync with scp fallback).
Handle SSH alias conflicts gracefully.
2026-03-14 20:58:26 +01:00
Petr
da6d605ae0 Add sample metric YAML as fallback when OpenMetadata metrics unavailable
The /api/v1/metrics endpoint may not be available in all OpenMetadata instances.
This sample metric provides a fallback for demonstration purposes.
2026-03-12 15:14:04 +01:00
Petr
5fc9526627 Phase 2: Replace demo YAML metrics with OpenMetadata catalog data
- Add get_metric_by_fqn() to OpenMetadataClient
- Add get_metrics() to CatalogEnricher with TTL caching
- Implement _parse_om_metric() to extract category/grain from OpenMetadata tags
- Implement _load_metrics_from_catalog() to fetch and categorize metrics
- Implement _build_om_metric_detail() to convert OpenMetadata format to MetricParser JSON
- Add /api/catalog/metrics/<fqn> endpoint for metric detail modal
- Update _load_metrics_data() to prefer catalog over YAML fallback
- Update metric_modal.js to route catalog:{fqn} to catalog API endpoint
- Delete 10 demo YAML files from docs/metrics/
- Replace metric tests with new unit tests for catalog parsing functions (19 tests)

Catalog metrics provide single source of truth vs maintaining demo YAML files.
UI remains unchanged - only data source changes from YAML to OpenMetadata catalog.
2026-03-12 15:10:42 +01:00
Petr
c77a6f6c2e Fix clipped annotation badges in theme-reference.html
Remove overflow:hidden from mockup containers and reposition
surface/text_primary badges that were cut off at edges.
2026-03-11 14:09:04 +01:00
Petr
d438438e33 Add configurable white-label theming via instance.yaml
Extend theming from 3 CSS variables (primary colors only) to 14
configurable properties covering colors, fonts, borders, and shape.
All values are optional with sensible defaults.

- New _theme.html include replaces duplicated inline injection
- Wire theme include into all 7 templates (base, login, dashboard,
  catalog, admin_tables, activity_center, corporate_memory)
- Conditional font loading: skip default Inter when custom font_url set
- Config.theme_overrides() classmethod generates CSS variable dict
- Visual theme-reference.html guide for instance configurators
- Document all theme keys in instance.yaml.example
2026-03-11 13:58:58 +01:00
Petr
49559fba1b Remove hardcoded Jira and Telemetry cards from catalog
These Keboola-specific data source cards don't belong in the OSS repo.
The catalog now shows only dynamic content: Core Business Data (from
data_description.md) and Business Metrics (from docs/metrics/*.yml).

Also update auto-install.md with Business Metrics documentation,
pipeline diagram, and expanded checklist.
2026-03-10 22:48:07 +01:00
Petr
5a84473213 Add dynamic Business Metrics with sample e-commerce definitions
Replace hardcoded Keboola-specific metrics card in Data Catalog with
dynamic Jinja template that renders whatever metric YAMLs exist in
docs/metrics/. Add 10 sample e-commerce metric definitions across
4 categories (revenue, customers, marketing, support) that align
with the sample data generator tables.

Key changes:
- MetricParser: new category colors + dynamic sql_* field discovery
- _load_metrics_data(): scans docs/metrics/*/*.yml with prod fallback
- catalog.html: 240 lines hardcoded HTML -> 35 lines Jinja loop
- metric_modal.js: regex-based category class removal, new categories
- 21 tests validating YAML schema, parser, and loader
2026-03-10 22:38:44 +01:00
Petr
f685dc357f Document Data Catalog and Profiler pipeline in auto-install guide
- Add architecture diagram showing data flow from instance config
  through profiler to webapp
- Explain folder_mapping dual purpose (catalog categories + file paths)
- Add Step 6c for running the profiler
- Document foreign_keys for relationship diagrams
- Explain profiles.json fallback for catalog header stats
- Expand checklist with profiler verification steps
2026-03-10 22:14:45 +01:00
Petr
7f61ae8772 Update auto-install docs with Data Catalog setup
- Split Step 6 into 6a (Generate Parquet) and 6b (Configure Data Catalog)
- Document data_description.md + instance.yaml catalog categories
- Uncomment data_description.md symlink in Step 3c
- Add Data Catalog verification to Step 6 checklist
2026-03-10 22:00:28 +01:00
Petr
302494b632 Add --format parquet using project's ParquetManager
Generator now supports --format {csv,parquet,both}. Parquet mode
uses src.parquet_manager.ParquetManager for snappy compression,
proper column types (DATE, TIMESTAMP, DOUBLE), and metadata.
No more ad-hoc pandas conversion needed on the server.
2026-03-10 21:46:20 +01:00
Petr
44bf43535b Add sample data generator with 9 e-commerce tables
Synthetic data generator for demo/testing without real data adapter:
- 9 tables: customers, products, campaigns, web_sessions, web_leads,
  orders, order_items, payments, support_tickets
- 4 size presets: xs (1MB), s (15MB), m (150MB), l (1.5GB)
- Realistic patterns: seasonality, Pareto customer distribution,
  segment-based behavior, referential integrity
- Deterministic output via --seed parameter

Also: docs/sample-data.md, updated auto-install.md with Step 6,
updated CLAUDE.md (email auth provider, dual-repo architecture)
2026-03-10 12:31:14 +01:00
Petr
879bc6c44f docs 2026-03-10 11:43:11 +01:00
Petr
495940d6b8 Rewrite auto-install guide with dual-repo architecture
Document the full end-to-end workflow: OSS repo (code) + private
instance repo (config/secrets). Covers SSH key isolation per repo,
symlink bridging, and ongoing deployment workflow.
2026-03-10 11:38:41 +01:00
Petr
a8a9efeb60 Update auto-install docs with steps 3-4 (config + email auth) 2026-03-10 10:43:39 +01:00
Petr
f2d3d156e3 Move standalone services from server/ to services/
Extract 4 self-contained services into services/ module:
- server/telegram_bot/ -> services/telegram_bot/
- server/ws_gateway/ -> services/ws_gateway/
- server/corporate_memory/ -> services/corporate_memory/
- server/session_collector.py -> services/session_collector/

Each service now has its own systemd/ directory with .service and .timer files.
deploy.sh updated to auto-discover service units from services/*/systemd/*.

server/ now contains only deployment infrastructure (deploy.sh, setup scripts,
bin/ management tools, sudoers, nginx config).

All imports updated: webapp/app.py, server/bin/ scripts, systemd ExecStart paths.
2026-03-09 12:54:30 +01:00
Petr
38b86127ed Branding cleanup: remove Keboola-specific references from docs and config
- server/deploy.sh: KEBOOLA_ENV_FILE -> SYNC_ENV_FILE
- server/ws-gateway.service, notify-bot.service: remove Keboola from descriptions
- .gitignore: generic comment for data directory
- CLAUDE.md, README.md, ARCHITECTURE.md: update paths from src/adapters to connectors/
- docs/DATA_SOURCES.md: update custom connector guide to connectors/ pattern
- connectors/jira/README.md: keboola-analyst -> data-analyst in config paths
- dev_docs/desktop-app.md: KeboolaAnalyst -> DataAnalyst branding
2026-03-09 12:22:27 +01:00
Petr
d8226c6641 Restructure docs for OSS readability
Remove redundant docs (GETTING_STARTED, README index, jira_schema),
add ARCHITECTURE.md and llms.txt for AI-era discoverability,
move notifications.md to docs/future/NOTIFICATIONS.md.
2026-03-09 10:42:45 +01:00
Petr
26c4e0934d OSS cleanup: remove internal references, harden deployment, add config env interpolation
Phase 1 - Internal reference cleanup:
- Delete dev_docs/meetings/ (internal meeting notes/transcripts)
- Replace hardcoded usernames (padak/matejkys/dasa) with deploy/generic
- Replace "Internal AI Data Analyst" with "AI Data Analyst"
- Replace keboola/internal_ai_data_analyst URLs with your-org/ai-data-analyst
- Replace /tmp/keboola_load/ with /tmp/data_analyst_staging/ in dev_docs

Phase 2 - Deployment hardening:
- Tighten sudoers wildcards to explicit paths (visudo, sudoers cp)
- setup.sh creates all groups (data-ops, dataread, data-private) and deploy user
- webapp-setup.sh copies sudoers-webapp from repo instead of inline definition
- deploy.sh conditional copy for data_description.md (not in git for OSS)
- deploy.sh ownership changed to deploy:data-ops for /data/{scripts,docs,examples}

Phase 3 - Config and misc:
- Add ${ENV_VAR} interpolation to config/loader.py
- Expand config/instance.yaml.example with all sections (admins, deployment, auth, etc.)
- Create config/.env.template for secret values
- Add MIT LICENSE
- Fix .gitignore: add .venv/, docs/data_description.md
- Fix README.md: CSV status Planned, remove metrics/, update license text
- Translate Czech comments in requirements.txt to English
- Fix test_account_service.py: mock username mapping instead of relying on instance config

All 118 tests pass.
2026-03-09 07:59:57 +01:00
Petr
c56905d34f Initial commit: OSS data distribution platform
Open-source AI data analyst platform extracted from internal repo.
Includes data sync engine, Keboola adapter, Flask web portal,
server deployment scripts, and configuration templates.
2026-03-08 23:31:28 +01:00