Commit graph

292 commits

Author SHA1 Message Date
ZdenekSrotyr
4c4dfee8e6 feat(profile): /profile/sessions page + audit on detector exception + correct SCHEDULER_AUDIT_ACTIONS
Three changes addressing user feedback during e2e test of #179 + Devin Review on e86dd5ed.

1) /profile/sessions — new self-service user page in the user menu.
   Lists all session jsonls the caller uploaded via `agnes push` joined
   against session_extraction_state. Each row shows uploaded_at, file
   size, status badge (pending/processed/extracted), processed_at, and
   items_extracted. The page docstring + help text explicitly call out
   that items_extracted=0 means the verification detector ran fine but
   the LLM found no claims to track — that's the documented "no items"
   outcome, not a broken pipeline. Closes the gap surfaced during the
   e2e test of #176 where a user could see their sessions on disk and
   process them through the LLM but had no UI to inspect what happened.

2) run_verification_detector audits unhandled exceptions (Devin #1).
   If detector.run() threw anything other than the already-translated
   ValueError, the audit_log row was never written. The endpoint now
   wraps detector.run in try/except, records the exception in
   audit_params["unhandled_error"], then re-raises as 500 after audit.
   The /admin/scheduler-runs page surfaces the failure row with the
   error type + message.

3) SCHEDULER_AUDIT_ACTIONS list corrected (Devin #2). Previous list
   had "marketplaces_sync_all" (wrong — actual is "marketplace.sync_all")
   plus "data_refresh" and "scripts_run_due" which app/api/sync.py and
   app/api/scripts.py don't write to audit_log. Fixed to the four
   actually-logged strings; comment points at the missing audit calls
   as a follow-up.

Tests: tests/test_web_ui.py adds TestAdminRoleGuards::test_profile_sessions_page_no_admin_required and tightens test_admin_scheduler_runs_page_admin_only to assert the correct marketplace.sync_all string.
2026-05-05 08:57:35 +02:00
ZdenekSrotyr
e86dd5edc5 fix(anthropic): strict json_schema (additionalProperties=false) + add /admin/scheduler-runs UI
E2E test on a real BQ deploy showed every verification-extraction call
fails with HTTP 400 invalid_request_error: "output_config.format.schema:
For 'object' type, 'additionalProperties' must be explicitly set to false".
The Anthropic structured-output API now requires the field on every object
node in the json_schema. Fix: connectors/llm/anthropic_provider.py wraps
the caller-supplied schema through a recursive _strict_json_schema()
walker that adds the field where missing (preserving any explicit
override), then passes the strict variant to the API. Six unit tests in
TestStrictJsonSchema pin the recursion across nested objects, array items,
and the no-mutation invariant.

Adds /admin/scheduler-runs — a read-only admin page that surfaces the
last 200 audit-log entries from scheduler-driven actions. New
AuditRepository.query_actions(actions, limit) helper, new admin nav
entry. Failed scheduler ticks (HTTP 401, network errors) don't reach
the audit_log; the page calls that out with a hint to set
SCHEDULER_API_TOKEN if no rows show up.
2026-05-05 08:00:57 +02:00
ZdenekSrotyr
9f9aabd72b fix(corporate-memory): CLI catches fail-fast ValueError, exits 1 with clean message (Devin Review on #179)
The PR's #176 fail-fast change made collect_all() raise ValueError when
neither an ai: block nor ANTHROPIC_API_KEY/LLM_API_KEY was available.
verification_detector's CLI was updated to handle it; corporate_memory's
CLI was missed and crashed with an unhandled traceback.

services/corporate_memory/collector.py:main() now wraps the collect_all
call in try/except ValueError, prints a one-line actionable message
to stderr, and returns rc=1.

Regression test:
test_llm_connector.py::TestCorporateMemoryCollector::test_main_returns_1_on_no_ai_config_instead_of_traceback.
2026-05-05 06:45:10 +02:00
ZdenekSrotyr
e68c2d3f0f fix(session-collector): argv-free run() helper, drop SystemExit footgun (Devin Review on #179)
run_session_collector called collector.main() which did argparse.parse_args()
on uvicorn's sys.argv (['app.main:app', '--host', ...]) → sys.exit(2) →
SystemExit(2), which inherits from BaseException, escapes FastAPI handlers,
and propagates through the thread pool. Every scheduler tick that fired the
endpoint either 500-ed or risked killing the uvicorn worker.

services/session_collector/collector.py now exposes run(dry_run, verbose)
that returns (rc, stats); main() is a thin CLI shim that parses argv and
delegates. The admin endpoint calls run() directly and audit-logs the
per-run stats (users_processed, files_copied, files_skipped) instead of
just the rc. Three regression tests in TestRunHelper.

Closes Devin Review finding on app/api/admin.py:2819 (#179).
2026-05-05 06:31:55 +02:00
ZdenekSrotyr
fa3a76a528 fix(scheduler): single env var drives cadence + grace (#179 review)
Devin NOTABLE: SCHEDULER_VERIFICATION_DETECTOR_INTERVAL was already
read by app/api/health.py to compute the staleness grace window, but
the actual scheduler cadence was hardcoded to 'every 15m'. The env
var name implied it controlled the cadence — it didn't. An operator
throttling the detector via the env was silently ignored by the
scheduler while the health grace silently widened.

Wired the env var into both ends. Same pattern applied to the other
two LLM-pipeline jobs:
- SCHEDULER_SESSION_COLLECTOR_INTERVAL     (default 600s = 10m)
- SCHEDULER_VERIFICATION_DETECTOR_INTERVAL (default 900s = 15m)
- SCHEDULER_CORPORATE_MEMORY_INTERVAL      (default 1020s = 17m)

Defaults preserve the existing 10m / 15m / 17m coprime offset so the
three jobs don't fire on the same tick.

build_jobs() now reads all three through _read_positive_int (matching
the existing pattern for data-refresh / health-check / script-runner)
and feeds them to _seconds_to_schedule. The smallest-interval check
includes the new variables so an operator can't accidentally set a
tick larger than any LLM cadence.

New tests in tests/test_scheduler.py:
- TestLLMPipelineCadenceEnvVars: env override changes the schedule
  string at scheduler-init time, with parametrized invalid-value
  rejection.
- TestVerificationDetectorGraceFollowsCadence: pinning the
  single-source-of-truth contract — same env var moves both the
  scheduler cadence and the health-check grace.
2026-05-05 05:59:18 +02:00
ZdenekSrotyr
9f33e24bf9 fix(config): overlay-aware LLM consumers + env-ref resolution (#179 review)
Devin BUG: /api/admin/configure seeds an ai: block to the writable
overlay at DATA_DIR/state/instance.yaml, but the three LLM consumers
imported from config.loader.load_instance_config — which reads the
static config dir only. Even if they had read the overlay, the loader
ran yaml.safe_load directly without passing through _resolve_env_refs,
so '${ANTHROPIC_API_KEY}' would have stayed a literal placeholder. The
pipeline appeared to work because the factory falls back to the env
var directly, but the overlay path itself was dead code.

Two fixes, both required:

1. Switched the three LLM consumers to app.instance_config.load_instance_config:
   - services/corporate_memory/collector.py:collect_all
   - services/verification_detector/__main__.py:main
   - app/api/admin.py:run_verification_detector

2. app/instance_config.py runs the loaded overlay through
   config.loader._resolve_env_refs *before* the deep-merge, so
   '${ANTHROPIC_API_KEY}' resolves at config-load time.

New regression suite tests/test_instance_config_overlay.py pins:
- env-ref resolution against the overlay (resolved when env set,
  empty when env missing — never the literal placeholder)
- deep-merge still preserves static-only sections
- the three consumers reach app.instance_config (inspected via
  inspect.getsource so a future refactor that reverts the import
  fails the test)
- end-to-end: a seeded overlay + ANTHROPIC_API_KEY env reaches the
  factory with a resolved api_key
2026-05-05 05:57:22 +02:00
ZdenekSrotyr
98a8aba3be fix(tests): align test_llm_connector with new factory + fail-fast (#179 review)
The PR rewrote collect_all() to call the new
create_extractor_from_env_or_config() helper, but the existing tests
still mocked the old direct create_extractor() symbol and the old
silent-skip-on-missing-config behavior. Five tests in
TestCorporateMemoryCollector and one in TestCollectorExtractorIntegration
were red on the PR branch.

Changes:
- Tests now mock connectors.llm.create_extractor_from_env_or_config
  (the symbol the collector imports lazily).
- Renamed test_collect_all_no_ai_config_skips ->
  test_collect_all_no_ai_config_or_env_raises and
  test_collector_handles_invalid_config -> test_collector_raises_on_invalid_config.
  Both assert pytest.raises(ValueError) — the explicit fail-fast
  semantics defect 5 of #176 was supposed to enforce.
- collect_all() no longer swallows the factory's ValueError into
  stats["errors"]; it propagates so the scheduler / admin endpoint
  surface the actionable misconfiguration message instead of
  pretending the run was a no-op.
- /api/admin/run-corporate-memory translates the propagated ValueError
  into a 500 with the factory's message, matching
  /api/admin/run-verification-detector.
2026-05-05 05:55:01 +02:00
ZdenekSrotyr
a621a415cc fix(health): session-pipeline staleness check (#176)
GET /api/health/detailed now returns a session_pipeline service entry.
Heuristic:
  max(mtime of /data/user_sessions/**/*.jsonl) <=
  max(processed_at in session_extraction_state) + grace_seconds

grace_seconds = 2 × verification-detector cadence (default 30 min;
configurable via SCHEDULER_VERIFICATION_DETECTOR_INTERVAL).

When the assert fails, status='warning' (never 'error') with an
actionable detail pointing at the verification-detector scheduler job.
A warning bubbles up to the existing overall='degraded' aggregation —
operators querying /api/health/detailed (or /agnes diagnose system)
get a clear breadcrumb instead of a silently-broken pipeline.

Cold-start case (no session files, or files newer than the grace
window with empty state table) is handled explicitly to avoid noise
on a fresh deploy.

Tests: tests/test_health_session_pipeline.py.
2026-05-05 00:04:28 +02:00
ZdenekSrotyr
c53c1e1572 fix(ui): admin pending-review banner on /corporate-memory (#176)
The /corporate-memory page filters status IN ('approved','mandatory')
and showed no hint that pending items exist. With approval_mode set to
'review_queue' (the default in instance.yaml.example), every collection
run would silently funnel new items into the pending bucket where no
operator ever saw them.

For admins (is_km_admin), the page now renders a banner above the
stats bar:
  N pending items awaiting review — review them at /corporate-memory/admin

Non-admins see no change (the route zeroes the count server-side
before passing to the template, so the hint is never leaked).

Tests: tests/test_corporate_memory_page.py.
2026-05-05 00:01:22 +02:00
ZdenekSrotyr
c3df03beb3 fix(compose): drop corporate-memory + session-collector services (#176)
**BREAKING** for operators using `COMPOSE_PROFILES=full` or custom
Compose overrides that referenced these stanzas — they're gone in
docker-compose.yml and docker-compose.prod.yml. The scheduler-v2 model
(previous commit) is now the sole driver: every cadence is a job in
services/scheduler/__main__.py:JOBS hitting an admin HTTP endpoint.

Why drop instead of keep behind `profiles: [full]`:
- The previous stanzas were tight `restart: unless-stopped` boot loops.
  When the scheduled run ended (every cycle), Docker re-spawned the
  container, defeating any cadence the service intended.
- The whole point of #176 is that there's now exactly one driver. Two
  drivers (scheduler HTTP + standalone container loop) would race on
  the same /data/user_sessions and knowledge_items writes.
- Removing the stanzas is a louder signal than commenting them out —
  operators upgrading get a clean failure mode (no stale containers),
  not a silently double-driven pipeline.

The Python entry points (services/{corporate_memory, session_collector,
verification_detector}/__main__.py) stay — they're still callable from
the CLI for manual one-shot runs and from the new admin endpoints.

docs/architecture.md updated to reflect the new schedule table.
tests/test_docker_compose.py pins the contract: the two services must
not reappear under either Compose file.
2026-05-04 23:59:44 +02:00
ZdenekSrotyr
45de71e8ab fix(scheduler): wire LLM pipeline into scheduler-v2 (#176)
The session-collector, verification-detector, and corporate-memory
services now run on the same scheduler-v2 model that already drives
data-refresh, health-check, script-runner, and marketplaces:

- New admin endpoints in app/api/admin.py:
    POST /api/admin/run-session-collector
    POST /api/admin/run-verification-detector
    POST /api/admin/run-corporate-memory
  All admin-gated, sync-def (FastAPI thread pool), with one audit row
  per invocation. Same single-writer-of-system.duckdb pattern as the
  existing /api/marketplaces/sync-all job.

- services/scheduler/__main__.py JOBS gains three entries with offset
  cadences (10m / 15m / 17m, all coprime modulo the 30s tick) so the
  three LLM-backed jobs don't fire on the same tick and stack their
  API + DB load.

- The verification-detector endpoint surfaces the LLM factory's
  fail-fast ValueError as HTTP 500 with the actionable message,
  preserving the no-silent-skip contract from the previous commit.

Tests:
- tests/test_admin_run_endpoints.py covers admin gating + scheduler
  registration + endpoint contract.
- tests/test_scheduler_sidecar.py existing tests continue to pass.
2026-05-04 23:57:43 +02:00
ZdenekSrotyr
bbb04ac041 fix(setup): seed default ai: block + env-var fallback (#176)
POST /api/admin/configure now writes a default ai: block into the
instance.yaml overlay when the request leaves it untouched and either
ANTHROPIC_API_KEY or LLM_API_KEY is set in the environment. The block
references the env var via ${VAR} syntax — secrets never land in YAML.

connectors.llm.factory grows create_extractor_from_env_or_config which
falls back to ANTHROPIC_API_KEY / LLM_API_KEY when ai_config is empty
and raises a clear ValueError when neither is available. Both
services/corporate_memory and services/verification_detector switch to
the new helper, replacing the old 'silently skip when ai: missing'
path that was the silent-failure root cause.

Tests:
- tests/test_setup_ai_block.py — overlay seeding contract.
- tests/test_llm_provider_env_fallback.py — fallback + fail-fast.
2026-05-04 23:55:19 +02:00
ZdenekSrotyr
d2104555c6 fix(deps): promote anthropic + openai to core dependencies (#176)
LLM provider SDKs are imported by services/corporate_memory and
services/verification_detector — both production code paths. Listing
them only in [project.optional-dependencies].dev caused the scheduler
container to boot-loop with ModuleNotFoundError on default
`docker compose up` deploys, because the Dockerfile installs core
deps only (`uv pip install --system --no-cache .`).

Adds tests/test_packaging.py to lock the contract: anthropic + openai
must live in [project].dependencies, not in dev extras.
2026-05-04 23:52:30 +02:00
ZdenekSrotyr
0612c1e1a1 fix(schema-v24): raise on deferred migration so retry path actually runs (Devin Review on db.py:1757)
Pre-fix: when v24 migration found rows to migrate but
data_source.bigquery.project was empty, it logged a warning per row
and returned normally. Schema_version then bumped to 24 unconditionally
→ next start's 'if current < 24:' gate skipped _v23_to_v24_finalize
forever, leaving rows in DuckDB-flavor SQL that the new
_wrap_admin_sql_for_jobs_api wrapping path rejects.

Devin escalated this from advisory ("idempotent retry") to critical
on rescan after my reply. The reply was wrong — the LIKE filter inside
the function gives idempotency IF the function is called again, but
the schema-version gate prevents that call from happening.

Fix (Devin's recommended Approach 1): raise RuntimeError BEFORE the
schema-version bump when rows need migration but project_id is empty.
The schema_version stays at 23, so on next start the 'if current < 24:'
gate fires and the migration runs again — this time with project_id
configured.

Side effect: a BQ-using deployment that hasn't set the project blocks
startup until they do. That's the right call for a config error that
would otherwise silently break all materialized tables. The error
message points at the right knob (data_source.bigquery.project +
restart).

No-rows-no-block invariant preserved: the early 'if not rows: return'
at the top of _v23_to_v24_finalize means non-BQ deployments are
unaffected.

Tests:
- test_v24_raises_when_project_not_configured_and_rows_need_migration:
  asserts raise + schema_version stays at 23 (the load-bearing
  invariant for retry-on-next-start to work)
- test_v24_skips_clean_when_no_rows_match_even_without_project:
  asserts non-BQ deployments don't block startup
- Existing 3 tests still pass
2026-05-04 23:11:34 +02:00
ZdenekSrotyr
36012e0833 fix(admin): register-table real-world UX gaps for materialized BQ
Three items from operator feedback after running the actual flow:

(1) Help docstring lied: "--bucket / --source-table ignored" for
materialized rows. Reality: --bucket is load-bearing because
`agnes schema <name>` builds the BQ identifier as
`bq.<bucket>.<source_table>`. An empty bucket registered the row but
broke schema/describe with HTTP 400 "unsafe BQ identifier in
registry". Fix: docstring rewritten to reflect reality, plus
client-side validation rejects materialized + empty bucket with a
clear error pointing at the right knob.

(2) Post-register UX cliff: `agnes pull` after register-table reports
"Updated 0 tables (1 total)" because registration adds a registry
row but does NOT trigger a parquet build. Operators routinely
assume something's broken when they need to run
`agnes setup first-sync` to kick off the materialization. Hint
emitted on success now points at first-sync.

(3) RBAC gotcha: `agnes catalog` is RBAC-filtered via
`resource_grants`, so non-admin users don't see freshly-registered
rows until a grant is created. Hint emitted on success now points at
`agnes admin grant create <group> table <name>`.

Tests: 8/8 in test_cli_admin_materialized.py, including two new
regression tests for the validation + the hint output.
2026-05-04 23:06:17 +02:00
ZdenekSrotyr
5915f92eaa fix(query-guardrail): single-pass alternation regex (Devin Review on query.py:464)
The iterative bare-name rewriter (one re.sub per name, longest-first)
was vulnerable to cross-contamination when the GCP project ID contained
a registered table name as a hyphen-delimited word.

Concrete repro:
  project        = 'my-ue-project'
  registered     = ['orders', 'ue']
  user SQL       = 'SELECT * FROM orders JOIN ue ON ...'
  iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...'
  iter 2 (ue):     '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen
                   creates word boundary on both sides) — corrupts
                   the iter-1 path

Fallback at query.py:576 caught the resulting BQ parse error and fell
back to per-table SELECT * estimate, so impact was over-estimation,
not fail-open — but the #171 partition-pruning fix silently degraded
to pre-fix behavior whenever a project name shared a hyphen-segment
with a registered table.

Fix: single re.sub call with an alternation regex sorted longest-first.
Single-pass means each source position is processed exactly once, so
freshly-inserted backticked text from one match isn't re-scanned by
later names in the alternation.

Regression test
test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name
covers the exact Devin repro.
2026-05-04 22:51:33 +02:00
ZdenekSrotyr
c432e90f62 fix(bq-materialize): TTL reclaim was dead code (Devin Review on extractor.py:166)
`_try_acquire_file_lock` opened the lock file with `open(mode='w')`
BEFORE the mtime check, which truncated the file and refreshed mtime
to now. The subsequent age check always saw ~0, so the TTL reclaim
branch was never reachable and `materialize.lock_ttl_seconds` was
a silently no-op config knob.

Repro:
  before open(w): mtime age = 100000s
  after  open(w): mtime age = 0s

Fix: stat the lock path BEFORE any open(). If pre-probe mtime is
older than TTL, unlink (forcing a fresh inode for the open + flock
that follows). Order is now stat-then-decide-then-probe, not
probe-then-stat-then-decide.

Two regression tests added in tests/test_bq_materialize_concurrency.py:
- test_stale_held_lock_is_reclaimed_despite_live_holder — exercises
  the full reclaim path with a still-living fcntl holder. Pre-fix
  this returned None (in_flight forever); post-fix returns a holder
  fd on a new inode.
- test_failed_probe_does_not_self_refresh_lock_mtime — sister test
  pins that a failed acquisition's mode='w' truncate doesn't
  pathologically loop.

Residual cross-process risk (genuinely overrunning materialize past
TTL races a fresh attempt — both write to the same parquet.tmp,
inode-level flock independence means new acquisition succeeds while
old holder is still alive) stays documented in the helper docstring.
In-process threading.Lock keyed on table_id blocks the single-process
race; cross-process protection relies on TTL being well above
longest plausible COPY (24h default).
2026-05-04 22:36:56 +02:00
ZdenekSrotyr
bc9dd5c5f0 test(setup-instructions): pin no-legacy-da-verbs invariant
Adds `test_unified_flow_uses_only_agnes_verbs` that asserts no `da `
substring (with trailing space, to dodge false positives on `Darwin` /
`database` / `adapter`) appears in any of the four
`resolve_lines()` shapes:

  - bare (no plugins, no ca)
  - plugins only
  - ca only
  - plugins + ca

Also pins the `agnes init --server-url … --token …` shape — commit
8784f10a's stale-on-disk-token fix relies on `init` receiving an
explicit `--token` argument; if a future refactor drops the flag from
the emitted command the test fails loudly instead of silently
regressing to 401-on-stale-token in production.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 10.
2026-05-04 22:20:40 +02:00
ZdenekSrotyr
2ee529533f refactor(setup-page): drop role query param
The `/setup` route no longer accepts `?role=analyst|admin`. The route
signature drops the `Literal[...] = Query(...)` parameter and the
silent admin-downgrade block (`if role == "admin" and not is_admin:
role = "analyst"`). The `role` ctx variable threaded into install.html
also goes away — Task 6 cleans up the template's role-tile UI and the
JS PAT-mint ternary.

`?role=` is silently ignored by FastAPI for unknown query params, so
existing bookmarks (none in production — the param was added in this
PR and never shipped) just degrade to the unified layout. No
RedirectResponse shim needed.

Tests: drop the entire `tests/test_setup_page_roles.py` file (eight
role-branching tests that no longer apply) and add
`tests/test_setup_page_unified.py` with three tests:

  - `test_setup_page_renders_unified_layout`
  - `test_setup_page_ignores_role_query_param`
  - `test_setup_page_renders_marketplace_for_user_with_grants`
  - `test_install_legacy_path_redirects_to_setup`

Also replace the role-aware `test_install_preview_*` tests in
test_web_ui.py with unified-layout assertions.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 5.
2026-05-04 22:16:59 +02:00
ZdenekSrotyr
291079b1d2 refactor(welcome-template): drop role param; resolve plugins per-user unconditionally
Removes the `role: Literal["analyst", "admin"] = "admin"` parameter from
`compute_default_agent_prompt`. The same RBAC pass
(`marketplace_filter.resolve_allowed_plugins`) now runs for every user —
admin or not. Users with no `resource_grants` rows get the
no-marketplace layout; users with grants get the marketplace block
inserted. Admin-vs-analyst is no longer a layout branch.

`render_agent_prompt_banner` no longer derives a `role` from
`user.is_admin`; it just delegates to `compute_default_agent_prompt`.
Two `compute_default_agent_prompt(...role=role)` call sites in
`app/web/router.py::setup_page` are updated to drop the keyword so the
route keeps rendering — Task 5 will remove the `?role=` query
parameter and the silent admin-downgrade block from the route signature
itself.

Tests: drop role-aware assertions from test_welcome_template_renderer
and test_welcome_template_api. Both files now assert the unified
default contains `agnes init` + `uv tool install` and bans the legacy
`agnes auth import-token` / `agnes auth whoami` verbs.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 4.
2026-05-04 22:13:46 +02:00
ZdenekSrotyr
74b7f6e254 feat(setup-instructions): preflight checks both git and claude
Renames `_git_check_block` to `_preflight_block` and adds a
`claude --version` check beside `git --version`. Both binaries are
required by the marketplace step — git for the clone fallback,
claude for `claude plugin marketplace add` / `claude plugin install` —
so checking them together gives one clear failure instead of two
confusing downstream errors.

Install hints: `npm i -g @anthropic-ai/claude-code` for Linux / WSL
plus a doc URL (https://docs.claude.com/claude-code) for the native
macOS / Windows installers. We don't try to one-line a native
installer; the canonical instructions live upstream.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 3.
2026-05-04 22:11:38 +02:00
ZdenekSrotyr
e16698c3cc refactor(setup-instructions): unified layout with mandatory agnes init
Adds `_step_numbers(*, has_marketplace, has_skills)` so step numbering
lives in one place instead of being split across three branches in
`resolve_lines`. Pins the unified layout in the tests:

  No plugins:     1 install, 2 init, 3 catalog, 4 diagnose, 5 skills, 6 confirm
  With plugins:   1, 2, 3, 4 preflight, 5 marketplace, 6 diagnose, 7 skills, 8 confirm

`agnes auth import-token` / `agnes auth whoami` are now banned from the
rendered prompt — `agnes init` subsumes them. The renamed
`test_resolve_lines_no_plugins_unified_six_step_layout` asserts those
strings are absent and that the new step headers (`Bootstrap your Agnes
workspace`, `Verify the data is queryable`) are present.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 2.
2026-05-04 22:10:05 +02:00
ZdenekSrotyr
9334beed15 refactor(setup-instructions): drop role param; collapse analyst/admin into one layout
Removes the `role: Literal["analyst", "admin"]` parameter from
`resolve_lines` / `render_setup_instructions` and deletes the
`_resolve_analyst_lines`, `_analyst_init_lines`, `_analyst_finale_lines`
helpers. The unified flow now always emits `agnes init` (the
workspace-rails delivery mechanism) in place of the legacy
`agnes auth import-token` + `agnes auth whoami` pair, and uses
`agnes catalog` as the smoke-verify step.

`agnes init` already verifies the PAT internally, and `agnes catalog`
doubles as a data-plane smoke check, so dropping `agnes auth whoami`
costs no signal.

Drops the now-redundant `tests/test_setup_instructions_analyst.py` and
patches the one ordering test in `tests/test_setup_instructions.py` that
referenced the old "Log in" / "Verify the login" headers. Also strips
the `role=role` kwarg from `compute_default_agent_prompt`'s call into
`resolve_lines` so the welcome-template render path keeps working;
welcome_template.py's own role param is removed in a follow-up task.

Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 1.
2026-05-04 22:08:48 +02:00
ZdenekSrotyr
8784f10a6b fix(devin-review): stale-token override + status sessions counter + lock comment
Three Devin Review findings on PR #173 addressed in one commit since
they're in adjacent code paths:

1. cli/commands/init.py:99 (\u{1F534}): `agnes init --token NEW` ran
   step 2 verify against the OLD on-disk token because `get_token()`
   read `~/.config/agnes/token.json` before the env var, and
   `_override_server_env` only set the env var. So `agnes init --force`
   on a machine with a stale token.json failed 401 with a confusing
   'token expired' even though the --token arg was valid.

   Fix: ContextVar-based override in `cli.config._token_override`
   checked by `get_token()` BEFORE the on-disk read.
   `_with_token_override` context manager scopes the override.
   `_override_server_env` now also sets the contextvar via
   `_with_token_override(token)`, so both env var and contextvar
   carry the override (env for back-compat with anything bypassing
   get_token; contextvar is the authoritative source).
   Async-safe (each task sees its own override) and leak-proof
   (resets on context exit).
   2 new tests: regression on stale-disk-token + scope leak guard.

2. cli/commands/status.py:43 (\u{1F7E1}): sessions_pending_upload only
   checked legacy `<workspace>/user/sessions/` and always reported 0
   in workspaces bootstrapped with `agnes init` (Claude Code writes
   to `~/.claude/projects/`, not the legacy path). Same bug we fixed
   for `agnes push` in 08e49591.

   Fix: route through `cli.lib.claude_sessions.list_session_files()`
   so status and push agree on what counts as a pending session.

3. connectors/bigquery/extractor.py:111 (\u{1F7E1}): docstring claimed
   "a live holder still wins the second flock attempt" — incorrect on
   Linux. After `unlink()` + `open()`, the new file is a new inode;
   fcntl.flock keys per-inode, so the old holder's lock does NOT block
   the new acquisition. In a genuine TTL-overrun scenario two writers
   CAN race the parquet.tmp.

   Fix: documentation only. Comment now honestly describes the
   inode-recreation behavior, names the threading.Lock as the actual
   in-process guard, and flags pid-gating as the next-iteration fix
   if real corruption surfaces. The 24h default TTL is well above
   typical COPY durations so the practical risk is low.

Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader
regression set.
2026-05-04 21:26:30 +02:00
ZdenekSrotyr
8233c3e3f9 chore(docs): replace stale da verbs and vendor-specific install paths
Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.

Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.

CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.
2026-05-04 21:22:19 +02:00
ZdenekSrotyr
976d0c7160 fix(pull): re-download parquet when file missing despite matching hash
Pre-fix `agnes pull` decided what to download from sync_state hash
equality alone:

    if server_hash != local_hash or tid not in local_tables or not server_hash:
        to_download.append(tid)

If the recorded local hash matched server but the actual parquet had
been deleted from disk, the download was skipped. The next DuckDB
view rebuild then fails on a missing file. Repro: `rm
server/parquet/X.parquet && agnes pull` → 'Updated 0 tables', X
still missing.

Failure modes that produce hash-equal-but-file-missing:
- manual `rm` of a single parquet
- operator-side cleanup of `server/parquet/`
- two workspaces sharing one user's
  `~/.config/agnes/sync_state.json` (TODO(workspace-scoped-sync-state)
  in pull.py): one workspace writes its parquets, the other reads
  sync_state and concludes 'I already have these'
- disk corruption / partial restore from backup

Fix: existence check runs alongside the hash compare. Missing file
forces a re-download regardless of hash equality. `parquet_dir` is
hoisted above the loop so the existence check is in scope when the
download set is built.

Tests: regression test for the hash-equal-but-missing-file case +
counterpart for the fast-path (hash-equal-and-file-present must
still skip).
2026-05-04 21:12:06 +02:00
ZdenekSrotyr
500db8cd3c fix(query-guardrail): dry-run user SQL not synthetic SELECT * (#171)
Closes #171. The /api/query cost guardrail used to dry-run a synthetic
`SELECT * FROM <table>` for each registered remote-BQ row referenced
by the user SQL — which made BigQuery estimate a full table scan, with
column projection, predicate pushdown, and partition pruning all
disabled. Narrow queries on big partitioned/clustered tables (the
documented happy path for `agnes query --remote`) hit ~30,000×
over-estimates and got rejected with 400 `remote_scan_too_large` even
when BQ's own dry-run reported single-digit MB.

Pavel's report on #171 traced the root cause and proposed the fix:
rewrite the user SQL to BQ-native syntax and dry-run it as a single
job, exactly the way `bq query --dry_run` works.

Implementation:
- New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered
  names (word-boundary, case-insensitive, longest-first to avoid prefix
  collisions) + bq."<ds>"."<tbl>" forms to backticked
  `<project>.<ds>.<tbl>` paths.
- _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap
  check uses the real estimate.
- Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only
  syntax like ::INT casts), the guard falls back to the pre-fix
  per-table SELECT * approach so non-portable queries still get a
  (loose) cap estimate instead of fail-opening. Non-parse BQ errors
  (forbidden, upstream) still propagate as 502.
- _bq_guardrail_inputs now also returns name_lookups so the rewriter
  has the (registered_name, bucket, source_table) mapping it needs.
- Per-table breakdown is unavailable from a composite dry-run; total
  bytes are pinned to dry_run_set[0] for the post-flight
  record_bytes(sum(...)) call to keep returning the right total.

Tests (7 new, 3 existing still pass):
- dry-run receives rewritten user SQL with WHERE clause intact (the
  load-bearing assertion for #171)
- single dry-run per request even with multiple registered tables
  (JOIN, UNION) referenced
- fallback to per-table SELECT * on bq_bad_request
- non-parse BQ errors (forbidden) still 502
- rewriter unit tests: bare + bq.path in same SQL, longest-name-wins
  on prefix collision, case-insensitive bare-name match
2026-05-04 21:08:21 +02:00
ZdenekSrotyr
bd462187e8 test(welcome-template): tighten default-rendered assertions to new agnes verbs
The renderer no longer emits the legacy "da analyst setup" verb (the analyst
flow uses `agnes init`, the admin flow uses `agnes auth import-token`). The
disjunction assertions ("da analyst setup" OR "agnes auth" OR "curl") were
permissive and would have silently kept passing even if the renderer
regressed. Replace them with role-aware assertions that match the actual
emitted markers and explicitly check that no legacy verb survives.
2026-05-04 21:07:51 +02:00
ZdenekSrotyr
8890b6f09b fix(post-merge): clean up stale da verbs introduced via #174 merge
Four call sites where #174 (branched from main before the agnes rename
fully landed in some files) emitted or referenced `da fetch`. None are
operator-visible runtime crashes — but `extractor.py` logs a stale
verb to the operator log and `DATA_SOURCES.md` is current docs:

- connectors/bigquery/extractor.py:431,434 (operator-facing log line on
  unverified BQ entity_type — was suggesting `da fetch`).
- docs/DATA_SOURCES.md:77,85 (current public docs, two refs to
  `da fetch` in the workflow + the BQ scope description).
- tests/test_cli_query_render.py:7 (module docstring listed
  `da fetch / agnes schema / etc.` — now `agnes snapshot create / agnes
  schema / etc.`).
- tests/test_cli_snapshot_create.py:1 (docstring referenced `(folded
  from `da fetch`)` — historical, removed; no value once the rename
  landed).

Pre-existing stale `da` references elsewhere in the branch (templates,
operator runbooks, internal comments) are not touched by this commit —
they live outside the merge surface and are a separate cleanup task.

Verified: 10/10 across the affected test files pass.
2026-05-04 20:57:36 +02:00
ZdenekSrotyr
e438170ade merge: pull #174 (BQ materialize view fix + concurrency, 0.33.0) into bootstrap branch
Brings in zs/materialize-sync-fix (PR #174):
- BigQuery view materialize works (wrap admin SQL in bigquery_query())
- Per-table mutex + fcntl.flock for concurrent COPY corruption
- Cost guardrail dry-run engages on materialized rows
- Schema v23 -> v24 migration: rewrite source_query to BQ-native
- Server-generated trivial source_query from bucket+source_table
- Validator backtick relaxation for materialized rows
- 0.33.0 release cut

Conflict resolution:
- CHANGELOG.md: keep our [Unreleased] (bootstrap rewrite content) ABOVE
  the new [0.33.0] section from #174. The bootstrap rewrite remains
  unreleased; it'll cut 0.34.0 (or later) when this PR merges to main.
- tests/conftest.py: union — keep our analyst-bootstrap fixture
  re-export AND #174's bq_instance / stub_bq_extractor fixtures.
- pyproject.toml auto-merged to 0.33.0 (matches the cut), correct.
- src/db.py auto-merged: SCHEMA_VERSION = 24, _v23_to_v24_finalize
  added — no overlap with our work which left schema at v23.
- CLAUDE.md auto-merged: schema-history paragraph extended with v24.

Verified: 79/79 across CLI bootstrap suite + materialize suite +
schema v24 migration tests pass locally on Python 3.13/macOS.
2026-05-04 20:53:00 +02:00
ZdenekSrotyr
e6a2c4c51d tests: rename 'prj-grp' placeholder to 'my-project' for vendor-agnostic OSS
The dashed identifier is what the test exercises (backticks required for
dashed BQ project IDs); the literal string can be any synthetic value.
'prj-grp' is too close to a real customer-prefix pattern that the OSS
vendor-scrub regex flags. 'my-project' matches placeholders used elsewhere
in the project.
2026-05-04 20:38:47 +02:00
ZdenekSrotyr
08e4959185 fix(push): read sessions from ~/.claude/projects/<encoded-cwd>/
Real bug: `agnes push` was reading `<workspace>/user/sessions/`, but
Claude Code writes session jsonls to `~/.claude/projects/<encoded-cwd>/`
and nothing on the analyst side ever copies them across. The SessionEnd
hook ran `agnes push` happily and uploaded zero sessions every time.

`cli/lib/claude_sessions.py` probes both Claude Code encoding variants
(older `/`→`-` keeping spaces+tildes; newer all-non-alphanumeric→`-`
with collapsed runs) and unions whichever exist. Users who upgraded
Claude Code mid-project end up with both encoded dirs side-by-side on
disk; the union ensures no session is left behind. Same-named jsonl in
both dirs → newest mtime wins. `<workspace>/user/sessions/` survives as
a fallback for any setup that explicitly mirrors sessions there.

Verified on real disk: helper returns 2 dirs + 8 unioned session files
for the Agnes-test workspace where the previous code returned 0.
2026-05-04 20:29:59 +02:00
ZdenekSrotyr
92d477e422 fix(setup): default /setup to analyst, hide admin tile from non-admins
Three coupled UX fixes for the analyst-onboarding flow:

1. Dashboard "Setup a new Claude Code" CTA was rendering admin paste
   prompt for everyone (analysts couldn't actually execute the marketplace
   plugin install / skills setup steps). render_agent_prompt_banner now
   picks role based on user.is_admin — analysts get the analyst flow.

2. /setup default role changed from admin to analyst. Most visitors are
   analysts; admin layout is opt-in via the admin tile or ?role=admin.

3. Admin tile is admin-only on the role-tile nav. Non-admins see only
   the analyst tile. Server-side: non-admin requesting ?role=admin is
   silently downgraded to analyst (otherwise they'd see admin paste
   prompt despite no tile).

Tests:
- New: test_setup_page_admin_tile_hidden_for_non_admin (anonymous client
  can't see "Admin CLI" or role=admin link)
- New: test_setup_page_admin_role_downgraded_for_non_admin (anonymous
  ?role=admin → analyst layout, no marketplace step in clipboard)
- New: test_install_preview_default_role_is_analyst (admin signing in to
  bare /setup gets analyst clipboard by default)
- Renamed: test_setup_page_default_role_is_admin → ..._is_analyst
- Updated: test_setup_page_admin_clipboard_renders_admin_layout uses
  FastAPI dependency_overrides to inject admin user (admin layout is
  now admin-gated)
- Updated: test_install_preview_visible_for_signed_in_user explicitly
  passes ?role=admin to exercise admin layout
2026-05-04 20:20:37 +02:00
ZdenekSrotyr
d8dc7c7799 fix: update legacy-string assertions in tests + onboarding template
Caught by my own broader test scope after Devin fixes — three test files
asserted on user-visible strings that were renamed by the bootstrap PR
but the assertions weren't updated:

- tests/test_api_query_guardrail.py:110 — asserted `da fetch in suggestion`
  on /api/query 400 response. Renamed to `agnes snapshot create`.
- tests/test_query_materialized_error_message.py:56 — asserted `da sync`
  in materialized-not-yet error detail. Renamed to `agnes pull`.
- tests/test_cli_error_render.py:71 — fixture data + assertion both
  carried `da fetch`. Updated to `agnes snapshot create`.

Plus an actual content miss: docs/setup/claude_settings.json (a template
shipped to operators) still installed `da sync` / `da sync --upload-only`
hooks. The companion test file (tests/test_setup_hooks_template.py) was
asserting that legacy state. Updated both:
- Template hooks: `agnes pull --quiet` / `agnes push --quiet`
- Test assertions + function name match the new commands
2026-05-04 20:08:07 +02:00
ZdenekSrotyr
3d58768143 fix: address Devin Review findings — incomplete renames + estimate guard
13 Devin findings across 10 files:

🔴 Critical:
- app/api/v2_catalog.py:42 — `_fetch_hint` returns `da fetch` in /api/v2/catalog
  responses (user-visible in every catalog list)
- cli/skills/agnes-data-querying.md — 11 stale `da fetch`/`da sync` refs in the
  bundled skill markdown
- config/claude_md_template.txt:38 — referenced `agnes pull --docs-only` flag
  that does NOT exist in agnes pull (removed; spec only ships --quiet/--json/
  --dry-run)

🟡 Important:
- app/api/admin.py:252 — `da fetch` in bq_max_scan_bytes hint
- cli/commands/auth.py:119 — `da sync` in import-token docstring (--help text)
- cli/commands/tokens.py:48 — "Export it so `da` can use it" prose
- ARCHITECTURE.md — 4 stale rows in CLI commands table
- README.md — stale paragraphs for analysts (da sync, da analyst setup)

🚩 Substantive observations addressed:
- app/api/query.py:249,302,489 — server-side error/help strings still said
  `da sync`/`da fetch` (returned in API responses to clients)
- cli/commands/snapshot.py:235-241 — DuckDB existence guard incorrectly
  blocked `--estimate` (server-side dry-run that never opens local DB).
  Added test ensuring estimate path skips the guard.

Skipped (intentionally historical):
- app/api/admin.py:2377,2429,2437 — historical comments describing past
  manifest-vs-sync_state bug; past tense, accurate to keep as `da sync`.
2026-05-04 20:05:06 +02:00
ZdenekSrotyr
5fa1c94b5c fix(tests): smoke matrix asserts no-traceback only (per-command rc varies) 2026-05-04 19:47:18 +02:00
ZdenekSrotyr
5162c488bb fix(tests): strip ANSI escapes from --help output before substring asserts
Typer/rich emits ANSI styling in CI's --help output (e.g. `--metrics`
becomes `-\x1b[0m\x1b[1;36m-metrics`), so literal substring asserts
like `assert "--metrics" in result.output` fail. Locally the test runner
auto-detects no-TTY and produces plain text, masking the issue.

Add a small `_clean()` helper per test file that strips ANSI escape
codes (`\x1b\[[0-9;]*m`) before substring containment checks.
2026-05-04 19:43:47 +02:00
ZdenekSrotyr
675f8e1909 chore(lint): drop unused imports from new test files (ruff F401) 2026-05-04 19:32:31 +02:00
ZdenekSrotyr
ce108d4c6d fix(schema): code-review follow-ups for fac10b29
- _v23_to_v24_finalize: wrap row-update loop in BEGIN/COMMIT/ROLLBACK
  to match the project's transactional-finalizer pattern (compare
  _v12_to_v13_finalize, _v17_to_v18_finalize, _v18_to_v19_finalize).
  Pre-fix a process crash mid-loop left the schema_version unchanged
  but partially-converted rows persisted across restart — idempotent
  overall but inconsistent with project convention.
- _v23_to_v24_finalize: re.sub replacement now uses a function-form
  (lambda) instead of an f-string, so any future project_id with a
  backslash sequence isn't misinterpreted as a group reference.
- tests: add a Keboola-source materialized row case asserting the
  SELECT's source_type filter prevents non-BQ rewrites.
2026-05-04 19:32:24 +02:00
ZdenekSrotyr
8403529fcd test: clean-install integration suite (minimal/zero grants, force, pre-init) 2026-05-04 19:22:24 +02:00
ZdenekSrotyr
fac10b29e4 feat(schema): v24 — rewrite materialized BQ source_query to BQ-native
Materialize now wraps admin SQL into bigquery_query('<billing>', '<inner>')
which requires the inner SQL to be BigQuery-flavor (backticked
identifiers, native function syntax). v24 migrates existing rows from
DuckDB-flavor (bq."ds"."tbl") to (`<project>.ds.tbl`) using the
configured BQ project. Idempotent on already-converted rows; logs a
warning and skips when the project isn't configured (operator can
configure + restart for retry).
2026-05-04 19:15:54 +02:00
ZdenekSrotyr
42e108ae5e test: reader smoke matrix on zero-grants workspace 2026-05-04 19:15:39 +02:00
ZdenekSrotyr
a47c2be282 test: clean-bootstrap fixtures (fastapi_test_server, test_pat, zero_grants_workspace)
Task 20: reusable pytest fixtures for the clean-bootstrap test suite.
Tasks 21 and 22 (reader smoke matrix + init smoke matrix) consume them.

- fastapi_test_server boots a real uvicorn subprocess against a tmp DATA_DIR,
  pre-seeded with admin@example.com (Admin group), analyst@example.com
  (Everyone group), and three tables (one per query_mode: local /
  materialized / remote).
- web_session: cookie-authenticated httpx.Client for the admin user.
- test_pat: minted JWT for the analyst with table grants on local +
  materialized.
- test_pat_no_grants: same shape, zero resource_grants.
- zero_grants_workspace: subprocess invocation of `agnes init` against the
  no-grants PAT; returns the bootstrapped workspace path.
- NONEXISTENT_TABLE: module-level sentinel for the upcoming reader matrix.

Subprocess uvicorn (mirrors tests/test_e2e_corporate_memory.py) instead of
in-thread so DATA_DIR + module-level singletons in src.db don't bleed
across tests. agnes CLI invoked via `python -m cli.main` instead of the
.venv/bin/agnes shim, which depends on .pth file visibility that iCloud
Drive intermittently re-hides on macOS.
2026-05-04 19:11:54 +02:00
ZdenekSrotyr
7e1dd1adba refactor(cli): drop sync/fetch/analyst/metrics; register init/pull/push (BREAKING) 2026-05-04 18:59:51 +02:00
ZdenekSrotyr
6c0846fd17 feat(config): expose materialize.lock_ttl_seconds in server-config
New top-level 'materialize' section, single field (lock_ttl_seconds).
Default 86400 (24h). Backs the file-lock TTL reclaim added in the
per-table-mutex change. Editable via PUT /api/admin/server-config and
the /admin/server-config UI.
2026-05-04 18:52:54 +02:00
ZdenekSrotyr
ff5da0af90 feat(cli): agnes admin metrics {import,export,validate} 2026-05-04 18:39:05 +02:00
ZdenekSrotyr
3871d5320a feat(admin): server-generate materialized source_query, allow BQ backticks
When admin registers a materialized BQ row with bucket+source_table but
no source_query, the server generates 'SELECT * FROM `<project>.<ds>.<tbl>`'
from instance.yaml's configured BQ project. Same fallback fires on PUT
when flipping to materialized. The backtick rejection guard, which was
appropriate for DuckDB-flavor source_query, is relaxed for materialized
rows since the new wrapping path (Task 2) runs admin SQL through BQ
jobs API which uses BQ-native syntax (backticks for dashed identifiers).
2026-05-04 18:37:27 +02:00
ZdenekSrotyr
42b8d0309b feat(cli): agnes catalog --metrics replaces da metrics list/show 2026-05-04 18:33:17 +02:00
ZdenekSrotyr
8309141705 feat(cli): agnes snapshot create (folded from da fetch); friendly exit if no DuckDB 2026-05-04 18:32:30 +02:00
ZdenekSrotyr
5e1e8c4e14 feat(cli): agnes status = workspace state; old health check moves to agnes diagnose system 2026-05-04 18:29:15 +02:00