Adds whole-Store backup/restore primitives so an external CI/CD job can
mirror the Store to a git repo (and restore back from one).
REST:
- GET /api/store/bundle.zip — deterministic ZIP of all (filtered) Store
entities. Layout: manifest.json + entities/<id>/{plugin,assets}/.
Manifest carries owner_email for cross-instance restore. Auth: any
authenticated user (Store is community-open).
- POST /api/store/import-bundle — admin-only restore. Modes
merge|replace|skip; owner resolution by email with stub-disabled-user
fallback when the email is unknown on the target instance.
CLI:
- agnes store update <id> [--description X] [--zip PATH] ... — in-place
edit (server PUT permits owner OR admin per F4). Closes the missing
edit affordance for analysts who want to fix a typo or push a new
ZIP without losing install_count.
- agnes store pull [-o store.zip] [--unpack DIR] — download the bundle.
--unpack streams + extracts so an external git-backup workflow can
drop the tree straight into a repo and `git add .`.
- agnes store info [--json] — counts + size summary.
- agnes admin store push <zip-or-dir> [--mode ...] — admin-only restore.
Auto-zips a directory client-side so a working-tree → server
round-trip is one command.
cli/v2_client.py gains api_get_stream helper for binary downloads.
Tests: 5 new server tests (bundle shape + filters + round-trip + stub
user creation + skip mode + admin-only gate) + 11 new CLI tests
(update, pull/unpack, info, admin push). 66/66 store-related tests
green locally.
create_entity + update_entity created the `scratch` temp dir inside one
try/finally but cleaned it up in a separate one. Validation HTTPExceptions
raised by _safe_zip_extract (zip_unsafe_path, zip_too_large_uncompressed)
or the BadZipFile→422 conversion exited the first scope, and the second
finally was never entered → temp dir leaked on every failed upload.
Devin flagged this on the F2 commit. The leak pre-existed (zip_unsafe_path
was the original vector); F2 added zip_too_large_uncompressed to the same
broken cleanup path. Fixed by collapsing scratch creation + cleanup into
one outer try/finally that covers both extraction AND metadata/bake; the
inner try/except/finally still handles BadZipFile→422 + tmp file cleanup.
Same restructure in update_entity. Regression test
`test_scratch_dir_cleaned_up_after_failed_extraction` triggers a
zip_unsafe_path 422 and asserts tmp/agnes_store_* contains no leaked
dirs.
Three independent reviews of PR #180 surfaced four real defects in the new
Store / my-ai-stack surface. CHANGELOG entries detail each; one-liners:
- F1 video_url XSS: any authenticated user could upload a Store entity
with `video_url=javascript:...` and pop XSS in any viewer's session via
the `<a href=...>` "Watch video" link in store_detail.html. Jinja2
autoescape doesn't block URI schemes inside attribute values. Fixed by
scheme-validating to http(s) only on create + update; 400 invalid_video_url.
- F2 ZIP decompression bomb: _safe_zip_extract checked path-traversal but
not declared file_size totals — a 50 MB compressed upload at 1:1000
ratio decompresses to 50 GB and DOS the host disk. Fixed by summing
zinfo.file_size across infolist() and refusing > 200 MB before
extractall touches disk. 413 zip_too_large_uncompressed.
- F4 admin authz parity: PUT /api/store/entities/{id} was owner-only while
DELETE allowed owner OR admin; the store-detail page hid Edit/Delete
buttons from admin even though DELETE was permitted. Fixed by allowing
admin on PUT and passing is_admin to the template; gate is now
is_owner OR is_admin everywhere.
- F5 cross-owner suffix collision: sanitize_username is many-to-one
(alice.smith / alice_smith both → alice-smith). Two such users uploading
entities with the same display name produced identical
`<name>-by-<username>` suffixes, silently colliding in the served
agnes-store-bundle on-disk paths AND the manifest catalog (Claude Code
dedupes by plugin.json `name`). Fixed by enforcing global uniqueness on
the suffixed value at create_entity; 409 conflict_global_suffix.
F3 (ZIP symlink members) was investigated and confirmed to be a
false-positive — Python's stdlib ZipFile.extractall does not honor
symlink mode bits, so no exploit exists.
9 new regression tests in tests/test_store_api.py::TestStoreSecurityFixes
covering all four. Test run locally: 60/60 store-related tests pass.
Adds a community-driven Store where any authenticated user uploads
skills/agents/plugins as ZIPs, plus /my-ai-stack as the per-user
composition view. The served Claude Code marketplace is now:
(admin_granted ∖ opt_outs) ∪ store_installs
Skill + agent installs are merged into a single `agnes-store-bundle`
plugin in the served marketplace; type=plugin uploads stay standalone.
Names are suffixed with `-by-<owner-username>` at upload time so two
owners can use the same display name without colliding in Claude Code's
flat skill/agent namespace.
Schema v23 → v24 adds three tables:
- store_entities — community-uploaded skills/agents/plugins
- user_store_installs — what each user has chosen to install
- user_plugin_optouts — opt-out overlay on top of admin grants
Admin grant-delete drops every user's opt-out for that plugin so
re-grant resets cleanly to enabled (no sticky personal preference).
UI:
- /store — e-commerce-style listing with type/category/owner
filters, search, pagination, owner-aware [Install]
buttons, clickable cards
- /store/new — 2-step upload wizard with drag & drop, preview
validation (POST /api/store/entities/preview), docs
multi-upload, photo + video URL
- /store/{id} — detail page with hero, file list, docs, owner
actions (Edit/Delete) for the uploader
- /my-ai-stack — Granted plugins (toggle opt-out) + From the Store
(uninstall) sections
- Admin nav: Marketplaces moved into Admin dropdown, renamed to
"Curated Marketplaces"
Validation hardening: type-mismatch guards reject skill ZIP uploaded as
agent (or vice versa), and plugin ZIPs masquerading as skills/agents.
Human-readable error messages mapped client-side from machine codes.
Cross-source naming: Store entity-id-prefixed dirs (`plugins/store-<id>/`)
plus the bundle (`plugins/store-bundle/`) avoid collisions with admin
marketplaces (whose `store` slug is reserved by `is_valid_slug`).
Bundle composition is content-hashed at serve time — install/uninstall
or owner re-upload bumps the bundle's plugin.json `version`, so Claude
Code's auto-update toggle picks up changes.
Tests: 50+ new tests across naming, repositories, filter (admin ∪ store
∪ bundle), API (upload/install/uninstall/delete/preview/docs), end-to-end
marketplace.zip with bundle merging.
Pre-fix: when v24 migration found rows to migrate but
data_source.bigquery.project was empty, it logged a warning per row
and returned normally. Schema_version then bumped to 24 unconditionally
→ next start's 'if current < 24:' gate skipped _v23_to_v24_finalize
forever, leaving rows in DuckDB-flavor SQL that the new
_wrap_admin_sql_for_jobs_api wrapping path rejects.
Devin escalated this from advisory ("idempotent retry") to critical
on rescan after my reply. The reply was wrong — the LIKE filter inside
the function gives idempotency IF the function is called again, but
the schema-version gate prevents that call from happening.
Fix (Devin's recommended Approach 1): raise RuntimeError BEFORE the
schema-version bump when rows need migration but project_id is empty.
The schema_version stays at 23, so on next start the 'if current < 24:'
gate fires and the migration runs again — this time with project_id
configured.
Side effect: a BQ-using deployment that hasn't set the project blocks
startup until they do. That's the right call for a config error that
would otherwise silently break all materialized tables. The error
message points at the right knob (data_source.bigquery.project +
restart).
No-rows-no-block invariant preserved: the early 'if not rows: return'
at the top of _v23_to_v24_finalize means non-BQ deployments are
unaffected.
Tests:
- test_v24_raises_when_project_not_configured_and_rows_need_migration:
asserts raise + schema_version stays at 23 (the load-bearing
invariant for retry-on-next-start to work)
- test_v24_skips_clean_when_no_rows_match_even_without_project:
asserts non-BQ deployments don't block startup
- Existing 3 tests still pass
Three items from operator feedback after running the actual flow:
(1) Help docstring lied: "--bucket / --source-table ignored" for
materialized rows. Reality: --bucket is load-bearing because
`agnes schema <name>` builds the BQ identifier as
`bq.<bucket>.<source_table>`. An empty bucket registered the row but
broke schema/describe with HTTP 400 "unsafe BQ identifier in
registry". Fix: docstring rewritten to reflect reality, plus
client-side validation rejects materialized + empty bucket with a
clear error pointing at the right knob.
(2) Post-register UX cliff: `agnes pull` after register-table reports
"Updated 0 tables (1 total)" because registration adds a registry
row but does NOT trigger a parquet build. Operators routinely
assume something's broken when they need to run
`agnes setup first-sync` to kick off the materialization. Hint
emitted on success now points at first-sync.
(3) RBAC gotcha: `agnes catalog` is RBAC-filtered via
`resource_grants`, so non-admin users don't see freshly-registered
rows until a grant is created. Hint emitted on success now points at
`agnes admin grant create <group> table <name>`.
Tests: 8/8 in test_cli_admin_materialized.py, including two new
regression tests for the validation + the hint output.
The iterative bare-name rewriter (one re.sub per name, longest-first)
was vulnerable to cross-contamination when the GCP project ID contained
a registered table name as a hyphen-delimited word.
Concrete repro:
project = 'my-ue-project'
registered = ['orders', 'ue']
user SQL = 'SELECT * FROM orders JOIN ue ON ...'
iter 1 (orders): produces 'FROM `my-ue-project.fin.orders` JOIN ue ...'
iter 2 (ue): '\bue\b' matches 'ue' INSIDE 'my-ue-project' (hyphen
creates word boundary on both sides) — corrupts
the iter-1 path
Fallback at query.py:576 caught the resulting BQ parse error and fell
back to per-table SELECT * estimate, so impact was over-estimation,
not fail-open — but the #171 partition-pruning fix silently degraded
to pre-fix behavior whenever a project name shared a hyphen-segment
with a registered table.
Fix: single re.sub call with an alternation regex sorted longest-first.
Single-pass means each source position is processed exactly once, so
freshly-inserted backticked text from one match isn't re-scanned by
later names in the alternation.
Regression test
test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name
covers the exact Devin repro.
`_try_acquire_file_lock` opened the lock file with `open(mode='w')`
BEFORE the mtime check, which truncated the file and refreshed mtime
to now. The subsequent age check always saw ~0, so the TTL reclaim
branch was never reachable and `materialize.lock_ttl_seconds` was
a silently no-op config knob.
Repro:
before open(w): mtime age = 100000s
after open(w): mtime age = 0s
Fix: stat the lock path BEFORE any open(). If pre-probe mtime is
older than TTL, unlink (forcing a fresh inode for the open + flock
that follows). Order is now stat-then-decide-then-probe, not
probe-then-stat-then-decide.
Two regression tests added in tests/test_bq_materialize_concurrency.py:
- test_stale_held_lock_is_reclaimed_despite_live_holder — exercises
the full reclaim path with a still-living fcntl holder. Pre-fix
this returned None (in_flight forever); post-fix returns a holder
fd on a new inode.
- test_failed_probe_does_not_self_refresh_lock_mtime — sister test
pins that a failed acquisition's mode='w' truncate doesn't
pathologically loop.
Residual cross-process risk (genuinely overrunning materialize past
TTL races a fresh attempt — both write to the same parquet.tmp,
inode-level flock independence means new acquisition succeeds while
old holder is still alive) stays documented in the helper docstring.
In-process threading.Lock keyed on table_id blocks the single-process
race; cross-process protection relies on TTL being well above
longest plausible COPY (24h default).
Replace the analyst-vs-admin `?role=` design summary with the unified
flow we're shipping: single tile, single PAT-mint shape (general /
90 d), `agnes init` mandatory for everyone, marketplace block gated
by `resource_grants`, pre-flight check now validates both git and
claude.
The intro paragraph references the 10-task unification follow-up and
the `?role=` introduction-and-removal cycle so a future operator
reading the diff doesn't think they missed a release.
Plan: docs/superpowers/plans/2026-05-04-unified-setup-prompt.md task 9.
Three Devin Review findings on PR #173 addressed in one commit since
they're in adjacent code paths:
1. cli/commands/init.py:99 (\u{1F534}): `agnes init --token NEW` ran
step 2 verify against the OLD on-disk token because `get_token()`
read `~/.config/agnes/token.json` before the env var, and
`_override_server_env` only set the env var. So `agnes init --force`
on a machine with a stale token.json failed 401 with a confusing
'token expired' even though the --token arg was valid.
Fix: ContextVar-based override in `cli.config._token_override`
checked by `get_token()` BEFORE the on-disk read.
`_with_token_override` context manager scopes the override.
`_override_server_env` now also sets the contextvar via
`_with_token_override(token)`, so both env var and contextvar
carry the override (env for back-compat with anything bypassing
get_token; contextvar is the authoritative source).
Async-safe (each task sees its own override) and leak-proof
(resets on context exit).
2 new tests: regression on stale-disk-token + scope leak guard.
2. cli/commands/status.py:43 (\u{1F7E1}): sessions_pending_upload only
checked legacy `<workspace>/user/sessions/` and always reported 0
in workspaces bootstrapped with `agnes init` (Claude Code writes
to `~/.claude/projects/`, not the legacy path). Same bug we fixed
for `agnes push` in 08e49591.
Fix: route through `cli.lib.claude_sessions.list_session_files()`
so status and push agree on what counts as a pending session.
3. connectors/bigquery/extractor.py:111 (\u{1F7E1}): docstring claimed
"a live holder still wins the second flock attempt" — incorrect on
Linux. After `unlink()` + `open()`, the new file is a new inode;
fcntl.flock keys per-inode, so the old holder's lock does NOT block
the new acquisition. In a genuine TTL-overrun scenario two writers
CAN race the parquet.tmp.
Fix: documentation only. Comment now honestly describes the
inode-recreation behavior, names the threading.Lock as the actual
in-process guard, and flags pid-gating as the next-iteration fix
if real corruption surfaces. The 24h default TTL is well above
typical COPY durations so the practical risk is low.
Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader
regression set.
Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.
Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.
CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.
Pre-fix `agnes pull` decided what to download from sync_state hash
equality alone:
if server_hash != local_hash or tid not in local_tables or not server_hash:
to_download.append(tid)
If the recorded local hash matched server but the actual parquet had
been deleted from disk, the download was skipped. The next DuckDB
view rebuild then fails on a missing file. Repro: `rm
server/parquet/X.parquet && agnes pull` → 'Updated 0 tables', X
still missing.
Failure modes that produce hash-equal-but-file-missing:
- manual `rm` of a single parquet
- operator-side cleanup of `server/parquet/`
- two workspaces sharing one user's
`~/.config/agnes/sync_state.json` (TODO(workspace-scoped-sync-state)
in pull.py): one workspace writes its parquets, the other reads
sync_state and concludes 'I already have these'
- disk corruption / partial restore from backup
Fix: existence check runs alongside the hash compare. Missing file
forces a re-download regardless of hash equality. `parquet_dir` is
hoisted above the loop so the existence check is in scope when the
download set is built.
Tests: regression test for the hash-equal-but-missing-file case +
counterpart for the fast-path (hash-equal-and-file-present must
still skip).
Closes#171. The /api/query cost guardrail used to dry-run a synthetic
`SELECT * FROM <table>` for each registered remote-BQ row referenced
by the user SQL — which made BigQuery estimate a full table scan, with
column projection, predicate pushdown, and partition pruning all
disabled. Narrow queries on big partitioned/clustered tables (the
documented happy path for `agnes query --remote`) hit ~30,000×
over-estimates and got rejected with 400 `remote_scan_too_large` even
when BQ's own dry-run reported single-digit MB.
Pavel's report on #171 traced the root cause and proposed the fix:
rewrite the user SQL to BQ-native syntax and dry-run it as a single
job, exactly the way `bq query --dry_run` works.
Implementation:
- New helper _rewrite_user_sql_for_bq_dry_run rewrites bare registered
names (word-boundary, case-insensitive, longest-first to avoid prefix
collisions) + bq."<ds>"."<tbl>" forms to backticked
`<project>.<ds>.<tbl>` paths.
- _bq_quota_and_cap_guard runs ONE dry-run on the rewritten SQL. Cap
check uses the real estimate.
- Fallback path: if BQ rejects with bq_bad_request (e.g. DuckDB-only
syntax like ::INT casts), the guard falls back to the pre-fix
per-table SELECT * approach so non-portable queries still get a
(loose) cap estimate instead of fail-opening. Non-parse BQ errors
(forbidden, upstream) still propagate as 502.
- _bq_guardrail_inputs now also returns name_lookups so the rewriter
has the (registered_name, bucket, source_table) mapping it needs.
- Per-table breakdown is unavailable from a composite dry-run; total
bytes are pinned to dry_run_set[0] for the post-flight
record_bytes(sum(...)) call to keep returning the right total.
Tests (7 new, 3 existing still pass):
- dry-run receives rewritten user SQL with WHERE clause intact (the
load-bearing assertion for #171)
- single dry-run per request even with multiple registered tables
(JOIN, UNION) referenced
- fallback to per-table SELECT * on bq_bad_request
- non-parse BQ errors (forbidden) still 502
- rewriter unit tests: bare + bq.path in same SQL, longest-name-wins
on prefix collision, case-insensitive bare-name match
Ports Minas's PR #172 (against pre-rename `da` CLI on main) and applies
the principle to the post-rename `agnes` CLI. Two distinct failure modes
on Windows consoles whose default codepage is cp1250 (cs-CZ) / cp1252
(en-US):
1. `agnes pull` and other Rich-progress codepaths
UnicodeEncodeError on Braille spinner glyphs. Fix: `cli/main.py`
reconfigures stdout/stderr to UTF-8 with errors='replace' at import
time on `sys.platform == 'win32'` so Rich's legacy-Windows render
path emits decodable bytes. Wrapped in try/except so pytest's
captured streams (which aren't TextIOWrapper) don't break.
2. `agnes skills list` and `agnes skills show`
UnicodeDecodeError when reading skill markdown containing em-dashes /
accented chars. Default `Path.read_text()` uses
locale.getpreferredencoding(False), which is the broken codepage on
Windows. Fix: every call site passes encoding='utf-8' explicitly.
Broader scope than #172 because:
- The bootstrap rewrite renamed/removed several files Minas's PR
patched (`cli/commands/analyst.py` -> rolled into init.py;
`cli/commands/sync.py` -> split into pull/push). Those targets no
longer exist; the equivalent code lives in init.py.
- Other call sites Minas didn't touch (still bare in his branch) are
patched here too — config.py / update_check.py / snapshot_meta.py /
setup.py / skills.py — so the codebase has zero locale-default text
I/O in cli/.
Side cleanup: stale `Run `da`` reference in snapshot_meta.py:88 fixed
to `agnes` while touching the file.
CI failure: test_readers_in_pre_init_dir asserted no Traceback in stderr
when running `agnes snapshot create x --as y --estimate` in a folder
that never saw `agnes init`. The estimate-guard fix in 3d587681 let
`--estimate` skip the local_db check and reach `api_post_json`, but the
existing `except V2ClientError` doesn't cover transport-layer failures.
With no server configured the URL defaults to http://localhost:8000;
httpx raises ConnectError → ConnectError isn't a V2ClientError → the
exception bubbles up through Typer/rich as a full traceback.
Add `except httpx.HTTPError` next to V2ClientError so connection /
DNS / TLS / timeout failures all render the friendly hint
`Run `agnes init …` first` instead of leaking transport noise.
Real bug: `agnes push` was reading `<workspace>/user/sessions/`, but
Claude Code writes session jsonls to `~/.claude/projects/<encoded-cwd>/`
and nothing on the analyst side ever copies them across. The SessionEnd
hook ran `agnes push` happily and uploaded zero sessions every time.
`cli/lib/claude_sessions.py` probes both Claude Code encoding variants
(older `/`→`-` keeping spaces+tildes; newer all-non-alphanumeric→`-`
with collapsed runs) and unions whichever exist. Users who upgraded
Claude Code mid-project end up with both encoded dirs side-by-side on
disk; the union ensures no session is left behind. Same-named jsonl in
both dirs → newest mtime wins. `<workspace>/user/sessions/` survives as
a fallback for any setup that explicitly mirrors sessions there.
Verified on real disk: helper returns 2 dirs + 8 unioned session files
for the Agnes-test workspace where the previous code returned 0.
CHANGELOG: rename [Unreleased] → [0.32.0] — 2026-05-04, prepend a new
empty [Unreleased] for next-PR landing zone.
pyproject.toml: version 0.31.0 → 0.32.0.
Per repo discipline (memory: feedback_release_cut_with_pr.md), the
release-cut commit lands as the FINAL commit of the PR that contained
the user-visible behavior change — it does not get a separate PR.
After merge: tag v0.32.0 on the merge commit + create a GitHub Release
(memory: feedback_github_release_per_tag.md — the tag alone isn't
enough; the Release prose is the operator-visible artifact).
Headline: closes#160. da query --remote now resolves query_mode='remote'
BQ rows whose entity is VIEW or MATERIALIZED_VIEW (the bug Pavel hit).
Plus 4 reinforcing fixes — server-side cost guardrail (bq_max_scan_bytes,
default 5 GiB), registry-gating of direct bq.* paths, bigquery_query()
function-call backdoor closed, structured CLI render of typed BQ errors —
and one operator-side admin convenience (BQ test-connection endpoint +
billing_project placeholder UI).
14 issues caught and addressed across 6 iterations of Devin Review.
E2E verified on agnes-zsrotyr.groupondev.com (commit 7f743d03):
- VIEW path resolves (count=23 from active_inventory_view)
- VIEW aggregate parity vs filtered BASE TABLE
- cost guardrail rejects with structured 400 detail
- bq_path_not_registered 403 (incl. quoted "bq" variant)
- bigquery_query() blocklist 400
- test-connection endpoint 200 with elapsed_ms
Devin Review iter #6 found 2 issues.
🟡 BUG: cli/error_render.py filtered out empty-string values via
`detail[key] not in (None, "")` and `value not in (None, "")` before
they could reach `_kv_line`. But `_kv_line` was specifically designed
to render empty strings as `(empty)` — the filter shadowed that
branch. The hidden field happens to be the most operator-actionable
one in `cross_project_forbidden`: `billing_project: ""` is the exact
diagnostic confirming WHY USER_PROJECT_DENIED fires.
Change filter to `is not None`. Empty strings now flow through
`_kv_line` and render as `billing_project: (empty)`.
📝 ANALYSIS: CHANGELOG wording for the test-connection endpoint said
"the saved data_source.bigquery config", which Devin flagged as
slightly misleading because `get_bq_access` is `@functools.cache`d —
"Test connection" tests the config in the running process, not the
just-saved YAML overlay. The save flow already returns
`restart_required: True` and the UI shows a banner, so the behavior
is documented; only the CHANGELOG wording was loose. Tightened to
"the **process-cached** BqAccess... Tests the config active in the
running process — after a save the response includes restart_required;
click Test AFTER restart to validate the freshly-saved values."
New test: test_renders_empty_string_as_empty_marker locks in the
empty-string-as-(empty) rendering for the cross_project_forbidden
case so a future filter change won't silently drop the diagnostic
again. 9 affected render tests pass.
Devin Review on PR #168 found 5 issues — all real, all addressed.
🚩 ANALYSIS_001 (architectural): concurrent-slot guard didn't protect
actual BQ query execution. Earlier `_enforce_remote_bq_quota_and_cap`
ran dry-run + cap check inside `with quota.acquire(user_id):`, then
returned — releasing the slot BEFORE `analytics.execute(...)` ran. Spec
§4.3.3 explicitly designs the slot to wrap execute so the per-user
concurrent cap limits BQ scans, not just dry-runs.
Refactor to a context manager `_bq_quota_and_cap_guard`. Caller's `with`
block now holds the slot through dry-run, cap check, the actual
`analytics.execute(...)` (which is what triggers the BQ scan when DuckDB
resolves the master view), AND the post-flight record_bytes. Slot
released only when caller's `with` body exits.
🟡 BUG_001: placeholder JS walked `original` (full GET payload root)
instead of `original.sections`. `placeholder_from: ["data_source",
"bigquery", "project"]` is a section-relative path, so billing_project
placeholder NEVER rendered. Fix: walk `original.sections` (with fallback
to `original` for safety).
🟡 BUG_002 + BUG_003: admin_tables.html register and edit modals'
operator help text referenced `max_bytes_per_remote_query` (the old
name from the spec) but the actual config key is `bq_max_scan_bytes`
after the fix-up commit `6423888d` moved it. Replace both occurrences.
🟡 BUG_004: CHANGELOG entry said `api.query.bq_max_scan_bytes` (the
old path) but the read at app/api/query.py:53 is
`get_value("data_source", "bigquery", "bq_max_scan_bytes", ...)`. An
operator who set it under `api.query` in their yaml would have no
effect. Correct path in CHANGELOG.
All 95 #160-affected tests pass after the changes.
Fixes the rails docs that PR #154 over-promised. The reporter (#160)
tried `da query --remote` against a VIEW row and saw a catalog error;
the previous version of the docs said this would work as a one-shot
server-side execution. Now it actually does (see prior commits), but
the docs also need to acknowledge the new cost guardrail and the
registry-gated direct-bq path.
Touched files:
- **CLAUDE.md** (root, "Querying Agnes data — agent rails"): the
`da query --remote` bullet under "Choose the right tool" now spells
out the BASE TABLE vs VIEW/MATERIALIZED_VIEW pushdown asymmetry +
the 5 GiB scan cap + the registry-gating of direct bq.* paths.
"When NOT to use `da fetch`" decision matrix updated with a separate
row for VIEW aggregates so analysts see why the cap might trip.
- **config/claude_md_template.txt** (PR #154's analyst CLAUDE.md):
three-patterns table caveat for the cost guardrail.
- **cli/skills/agnes-data-querying.md**: `When NOT to use da fetch`
matrix updated with the same VIEW caveat + registry-gating note.
- **cli/skills/agnes-table-registration.md:121**: replaced the
example that suggested raw `bq."<project>.<dataset>.<table>"` syntax
(now blocked by the RBAC patch) with the registered-name form.
- **CHANGELOG.md**: full Unreleased entry with Added (Test Connection
endpoint + cost-cap server-config knob + placeholder UI), Fixed (the
five #160-class fixes: VIEW resolution, RBAC patch, blocklist,
bigquery_query() blocking, CLI render, hybrid endpoint detail
flattening), Changed (BREAKING legacy_wrap_views removal + quota
relocation).
140 tests pass across the issue-affected files.
- CHANGELOG.md: add Agent Workspace Prompt bullets under [Unreleased]; remove
stale BREAKING (CLI) and BREAKING (API) bullets about CLAUDE.md removal and
GET /api/welcome deletion — both behaviors are restored in this PR; replace
with a neutral Changed bullet describing da analyst setup writing CLAUDE.md
- docs/agent-workspace-prompt.md: operator reference for the feature (when
written, editing via UI/API, template language, full placeholder table,
Jinja2 examples, reset to default)
The /admin/agent-prompt editor now pre-fills with the full bash bootstrap
script from setup_instructions.resolve_lines() instead of being empty.
When an admin saves an override it replaces the default everywhere — the
/setup page display and the dashboard clipboard CTA — rather than adding a
banner above the auto-generated commands.
GET /api/admin/welcome-template now returns a `default` field with the live
computed script so the editor always shows meaningful starting content.
{server_url} and {token} single-brace placeholders survive Jinja2 rendering
and are substituted by JavaScript at clipboard-copy time as before.
Preview pane switches to textContent (not innerHTML) since content is bash.
Update [Unreleased] to reflect the actual shipped behavior:
- banner on /setup replaces CLAUDE.md generation
- BREAKING: da analyst setup no longer writes CLAUDE.md
- BREAKING: GET /api/welcome removed
- schema v21/v22 notes corrected
- drop sync_interval bullet (not part of this feature set)
Remove the setup_banner feature (admin-editable /setup page banner) and
all associated code: API router, repository, renderer, admin template,
tests, and docs. The setup_page handler no longer calls render_setup_banner;
the install.html template no longer renders banner_html. The setup_banner
DuckDB table (v22) is kept intact for forward-compat with already-migrated
instances — only the application code is removed.
CHANGELOG updated: setup_banner bullets removed; Agent Setup Prompt
(welcome-template feature) now stands alone as the single editable prompt.
Adds an optional Jinja2/HTML banner displayed above the bootstrap
commands on /setup. Empty by default; admin authors it at
/admin/setup-banner. autoescape=True — safe for HTML context.
Render failures return "" so a broken banner never breaks /setup.
Schema v22: setup_banner singleton table, auto-migration v21→v22.
`/api/health` is the auth-free LB probe — returns `status` + `db_schema`
only. `version` lives in `/api/version` and the richer
`services.duckdb_state` lives in `/api/health/detailed` (auth-gated).
The two e2e asserts had drifted and broke nightly on main.
* security(auth): per-IP rate limit on auth endpoints + generalize last-admin guard
Closes#45 and #151.
#45 — every auth endpoint was unthrottled (login, magic-link, token,
bootstrap), leaving us open to password brute-force and SMTP
email-bombing. Wires slowapi (new dep) into the middleware chain with
per-route limits: 10/min on login + token, 5/min on send-link, 3/min on
bootstrap. Returns 429 with Retry-After: 60 once exceeded. Per-IP key
respects the leftmost X-Forwarded-For hop (Caddy in front of the app
strips client-supplied XFF). Operator escape hatch:
AGNES_AUTH_RATELIMIT_ENABLED=0. Test suite disables the limiter via
autouse conftest fixture so existing auth tests that hammer endpoints
in tight loops are unaffected.
#151 — DELETE /api/admin/users/{id}/memberships/{group_id} and the
mirror DELETE /api/admin/groups/{group_id}/members/{user_id} only
guarded against self-removal as last admin. Generalizes to refuse
removing anyone from the seeded Admin group when they are the only
remaining active admin (mirrors the existing
count_admins(active_only=True) <= 1 check on delete_user / update_user).
Recovery from zero admins requires direct DB access, so this closes
a path where a scheduler/bootstrap actor that bypasses normal admin
checks could otherwise empty the group.
* security(auth): throttle remaining email-bombing + token-confirm endpoints
Address code-review gap on PR #165 — the first commit covered /send-link
but missed two endpoints with the IDENTICAL email-bombing surface:
- POST /auth/password/reset — sends reset mail, anti-enum response
- POST /auth/password/setup/request — sends setup mail, anti-enum response
Both now share the 5/min limit with /send-link.
Also add 10/min to the token-confirm surfaces — high-entropy tokens but
partial leaks via logs / referer have surfaced before, and unbounded
guess rate would let an attacker exhaust the keyspace adjacent to a
leaked prefix:
- POST /auth/email/verify
- GET /auth/email/verify — closes the click-through bypass
- POST /auth/password/reset/confirm
- POST /auth/password/setup/confirm
Doc fix: rate_limit.py module docstring + CHANGELOG entry no longer
claim "disable without a redeploy" (misleading). The Limiter constructor
freezes `enabled` from env at import time, matching every other Agnes
env knob — operators set the flag and bounce the container.
Tests: 4 new cases in test_auth_rate_limit.py covering
/reset, /setup/request, /reset/confirm, GET /verify. Full suite:
2583 passed, 32 skipped, 0 failed.
* security(auth): throttle JSON /auth/password/setup — closes form-throttle bypass
Second code-review pass on PR #165 caught a fifth gap: POST /auth/password/setup
(JSON variant, kept for backward compat) consumes the same setup_token as
the web form /setup/confirm but was unthrottled — an attacker brute-forcing
the token just switches from the form path to the JSON path and resumes
at unbounded RPS. Apply the same 10/min limit and signature shape used
on /setup/confirm.
Also extend CHANGELOG note about the JSON-variant bypass for future
operators reading the security entry.
Test: 1 new case (test_password_setup_json_rate_limited_after_10_requests),
9 rate-limit tests + 28 password-flow tests + 41 auth-provider tests pass,
no regressions.
* chore(release): cut 0.30.1 — auth security hardening (rate limit + last-admin guard)
E2E sub-agent finding: `da query --remote "SELECT * FROM <id>"` against a
materialized table that hasn't yet been rebuilt in the server's
analytics.duckdb returns a confusing DuckDB "Table does not exist"
message even though the table is in the registry. Materialized rows
produce parquets at `${DATA_DIR}/extracts/<source>/data/<id>.parquet`,
but the orchestrator's master-view creation is `_meta`-driven — fresh
instances or pre-tick states have the registry row without a
corresponding view, so analysts hit the bare "does not exist" with no
path forward.
Improve the error rendering in `app/api/query.py:execute_query`. When
DuckDB raises a "table does not exist" error, scan the registry for any
`query_mode='materialized'` row whose id or name appears in the failed
SQL. On a hit, return a 400 whose detail names the table, explains the
materialize state, and offers two concrete next steps:
1. Run `da sync` (or wait for the scheduler tick / hit
POST /api/sync/trigger) to materialize the parquet, OR
2. Query the source directly via the catalog alias when the registry row
carries bucket+source_table (e.g. `bq."dataset"."table"` for BigQuery,
`kbc."bucket"."table"` for Keboola).
Detection is bounded — the registry round-trip only fires when DuckDB's
error mentions a missing table, so happy-path queries pay no cost.
Non-materialized unknowns fall through to DuckDB's raw error.
2 new tests: materialized id surfaces the hint with the bucket+source_table
payload; unknown table falls back to the generic error path with no false
positive on the new hint.
E2E sub-agent finding: instance configured with `data_source.type='bigquery'`
and no `data_source.keboola.*` block. Admin POSTs `{source_type: 'keboola'}`
to /api/admin/register-table → returns 201, row lands in the registry, but
never syncs because the scheduler has no Keboola URL/token to ATTACH
against. Operator only notices the gap when `da catalog` keeps showing
nothing.
The new `_validate_source_type_configured` helper runs immediately after
the id/view-name collision checks in `register_table`. A source_type is
considered configured when:
- it matches `get_data_source_type()` (the instance's primary), OR
- a non-empty `data_source.<source_type>` block exists in the effective
`instance.yaml` (multi-source instance), OR
- it's in `_SOURCE_TYPES_INDEPENDENT_OF_DATA_SOURCE` (Jira / local — both
get data through paths that don't involve `data_source.*`).
Returns 422 with a message that names the configured primary source and
points at `/admin/server-config` for enabling a secondary one. None /
empty source_type is still tolerated for backward compat with legacy CLI
scripts that don't set the field — the route resolves it later.
5 new tests cover: keboola-on-bq rejected, bq-on-keboola rejected,
matching source_type still works, jira allowed regardless, omitted
source_type passes through.
Existing tests that registered Keboola rows on the unconfigured default
test instance now opt into a `keboola_instance` fixture to satisfy the
new validator (tests/test_admin_bq_register.py + .keboola_materialized
+ .unregister_cleanup; the multi-source PUT test in test_admin_bq_register
adds a `keboola` block to its synthetic config).
Pre-existing test_missing_project_returns_error failure in
TestRebuildFromRegistry is unrelated (config-cache leakage from a
previous test in the same class) — confirmed pre-existing on the prior
commit via `git stash` reproduction.
E2E sub-agent finding: register a materialized BQ row → sync to materialize
the parquet at `/data/extracts/bigquery/data/<id>.parquet` → DELETE the
registry row. The DB row goes away but:
- the parquet file stays on disk forever, AND
- the sync_state row stays, so `/api/sync/manifest` keeps advertising the
dropped table to `da sync`, AND
- the orchestrator's next rebuild can resurrect a master view by picking
up the leftover parquet.
Two-part fix in `unregister_table`:
1. For materialized rows on bigquery/keboola, remove
`${DATA_DIR}/extracts/<source_type>/data/<name>.parquet` (and any stale
`<name>.parquet.tmp` from a crashed prior materialize). Filename is
keyed on `table_registry.name` to match sync_state bookkeeping.
File-removal errors are logged but don't fail the DELETE — the registry
row is already gone, and an orphan parquet won't get a master view at
next rebuild because the orchestrator's _meta-driven scan never picks
up bare parquet files.
2. Always clear `sync_state` + `sync_history` rows for the dropped table_id
so the manifest stops advertising the table — applies to all source
types and modes, not just materialized, since any synced row had a
sync_state entry.
Orchestrator-side defensive guard (Finding 2b) is a no-op in the current
implementation: `_attach_and_create_views` only creates master views from
`_meta` rows in each connector's `extract.duckdb`, so a parquet without a
matching `_meta` entry is already invisible to the rebuild. The new
test `test_orchestrator_skips_orphan_parquet_in_extracts` is kept as a
regression guard for that contract.
5 tests cover: BQ + Keboola materialized DELETE removes parquet, remote
DELETE doesn't error trying to remove a non-existent file, sync_state
cleared on DELETE, orchestrator orphan-skip invariant.
E2E testing showed admin POSTs of materialized BQ rows whose source_query
uses BigQuery-native backtick identifiers (`prj.ds.t`) silently no-op'd at
the next sync tick — the materialize path runs the SQL through the DuckDB
BQ extension's COPY which uses DuckDB's parser; backticks aren't recognized
and the query either parse-errors or matches zero rows. No parquet lands at
the canonical path and no error reaches an operator-visible surface.
Two-part fix:
1. RegisterTableRequest's _check_mode_query_coherence model_validator now
rejects any source_query containing a backtick with a 422 + actionable
message pointing at the DuckDB equivalent (bq."dataset"."table"). Same
check is applied in update_table on the merged record so PATCHes that
flip a stored source_query to backtick form are also caught. Covers BQ
AND Keboola materialized rows since both connectors funnel source_query
through DuckDB's COPY.
2. _run_materialized_pass now persists per-row failures via the new
SyncStateRepository.set_error / clear_error methods (existing
sync_state.error / status columns — no schema migration). GET
/api/admin/registry enriches each row with `last_sync_error` from a
single batched SELECT against sync_state, so the admin UI / da admin
status can show "this table failed last sync because: X" instead of
operators having to trawl scheduler logs. Recovered rows have the
error cleared automatically — update_sync's success path resets
status='ok' / error=NULL on the upsert.
The materialized-path test fixture's _materialized_payload helper is
updated to use DuckDB-flavor SQL (the prior backtick example pre-dated the
fix). 6 new tests cover register/update rejection on BQ + Keboola, the
sync_state error persistence, and the registry response surface.
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.
`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.
config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI
Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)
Tests:
- test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
* fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE
OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which
causes strict TLS stacks (rustls / webpki, used by uv, cargo, and
future versions of pip) to reject the cert with
'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9.
Browsers, curl, and OpenSSL-based clients tolerated the violation,
hiding the bug until a uv user hit it.
Affects every VM running on the self-signed fallback while the corp
PKI hasn't published the real chain yet. Fix lands on the next
agnes-tls-rotate.timer tick (or 'systemctl start
agnes-tls-rotate.service' for an immediate refresh). Existing CSR /
real-cert paths unaffected; only the bring-up fallback regenerates.
* chore(release): cut 0.29.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template
Closes#153.
The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.
Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.
## Changes
### config/claude_md_template.txt (+83)
Added a `## Remote Queries (BigQuery)` section covering:
- Discovery first — `da catalog --json | jq '...'` to see all tables
with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
- `da fetch` (preferred) — materialize a filtered subset locally,
query the snapshot, drop when done.
- `da query --remote` — one-shot server-side execution (cheap probes).
- `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
--select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.
### docs/setup/claude_md_template.txt (deleted)
Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.
### CHANGELOG.md
`### Fixed` and `### Removed` entries under [Unreleased].
## Tested
- Manually walked the diff against `da skills show agnes-data-querying`
output on a live VM (foundryai-development) — patterns + flags
match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
is identical to existing template substitution path so render is
not at risk.
## Out of scope
- The companion gap that data_description.md often only enumerates
query_mode: "local" tables (no signal that other modes exist) —
separate concern, fix likely belongs in the metadata generator
on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
`query_mode: "remote"` in the registry — workflow improvement, not
a code bug.
* chore(release): cut 0.28.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration
BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.
Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
→ drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
s historic FK constraints. INSERT picks intersection sloupců, takže
test fixtures s minimal pre-v19 schemou migrate cleanly.
Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
(TABLE je unconditionally enabled)
API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
gate přepnut na can_access(user_id, "table", dataset, conn)
Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
(admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()
CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
(back-compat se starými token.json soubory zachována — pole se ignoruje)
Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
(can_access_table over resource_grants), test_journey_rbac.py
(drop access-request flow), test_resource_types.py (drop env-gate
tests, drop is_public from helpers), test_v2_*.py (drop role-based
user dicts in favor of id-based + Admin group membership),
test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
test_can_access_table_no_implicit_public (test_rbac.py),
test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated
Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)
Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.
Plan: ~/.claude/plans/floofy-coalescing-parnas.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(release): cut 0.27.0
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* refactor(ops): bake all host artifacts into image, drop every curl-from-main
Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.
## Motivation
The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:
docker-compose.yml
docker-compose.prod.yml
docker-compose.host-mount.yml
docker-compose.tls.yml
Caddyfile
scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0)
Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:
1. **Split-brain version pinning.** image_tag pins the docker image to an
immutable digest. The compose files + script bypassed that pinning by
tracking `main` (or the rarely-set compose_ref). A customer pinned to
stable-2026.04.516 could wake up tomorrow with their host artifacts
floating on whatever shipped to main overnight — even though they're
explicitly pinned for stability.
2. **No rollback knob.** Reverting a bad host artifact meant reverting
the upstream PR globally — affects every customer that reboots after
the bad commit. No "rollback for me only" path; tag-pinning gave no
protection.
3. **Public-internet dependency on every boot.** The image is already
pulled from a private registry on the same boot. Reusing that channel
is strictly cheaper than adding a second one. Customers with restricted
egress (no raw.githubusercontent.com reachability) silently broke on
every boot.
## Changes
### Dockerfile (+19 -8)
After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:
agnes-auto-upgrade.sh (mode 0755 — host cron driver)
docker-compose.{yml,prod,host-mount,tls}.yml
Caddyfile (mode 0644)
Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board,
stable path that won't shift if /app structure refactors.
### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)
Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:
docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
chmod +x /usr/local/bin/agnes-auto-upgrade.sh
The auto-upgrade section (#6) is now a no-op — script is already in place.
### infra/modules/customer-instance/variables.tf (+1 -1)
`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.
### CHANGELOG.md
`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).
## Out of scope (filed as follow-ups)
1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
curls the same artifacts from main. Symmetric fix needed there.
Will file as a separate PR against the infra repo.
2. **Self-update inside agnes-auto-upgrade.sh** after a successful
`docker compose pull` of a new digest. Otherwise the running cron
keeps using the OLD baked-in script for one tick after image upgrade.
~10 LOC. Deferred to keep this PR scoped.
3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
bash currently sourced via the infra repo. Should follow the same
bake-into-image pattern.
## Tested
- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
this Dockerfile change. **Recommend reviewer trigger CI build, then
do a single VM-recreate against a dev VM (e.g. foundryai-development)
to confirm the extract path works end-to-end before merge.**
## Compatibility
- Existing VMs running 0.25.0 are unaffected — they have host artifacts
in place from `curl from main` already; this PR doesn't touch them.
They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
in the image) would FAIL the docker cp. Current diff fails-loud (no
fallback). Recommend operators upgrade to a fresh-enough image_tag
alongside the template upgrade — same coupling as any compose-flag bump.
* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances
The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.
Devin ANALYSIS_0004 on PR #149.
* fix(changelog): mark BREAKING + drop private-repo reference
Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.
Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.
Devin BUG_0001 + WARN_0001 on PR #149.
* chore(release): cut 0.26.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>