* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)
Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.
Coverage:
- FastAPI 500 handler captures unhandled exceptions
- src/orchestrator.py rebuild + rebuild_source failures
- services/scheduler/ HTTP-job failures
- cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
skipped; flushes before re-raise so short-lived CLI invocations don't
drop events)
- connectors/llm/anthropic_provider.py + openai_compat.py emit
$ai_generation events with provider, model, latency, token counts
(prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
because LLM prompts here routinely include customer SQL/data)
- Browser snippet injected into every text/html response by
PosthogInjectionMiddleware — registered inside the GZip layer so it
sees uncompressed HTML before compression. Many templates are
standalone (their own DOCTYPE) and never extend base.html, so a
per-template include would miss them.
- Frontend: $pageview, $pageleave, JS error capture via window.error
and unhandledrejection handlers, masked session replay
(maskAllInputs: true plus CSS-selector mask for known data surfaces),
feature flags (browser posthog.isFeatureEnabled + server-side
feature_enabled with fallback for older SDKs).
Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).
Files:
- src/observability/posthog_client.py — lazy singleton, no network
when disabled, single-process flush on shutdown
- src/observability/llm_tracing.py — trace_generation context manager
- app/middleware/posthog_inject.py — HTML rewrite middleware
- app/web/templates/_posthog.html — browser snippet template
- docs/observability.md — operator guide
- config/.env.template — documented POSTHOG_* knobs
- tests/test_posthog_disabled.py + tests/test_posthog_client.py +
tests/test_llm_tracing.py — 18 tests covering disabled state,
identify-mode payloads, $ai_generation shape, error variant.
CHANGELOG entry under [Unreleased] Added.
* feat(observability): tag every PostHog event with environment + release
Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.
- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
the loaded callback so $pageview, $exception, autocapture, etc. all
carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
can actually call posthog.identify(user_id, {email}) for logged-in
users (previously the user block always resolved to None because
nothing wrote to request.state.user).
4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.
* feat(observability): inline user attrs on every PostHog event + debug throw route
PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.
- Backend capture_exception merges user_id / user_email / user_name into
the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
properties via posthog.register({...}) so every $exception, $pageview,
and custom event coming from posthog.captureException() carries them
inline. Mirrors the backend so cross-referencing client/server events
doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
Runs Depends(get_current_user) first so request.state.user is set when
the unhandled-exception handler captures the event. Lets operators
exercise the full observability path end-to-end without hand-rolling
a TestClient script. Configurable via ?kind=ValueError&msg=...
7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.
* fix(observability): address minasarustamyan PR #231 review
Two bugs caught in review.
1. PosthogInjectionMiddleware dropped Response.background on every
return path. BaseHTTPMiddleware materialises the body and asks
subclasses to return a fresh Response — three paths in dispatch()
omitted background=, silently cancelling any BackgroundTask /
BackgroundTasks the route attached (audit logging, async webhooks,
email sends) with no log line. Fix: route every return through a
_passthrough() helper that forwards background.
Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
can't balloon RSS during buffering. Bigger bodies short-circuit
through with a warning rather than being injected.
Regression tests in tests/test_posthog_inject_middleware.py exercise
four return paths (snippet present, render-fail, double-injection
guard, non-HTML passthrough) plus the streaming-guard short-circuit.
2. $ai_input / $ai_output_choices were emitted without truncation, so
POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
per-event ingest limit — exactly the calls (large prompts with
schemas / sample rows / SQL) an operator would want to inspect.
Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
an explicit "…[truncated N chars]" marker so readers don't mistake
truncated captures for complete ones. Metadata (provider, model,
tokens, latency, error) flows regardless. Three new tests cover
default-cap clipping, env-override, and pass-through under the cap.
37 PostHog tests pass.
## Summary
Two minimum-viable fixes after today's 0.44.0 → 0.47.3 release train and the production 30-user launch. Devil's advocate review of a 3-PR / 7-item plan cut scope to these 2 — the rest is deferred to a separate "operate-first, instrument-second" backlog item.
### B2 — Docker session_collector log skip
`services/session_collector` was logging `Collection complete: 0 users, 0 files copied` + `WARNING: Group 'data-ops' not found, using default group` every 10 minutes in the Docker layout (where `/home/*/user/sessions/` doesn't exist). New env var `AGNES_SKIP_LEGACY_COLLECTOR=1` set by default in `docker-compose.yml` short-circuits the collector pass.
The bare-VM deployment path (where /home/* IS populated by Claude Code) leaves the env var unset and continues to scan normally — including the data-ops warning, which is load-bearing for catching missing-group mis-deploys.
### O2 — FIFO check in `_check_session_pipeline`
The existing check compares `MAX(processed_at)` to newest jsonl mtime — catches "detector hasn't run lately" but blind to "old file was skipped while newer ones were processed". New code finds the oldest FS jsonl that's NOT in `session_extraction_state.session_file` and flags if its mtime is older than `SESSION_PIPELINE_STUCK_FILE_GRACE_SECONDS` (default 4× the existing grace = 2h).
Severity intentionally starts at `info` so we can collect prod data on false-positive rate before tightening to `warning`. The aggregator already treats `info` as non-promoting (see the severity vocabulary docstring at the top of `app/api/health.py`), so the headline `status` stays at `healthy` even when this fires — the operator sees the entry in the per-check breakdown but no spurious `degraded` overall.
## Test plan
- [x] `pytest tests/test_session_collector.py` — 17 tests pass (existing 9 + new 8 covering env-set/unset, truthy variants, falsy non-skip).
- [x] `pytest tests/test_health_session_pipeline.py` — 8 tests pass (existing 4 + new 4 FIFO tests covering stuck-file, under-threshold, all-processed, env-override).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/229" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`agnes self-upgrade` without `--force` previously short-circuited on the local 24h `update_check.json` cache. After a server-side version bump within that window, the explicit command exited silently as a no-op — empirically observed today when prod 0.47.1 → 0.47.2 didn't propagate.
Fix: always invalidate the cache in `_resolve_info`. The cache still gates the implicit warning loop in the root callback (correctly — that runs on every `agnes <anything>` and can't hammer `/cli/latest`).
## Test plan
- [x] New `test_self_upgrade_bypasses_24h_cache_without_force` — stale cache claims current; mocked server reports newer; assert UpdateInfo carries the newer version, not the cached one.
- [x] Existing self-upgrade tests pass (including `--force` semantics — force is now downstream-only, behavior preserved).
## Summary
Smoke-testing the just-shipped 0.47.1 against production exposed two regressions:
1. `agnes query --remote "SELECT FROM unit_economics WHERE bad_col=1"` returned `Table "unit_economics" must be qualified` (the OLD error) instead of `Unrecognized name: bad_col` (the #218 fix's intended behavior).
2. `agnes query "DESCRIBE unit_economics"` showed only DuckDB's misleading `Did you mean order_economics?` with no Agnes hint paragraph (the #219 fix is missing).
Root cause: PR #217's squash merge (`506a378c`) carried stale snapshots of `app/api/query.py` and `cli/commands/query.py` from before #218 and #219 merged. The rebase-and-merge auto-merged those files cleanly (no conflict markers) but the result silently reverted both fixes.
Restore the two changes verbatim. Tests for both fixes already on main and continue to pass against the restored code.
## Test plan
- [x] `pytest tests/test_api_query_guardrail.py tests/test_cli_query.py` — clean
- [x] Manual repro against prod after deploy: both flows now surface the intended diagnostic.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/225" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):
- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).
## Architecture
- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
## Test plan
- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
- Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
- Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)
## Bugs caught + fixed during E2E
The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:
1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
## Operator notes
- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).
## Spec source
The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
- Catalog enrichment for `query_mode='remote'` rows: `rows`, `size_bytes`, `partition_by`, `clustered_by` per table (BQ + Keboola providers).
- `/api/v2/schema/{id}` cache miss: 2 BQ jobs → 1 (-50%) via shared `fetch_bq_columns_full`.
- All four catalog/schema/sample/metadata caches flush on registry change; single-row re-warm scheduled.
- Automatic cache warmup at server startup (bounded concurrency, opt-out via `AGNES_SKIP_CACHE_WARMUP=1`).
- SSE-driven freshness toolbar on `/admin/tables` with progress bar, log, and per-row badge.
- New admin doc `docs/admin/query-modes.md` — single source of truth on `local` / `remote` / `materialized` choice.
Closes#155.
Closes#156.
## Test plan
- [x] 65+ targeted tests pass across 11 new test modules + 3 modified ones.
- [x] No DB migration; no wire-break; `MIN_COMPAT_CLI_VERSION` unchanged.
- [ ] Reviewer: register a remote BQ table via `/admin/tables`, observe the toolbar populates within ~2 s and the per-row badge transitions warming → fresh.
- [ ] Reviewer: trigger `Re-warm all`, verify SSE log scrolls and `cacheWarmupBar` progresses.
- [ ] Reviewer: edit a registered row's bucket, verify `agnes schema <id>` returns updated columns immediately (no 1-hour staleness).
- [ ] Reviewer: confirm `agnes admin register-table --query-mode remote` prints the new IAM-smoke-check hint.
## Notable design decisions
- BigQuery `INFORMATION_SCHEMA.TABLE_STORAGE` is the only valid scope for size+rows (verified live 2026-05-07; dataset-scoped doesn't exist). Region resolved from `instance.yaml.data_source.bigquery.location` → `bq.client().get_dataset(...)` → fall back to legacy `__TABLES__`.
- VIEW handling: TABLE_STORAGE returns no rows for views, fall through to `__TABLES__` (also empty) → `TableMetadata(rows=None, size_bytes=None, partition_by=..., clustered_by=...)`. Null size signals analyst Claude to apply existing CLAUDE.md guidance.
- `size_bytes` is `active_logical_bytes + long_term_logical_bytes` — full BQ scan reads both; reporting only active undercounts aged partitioned tables.
- Source-agnostic provider seam: per-source `connectors/<source>/metadata.py:fetch(MetadataRequest)`; dispatcher in `app/api/v2_catalog.py:_metadata_provider_for` lazily imports per source_type so a Keboola-only deployment doesn't pay the BQ-extension import cost.
- Warmup non-blocking: FastAPI `lifespan` schedules `asyncio.create_task(_warm_catalog_caches_bg)` before `yield`. Per-row failures isolated.
## Out of scope
- Profile / column histograms / dimension cardinality for remote tables (separate issue).
- Onboarding nudge ("you have 0 remote tables, consider registering some BQ ones") — separate UX call.
- Provider plug-in registration via entry-points (the dispatch table is a hardcoded if-tree today; one line per future source).
## Release
Bumps `pyproject.toml` 0.46.1 → 0.47.0 (main shipped 0.46.0 + 0.46.1 during this PR — see commit `d98976ec`). New CHANGELOG section under `## [0.47.0] — 2026-05-07`.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/223" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Two bugs in `agnes describe` surfaced from a real analyst session following the CLAUDE.md agent-rails discovery workflow. Together they break `agnes describe` end-to-end for any analyst (or analyst-AI) who follows the documented form.
### A) CLI parsing
`agnes describe TABLE -n 5` failed with `Missing argument 'TABLE_ID'`. Root cause: the command was registered as a `Typer.Typer` subcommand group via `app.add_typer(describe_app, name="describe")` + `@describe_app.callback(invoke_without_command=True)`, and that pattern mis-parses positional + short-int option in some orderings. Same pattern in `cli/commands/schema.py` works only because schema has no INTEGER short option. Fix: switch to flat `@app.command("describe")`.
### B) Server NaN
`/api/v2/sample/<id>` (called by `agnes describe`) returned HTTP 500 with `ValueError: Out of range float values are not JSON compliant: nan` whenever a row contained NaN. Fix: sanitize NaN/±inf to None before JSON serialization.
## Test plan
- [x] `pytest tests/test_cli_describe*.py` — added regression tests pinning `-n` parsing on either side of the positional.
- [x] `pytest tests/test_api_v2_sample*.py` — added regression test for NaN row → JSON `null` (not 500).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/224" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`claude -p` (headless mode) gives SessionEnd hook subprocesses ~1 second before SIGTERM, regardless of work in progress. `agnes push` for a typical workspace takes 5-30s. The current synchronous SessionEnd hook (`agnes push --quiet 2>/dev/null || true`) was therefore being killed mid-first-upload — `|| true` masks the SIGTERM as exit 0, so this regression was invisible until I traced it via a wrapper script and Claude's `~/.claude/debug/<sid>.txt` log.
Fix: wrap SessionEnd push in `bash -c "( nohup agnes push --quiet </dev/null >/dev/null 2>&1 & ) ; true"`. The subshell exits immediately, orphaning the upload child to init so it survives the hook subprocess kill. Same `bash -c` pattern as the existing `refresh-marketplace` SessionStart entry (for Windows compatibility).
End-to-end verified against production: claude exited in 5s, detached child completed the upload, file `491e3a23-...jsonl` landed on the server within 30s with mtime 14:30 UTC.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — added `test_session_end_push_is_detached` regression test asserting `nohup`, `&`, `</dev/null` are all present.
- [x] `pytest tests/test_setup_hooks_template.py` — assertions loosened from `==` to `in` where necessary.
- [x] Verified end-to-end against production with the detached wrapper before opening this PR (manual probe).
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/222" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
Verified against production: `claude -p` headless mode doesn't fire SessionEnd hooks (proven via `--output-format stream-json --include-hook-events`: zero `SessionEnd` events), so any session JSONLs from `-p` invocations stay orphaned locally and never reach the server. Fix: add `agnes push --quiet` as a third SessionStart entry — symmetric self-heal alongside the existing `agnes pull` entry. Existing workspaces pick this up on their next `agnes init` via the marker-based migration already in `cli/lib/hooks.py`.
Separately: a colleague's fresh install showed `agnes diagnose` warning "uploads are not being processed", which led them to suspect their `agnes push` was broken. The warning is actually about the LLM-based `verification-detector` backlog (uploads themselves were arriving fine — confirmed by 23+3 JSONLs landed on the server while the warning was firing). Reword the warning to "verification-detector backlog" + add `last_processed` to the diagnose dict so operators don't have to grep logs to confirm.
## Test plan
- [x] `pytest tests/test_lib_hooks.py` — updated count + added `agnes push in SessionStart` assertion.
- [x] `pytest tests/test_setup_hooks_template.py` — updated.
- [x] `pytest tests/test_clean_install_integration.py` — updated.
- [x] `pytest tests/test_health_session_pipeline.py` — updated warning text + asserted `last_processed` field.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/220" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
`agnes query "DESCRIBE unit_economics"` (where `unit_economics` is `query_mode='remote'`) previously returned DuckDB's nearest-name suggestion (`Did you mean "order_economics"`?), sending users down the wrong path. Now appends a friendly hint about remote tables.
Reproduced from a real analyst session — colleague spent ~30s diagnosing what was actually "this is a remote table, not materialized locally".
## Test plan
- [x] New test: `_query_local("DESCRIBE unit_economics", ...)` against an empty local DuckDB triggers the new hint, original DuckDB error still echoed.
- [x] Negative test: a syntax-error query does NOT trigger the hint (regex only matches "Table with name X does not exist").
- [x] `pytest tests/test_cli_query*.py` clean.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/219" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
## Summary
When `agnes query --remote` references a column that doesn't exist on the FROM table, users were seeing `Table "<id>" must be qualified with a dataset` instead of the actually-useful `Unrecognized name: <column>` from BigQuery. Surface the first-attempt diagnostic now; keep the second-attempt context as `underlying_original`.
Reproduced against production:
```
$ agnes query --remote "SELECT COUNT(*) FROM unit_economics WHERE authorize_date = DATE '2025-05-06'"
Error: remote_estimate_failed (HTTP 400)
message: Could not estimate scan size for this query.
underlying: 400 ... Table "unit_economics" must be qualified with a dataset.
```
(`unit_economics` has `authorize_timestamp`, not `authorize_date`.)
## Test plan
- [x] New `test_remote_estimate_failed_surfaces_first_error_when_attempts_differ` asserts the first-attempt message wins, second-attempt is preserved as `underlying_original`, hint points to `agnes schema`.
- [x] Existing `test_guardrail_returns_400_remote_estimate_failed_on_double_parse_error` still passes (both attempts mocked to identical error).
- [x] `pytest tests/test_api_query_guardrail.py` clean.
<!-- devin-review-badge-begin -->
---
<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/218" target="_blank">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
</picture>
</a>
<!-- devin-review-badge-end -->
Cuts the [Unreleased] section into [0.46.0] in CHANGELOG.md and
bumps pyproject.toml. The user-visible content was already on main
via PR #190 (commit 28430ced); this is the release-cut commit that
should have been the last commit on that PR — splitting it out so
the operator-facing release artifact (tag + GitHub Release) lines up
with what's already deployed at :stable.
* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three endpoints:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Sources keboola URL from env
too (KEBOOLA_STACK_URL) when instance.yaml doesn't have stack_url
configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' bucket. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs that fill multi-GiB / many-slice
exports long before the dedicated data disk fills. agnes-dev hit
100% boot-disk while the 20 GiB data disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour). Both
materialize_query (parquet path) and _extract_via_legacy (CSV
fallback) and the sliced-CSV concat path in storage_api use the
helper now.
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
Operator-and-analyst quality bundle: a security fix for the optional
Telegram bot, two CLI gaps closed, and three rounds of UX polish on
`agnes diagnose` and `agnes pull` so non-TTY consumers (CI runners,
Claude Code SessionStart hooks, sub-agent watchdogs) get readable,
actionable signal.
- Pairing-code RNG: random.choices -> secrets.choice (CSPRNG).
- Telegram script runner: refuse out-of-shape usernames before sudo -u.
CLAUDE.md.bak.<ISO-timestamp> before regenerating.
- agnes admin unregister-table <id> -> DELETE /api/admin/registry/{id}
- agnes admin update-table <id> --field=value ... -> PUT /api/admin/registry/{id}
response but never promotes the headline. BQ billing-equals-data check
downgraded warning -> info.
default (5 s / 1 MiB vs 30 s / 10%) so sub-agent watchdogs don't kill
the pull as a hung process. New env knobs:
AGNES_PULL_PROGRESS_INTERVAL_{SECONDS,BYTES}.
--include-schema (or ?include=schema) to opt back in.
Tests: 120 passed across the touched modules, including new tests for
each fix. Pre-existing failures on main (DB migration v1->v9, binary
rename) are unrelated and not introduced here.
See CHANGELOG.md for the full entry. (Bumped from 0.42.0 to 0.43.0 since
0.42.0 was taken by PR #208's backtick-rewriter fix during this branch's
review cycle.)
0.40.0 added _persist_materialized_inner_view in materialize_query, which
tried to open extract.duckdb from a fresh DuckDB handle to write the _meta
row + inner view. In production this conflicts with the same uvicorn
process's existing read-only ATTACH (orchestrator's analytics conn holds
extract.duckdb ATTACHed as <source_name> alias), and DuckDB single-process
file-handle uniqueness rejects with:
Binder Error: Unique file handle conflict: Cannot attach "extract"
— already attached by database "<source>"
The helper logs WARNING fail-soft, parquet stays canonical, but the
master view never appears via the meta path.
Fix: at the end of _attach_and_create_views, scan
<extract_dir>/data/*.parquet and CREATE OR REPLACE VIEW <id> AS
SELECT * FROM read_parquet('<path>') for any parquet whose <id> is not
already in the per-source tables list (= meta path didn't pick it up).
Decoupled from materialize_query open-handle race. Honors the same
view_ownership cross-connector collision rules as the meta path
(first-come-first-served via view_repo.claim).
Tests:
- filesystem-fallback fires when _meta row missing
- skipped when meta path already created the view (no shadow)
- skips invalid identifiers (e.g. parquet stem starting with a digit)
- doesn't crash when source has no data/ subdir
Pre-fix flow:
1. extractor subprocess writes _meta with N remote rows + creates N inner
views in extract.duckdb (rebuild_from_registry skips materialized rows
per design — explicit `continue` at line 389)
2. _run_materialized_pass calls materialize_query, which writes parquet
atomically + returns stats — but never updates _meta
3. orchestrator.rebuild scans _meta, finds only the N remote rows, creates
master views only for them. Materialized parquet is on disk but
invisible to /api/query → 400 'not yet materialized'
Symptom appears after every container recreate (the previous run's _meta
state is wiped because docker compose down nukes the named volume that
backs extract.duckdb on some compose layouts; even on volumes that
persist, the next extractor pass calls _create_meta_table which DROPs
+ CREATEs _meta cleanly).
Fix: after os.replace(tmp_path, parquet_path) in materialize_query, open
extract.duckdb (read-write), DELETE existing _meta row for table_id,
INSERT new one with query_mode='materialized', and CREATE OR REPLACE
VIEW <table_id> AS SELECT * FROM read_parquet(<path>). All inside a
single transaction so concurrent reads see either old or new state, not
torn rows. Fail-soft on lock contention or schema drift — parquet
remains canonical, next sync pass recovers.
Tests: 3 new in test_bq_materialize.py covering:
- meta + inner view registered after materialize, alongside existing
remote rows
- re-run replaces (not duplicates) the meta row
- skips inner-view registration when extract.duckdb doesn't exist yet
(fresh BQ-only deployment edge case)
Pool the httpx.Client used by `stream_download` so N parquet downloads
share a single TLS handshake instead of one handshake each. With the
optional `h2` package installed, HTTP/2 multiplexing further lets all
chunk Range requests share a single TCP connection — synergizes with
the range-chunked download path added in the previous commit.
The shared client is created lazily on first stream-download call, kept
alive for the duration of the process via a module-level slot, and
closed at exit via `atexit.register`. Construction wraps in a
try/except: when `h2` is unavailable (slim install), httpx raises
ImportError on `http2=True` and we transparently fall back to an
HTTP/1.1 client — pooling alone still amortizes TLS handshakes.
`agnes pull` must never crash on a missing optional package, so the
fallback path is non-negotiable. `h2>=4.1.0` is added to the core
dependency set; downstream slim installs that drop it lose the HTTP/2
benefit but keep correctness.
Renames the [Unreleased] section to [0.36.0] in CHANGELOG, adds the
top-level summary, drops a fresh empty [Unreleased] above, and bumps
pyproject from 0.35.1.
Also fixes the third Devin Review finding on this PR: the CLI
ReadTimeout message hardcoded QUERY_TIMEOUT_S (300s) so a 30s-default
call (agnes catalog, agnes auth, …) reported a wait window that
didn't match reality. _translate_transport_error now takes the actual
httpx timeout from the calling helper; the BQ-job advisory only
appears for calls where the timeout was set ≥ 60s.
Three first-try-failure-surface fixes from Pavel's #185 trace + the
template guidance question, all under PR #188's umbrella so they land
together with the file_server / parallel pull / Tier 1 work.
1. CLI clean-error wrapper — new AgnesTransportError raised by the
api_*/stream_download helpers when httpx times out / drops /
refuses, plus a top-level Typer wrapper (cli/main.py) that prints
one-line "Error: …" + actionable hint and exits non-zero. Full
traceback goes to ~/.config/agnes/last-error.log for support
forwarding. Unhandled Exceptions are caught at the same boundary
so no Python traceback ever leaks to the analyst's terminal.
Pavel's #185 Phase 3B: a 30-frame httpx traceback from a slow BQ
--remote query made it look like a CLI bug. Now: clean message +
hint pointing at `agnes snapshot create` / partition-column
guidance.
Entry point in pyproject.toml flipped from `cli.main:app` →
`cli.main:_run_with_clean_errors` so the wrapper actually runs
under the installed `agnes` binary.
2. agnes init / agnes pull --skip-materialize + progress bar.
--skip-materialize omits query_mode='materialized' rows from the
download set so a first init doesn't spend 44 minutes silently
pulling a single 6 GB parquet (Pavel's #185 Phase 1). Rich-driven
per-file progress bar with label/bytes/rate/ETA renders to stderr
when not --quiet and not --json. Aggregates across the parallel
ThreadPoolExecutor workers added earlier in this PR.
3. config/claude_md_template.txt: explicit one-line snippet pointing
at `agnes catalog --json | jq '.tables[] | select(.id=="<id>")'`
for per-table descriptions + restated invariant: "the description
field on each catalog row is the authoritative business-rules
text — re-read live, never copy into this file." Resolves the
regression-or-feature debate between Pavel (wants annotations)
and the user feedback that landed in the prior commit (don't
embed table-specific content; tables change). Catalog command
stays the source of truth.
* fix(keboola): per-table fallback to legacy Storage-API client
The DuckDB Keboola extension's per-table COPY fails with
`Schema '..."in.c-..."' does not exist or not authorized` on
projects whose Snowflake backend doesn't expose bucket schemas
to the storage-token-derived QueryService role
(keboola/duckdb-extension#17). ATTACH itself succeeds, so the
existing extension-level fallback in `_try_attach_extension`
never triggers — the table is just marked failed.
- Promote `kbcstorage>=0.9.0` from optional to core dep so the
legacy client import in `_extract_via_legacy` doesn't crash
default installs with `ModuleNotFoundError`.
- Wrap `_extract_via_extension` in a per-table try/except so a
scan failure retries via `_extract_via_legacy` instead of
recording `tables_failed` and moving on.
Slower than the extension path, but produces correct parquets
on affected projects while the upstream extension fix lands.
* test(keboola): cover per-table extension→legacy fallback
Two existing tests mocked _extract_via_extension to throw and asserted
the original message survived in result["errors"]. With per-table
fallback, the new flow retries via _extract_via_legacy — which on the
mock URLs would throw a different (404 / DNS-fail) error, replacing the
asserted message.
- Mock _extract_via_legacy alongside _extract_via_extension in
test_network_timeout_during_extraction +
test_partial_failure_continues +
test_all_tables_fail_returns_full_failure_stats so the assertion
observes the final propagated error from the fallback chain.
- Add test_extension_per_table_failure_falls_back_to_legacy that
exercises the new behavior directly: extension scan fails with the
QueryService schema-not-authorized message
(keboola/duckdb-extension#17), legacy succeeds, parquet ends up
queryable.
Patch release bundling the only Unreleased change: bump httpx client
timeout for agnes query --remote from 30s to 300s (configurable via
AGNES_QUERY_TIMEOUT). Renames CHANGELOG [Unreleased] section to
[0.35.1] and bumps pyproject version to match.
Five compounding defects on default `docker compose up` deploys made the
session pipeline silently broken: sessions uploaded by analysts via
`agnes push` landed on /data/user_sessions/<user>/*.jsonl but nothing
ever processed them. Fix is one PR: promote anthropic + openai to core
deps, wire all three LLM-pipeline jobs into scheduler-v2 with offset
cadences (10m/15m/17m), drop the side-car services from compose, seed a
default ai: block on first-time setup with an env-var fallback in code,
surface the pending review queue to admins, and expose a health check
that warns when uploaded jsonls aren't being processed.
**BREAKING** for operators on COMPOSE_PROFILES=full or with custom
Compose overrides referencing the corporate-memory or session-collector
service stanzas — drop them. The scheduler is now the sole driver.
LLM provider SDKs are imported by services/corporate_memory and
services/verification_detector — both production code paths. Listing
them only in [project.optional-dependencies].dev caused the scheduler
container to boot-loop with ModuleNotFoundError on default
`docker compose up` deploys, because the Dockerfile installs core
deps only (`uv pip install --system --no-cache .`).
Adds tests/test_packaging.py to lock the contract: anthropic + openai
must live in [project].dependencies, not in dev extras.
CHANGELOG: rename [Unreleased] → [0.32.0] — 2026-05-04, prepend a new
empty [Unreleased] for next-PR landing zone.
pyproject.toml: version 0.31.0 → 0.32.0.
Per repo discipline (memory: feedback_release_cut_with_pr.md), the
release-cut commit lands as the FINAL commit of the PR that contained
the user-visible behavior change — it does not get a separate PR.
After merge: tag v0.32.0 on the merge commit + create a GitHub Release
(memory: feedback_github_release_per_tag.md — the tag alone isn't
enough; the Release prose is the operator-visible artifact).
Headline: closes#160. da query --remote now resolves query_mode='remote'
BQ rows whose entity is VIEW or MATERIALIZED_VIEW (the bug Pavel hit).
Plus 4 reinforcing fixes — server-side cost guardrail (bq_max_scan_bytes,
default 5 GiB), registry-gating of direct bq.* paths, bigquery_query()
function-call backdoor closed, structured CLI render of typed BQ errors —
and one operator-side admin convenience (BQ test-connection endpoint +
billing_project placeholder UI).
14 issues caught and addressed across 6 iterations of Devin Review.
E2E verified on agnes-zsrotyr.groupondev.com (commit 7f743d03):
- VIEW path resolves (count=23 from active_inventory_view)
- VIEW aggregate parity vs filtered BASE TABLE
- cost guardrail rejects with structured 400 detail
- bq_path_not_registered 403 (incl. quoted "bq" variant)
- bigquery_query() blocklist 400
- test-connection endpoint 200 with elapsed_ms
* security(auth): per-IP rate limit on auth endpoints + generalize last-admin guard
Closes#45 and #151.
#45 — every auth endpoint was unthrottled (login, magic-link, token,
bootstrap), leaving us open to password brute-force and SMTP
email-bombing. Wires slowapi (new dep) into the middleware chain with
per-route limits: 10/min on login + token, 5/min on send-link, 3/min on
bootstrap. Returns 429 with Retry-After: 60 once exceeded. Per-IP key
respects the leftmost X-Forwarded-For hop (Caddy in front of the app
strips client-supplied XFF). Operator escape hatch:
AGNES_AUTH_RATELIMIT_ENABLED=0. Test suite disables the limiter via
autouse conftest fixture so existing auth tests that hammer endpoints
in tight loops are unaffected.
#151 — DELETE /api/admin/users/{id}/memberships/{group_id} and the
mirror DELETE /api/admin/groups/{group_id}/members/{user_id} only
guarded against self-removal as last admin. Generalizes to refuse
removing anyone from the seeded Admin group when they are the only
remaining active admin (mirrors the existing
count_admins(active_only=True) <= 1 check on delete_user / update_user).
Recovery from zero admins requires direct DB access, so this closes
a path where a scheduler/bootstrap actor that bypasses normal admin
checks could otherwise empty the group.
* security(auth): throttle remaining email-bombing + token-confirm endpoints
Address code-review gap on PR #165 — the first commit covered /send-link
but missed two endpoints with the IDENTICAL email-bombing surface:
- POST /auth/password/reset — sends reset mail, anti-enum response
- POST /auth/password/setup/request — sends setup mail, anti-enum response
Both now share the 5/min limit with /send-link.
Also add 10/min to the token-confirm surfaces — high-entropy tokens but
partial leaks via logs / referer have surfaced before, and unbounded
guess rate would let an attacker exhaust the keyspace adjacent to a
leaked prefix:
- POST /auth/email/verify
- GET /auth/email/verify — closes the click-through bypass
- POST /auth/password/reset/confirm
- POST /auth/password/setup/confirm
Doc fix: rate_limit.py module docstring + CHANGELOG entry no longer
claim "disable without a redeploy" (misleading). The Limiter constructor
freezes `enabled` from env at import time, matching every other Agnes
env knob — operators set the flag and bounce the container.
Tests: 4 new cases in test_auth_rate_limit.py covering
/reset, /setup/request, /reset/confirm, GET /verify. Full suite:
2583 passed, 32 skipped, 0 failed.
* security(auth): throttle JSON /auth/password/setup — closes form-throttle bypass
Second code-review pass on PR #165 caught a fifth gap: POST /auth/password/setup
(JSON variant, kept for backward compat) consumes the same setup_token as
the web form /setup/confirm but was unthrottled — an attacker brute-forcing
the token just switches from the form path to the JSON path and resumes
at unbounded RPS. Apply the same 10/min limit and signature shape used
on /setup/confirm.
Also extend CHANGELOG note about the JSON-variant bypass for future
operators reading the security entry.
Test: 1 new case (test_password_setup_json_rate_limited_after_10_requests),
9 rate-limit tests + 28 password-flow tests + 41 auth-provider tests pass,
no regressions.
* chore(release): cut 0.30.1 — auth security hardening (rate limit + last-admin guard)
* fix(tls-rotate): self-signed fallback sets basicConstraints=critical,CA:FALSE
OpenSSL's default '[v3_ca]' config marks CA:TRUE on 'req -x509', which
causes strict TLS stacks (rustls / webpki, used by uv, cargo, and
future versions of pip) to reject the cert with
'invalid peer certificate: CaUsedAsEndEntity' per RFC 5280 §4.2.1.9.
Browsers, curl, and OpenSSL-based clients tolerated the violation,
hiding the bug until a uv user hit it.
Affects every VM running on the self-signed fallback while the corp
PKI hasn't published the real chain yet. Fix lands on the next
agnes-tls-rotate.timer tick (or 'systemctl start
agnes-tls-rotate.service' for an immediate refresh). Existing CSR /
real-cert paths unaffected; only the bring-up fallback regenerates.
* chore(release): cut 0.29.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(analyst): document BigQuery remote-query capability in bootstrap CLAUDE.md template
Closes#153.
The CLAUDE.md template generated by `da analyst bootstrap` (config/claude_md_template.txt)
covered metrics, sync, corporate memory, and directory layout — but had ZERO
mention of query_mode: "remote", da fetch, da query --remote, or --register-bq.
Result: the AI analyst running in a freshly-bootstrapped workspace had no
idea BigQuery-backed tables existed, no path to fetch unsynced data, and no
fallback for tables not in the catalog.
Validated against /Users/<user>/foundry-ai/foundryai-data-analyst/CLAUDE.md
on 2026-05-01: section confirmed missing. Workspace-level (parent-dir)
CLAUDE.md carried legacy SSH-heredoc instructions but the analyst-level
file (which Claude reads as primary project context) had nothing.
## Changes
### config/claude_md_template.txt (+83)
Added a `## Remote Queries (BigQuery)` section covering:
- Discovery first — `da catalog --json | jq '...'` to see all tables
with their query_mode, then `da schema` and `da describe` for shape.
- Three query patterns:
- `da fetch` (preferred) — materialize a filtered subset locally,
query the snapshot, drop when done.
- `da query --remote` — one-shot server-side execution (cheap probes).
- `da query --register-bq` — hybrid joins between local + ad-hoc BQ.
- `da fetch` estimate-first discipline — rules of thumb on
--select / --where / --estimate / snapshot reuse.
- BigQuery SQL flavor cheat sheet for `--where` (DATE literal,
DATE_SUB, REGEXP_CONTAINS, CAST AS INT64).
- Unknown-table fallback: when a table isn't in `da catalog` at all,
use ad-hoc `--register-bq` if the agnes server SA has BQ access, or
ask admin to register with `query_mode: "remote"` for ongoing use.
- Pointer to `da skills show agnes-data-querying` for deeper guidance.
### docs/setup/claude_md_template.txt (deleted)
Stale 359-line template that documented the deprecated SSH-heredoc
remote_query.sh protocol. No code references it (verified via grep
across .py / .sh / .yml / .md). Removing eliminates two failure
modes:
1. A future refactor accidentally pulling it into a workspace and
shipping deprecated guidance to analyst Claude sessions.
2. Reviewer confusion over which template is canonical.
### CHANGELOG.md
`### Fixed` and `### Removed` entries under [Unreleased].
## Tested
- Manually walked the diff against `da skills show agnes-data-querying`
output on a live VM (foundryai-development) — patterns + flags
match the modern CLI exactly.
- Re-bootstrap test deferred: requires network round-trip; pattern
is identical to existing template substitution path so render is
not at risk.
## Out of scope
- The companion gap that data_description.md often only enumerates
query_mode: "local" tables (no signal that other modes exist) —
separate concern, fix likely belongs in the metadata generator
on the server side, not in the analyst template.
- Encouraging admins to register frequently-queried BQ tables as
`query_mode: "remote"` in the registry — workflow improvement, not
a code bug.
* chore(release): cut 0.28.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(rbac): drop dataset_permissions + access_requests + users.role + is_public; v19 migration
BREAKING. Sjednocení datové RBAC vrstvy do per-group resource_grants modelu.
Před PR byla legacy data RBAC vrstva (dataset_permissions + is_public bypass)
de-facto neaktivní — is_public neměl API/UI/CLI surface, default true znamenal
že can_access_table vždycky bypassl. Dnes každý non-admin přístup vyžaduje
explicitní resource_grants(group, "table", id) řádek.
Schema v18 → v19 (src/db.py:_v18_to_v19_finalize):
- DROP TABLE dataset_permissions, access_requests
- DROP COLUMN users.role (NULL artifact since v13)
- DROP COLUMN table_registry.is_public
- Drops přes table-rebuild idiom (rename → create new → INSERT … SELECT
→ drop old) kvůli DuckDB ALTER DROP COLUMN limitacím na tabulkách
s historic FK constraints. INSERT picks intersection sloupců, takže
test fixtures s minimal pre-v19 schemou migrate cleanly.
Runtime:
- src/rbac.py:can_access_table → deleguje na app.auth.access.can_access
- DatasetPermissionRepository, AccessRequestRepository smazány
- AGNES_ENABLE_TABLE_GRANTS env-gate v app/resource_types.py odstraněn
(TABLE je unconditionally enabled)
API drop:
- app/api/permissions.py, app/api/access_requests.py celé soubory
- /admin/permissions web route + admin_permissions.html
- "Request Access" modal v catalog.html + locked-row UI
- ~10 if user.get("role") != "admin" checků nahrazeno (admin shortcut
je uvnitř can_access_table)
- /api/settings: drop permissions field z GET; PUT /api/settings/dataset
gate přepnut na can_access(user_id, "table", dataset, conn)
Auth:
- app/auth/jwt.py:create_access_token: drop role parametr (claim zmizí
z nově vydávaných JWT; staré tokeny zůstávají valid, claim ignored)
- app/api/users.py: drop role z CreateUserRequest / UpdateUserRequest
(admin promotion = explicit add to Admin group via memberships API)
- src/repositories/users.py: drop role z create() / update()
CLI:
- da admin set-role smazán → hard-fail s replacement command
- da admin add-user --role flag pryč
- da auth import-token --role flag pryč
- da auth whoami: drop "Role:" výpis
- cli/config.py:save_token: role parametr now optional, no longer written
(back-compat se starými token.json soubory zachována — pole se ignoruje)
Tests:
- DELETE: test_permissions.py, test_permissions_api.py, test_access_requests_api.py
- REWRITE: test_access_control.py (resource_grants flow), test_rbac.py
(can_access_table over resource_grants), test_journey_rbac.py
(drop access-request flow), test_resource_types.py (drop env-gate
tests, drop is_public from helpers), test_v2_*.py (drop role-based
user dicts in favor of id-based + Admin group membership),
test_settings_api.py (no permissions field, can_access gate)
- TRIVIAL: ~30 souborů — drop role="admin" arg z UserRepository.create
a 3rd positional role z create_access_token
- NEW: test_v18_to_v19 migration test (test_db.py),
test_can_access_table_no_implicit_public (test_rbac.py),
test_admin_set_role_returns_hardfail (test_cli_admin.py)
- OpenAPI snapshot regenerated
Docs:
- CHANGELOG: BREAKING entry pod [Unreleased]
- CLAUDE.md: schema v18 → v19
- docs/architecture.md: schema table + RBAC sekce přepsána
- docs/auth-google-oauth.md: admin promotion přes da admin break-glass
- cli/skills/security.md: kompletně přepsáno na group-based model
- docs/TODO-rbac-data-enforcement.md: smazáno (TODO splněn)
Test results: 2363 passed, 19 failed. Zbývající failures jsou pre-existing
Windows-specific issues (fcntl, charset) nesouvisející s tímto PR —
ověřeno git stash pop.
Plan: ~/.claude/plans/floofy-coalescing-parnas.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(release): cut 0.27.0
---------
Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* refactor(ops): bake all host artifacts into image, drop every curl-from-main
Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.
## Motivation
The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:
docker-compose.yml
docker-compose.prod.yml
docker-compose.host-mount.yml
docker-compose.tls.yml
Caddyfile
scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0)
Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:
1. **Split-brain version pinning.** image_tag pins the docker image to an
immutable digest. The compose files + script bypassed that pinning by
tracking `main` (or the rarely-set compose_ref). A customer pinned to
stable-2026.04.516 could wake up tomorrow with their host artifacts
floating on whatever shipped to main overnight — even though they're
explicitly pinned for stability.
2. **No rollback knob.** Reverting a bad host artifact meant reverting
the upstream PR globally — affects every customer that reboots after
the bad commit. No "rollback for me only" path; tag-pinning gave no
protection.
3. **Public-internet dependency on every boot.** The image is already
pulled from a private registry on the same boot. Reusing that channel
is strictly cheaper than adding a second one. Customers with restricted
egress (no raw.githubusercontent.com reachability) silently broke on
every boot.
## Changes
### Dockerfile (+19 -8)
After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:
agnes-auto-upgrade.sh (mode 0755 — host cron driver)
docker-compose.{yml,prod,host-mount,tls}.yml
Caddyfile (mode 0644)
Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board,
stable path that won't shift if /app structure refactors.
### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)
Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:
docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
chmod +x /usr/local/bin/agnes-auto-upgrade.sh
The auto-upgrade section (#6) is now a no-op — script is already in place.
### infra/modules/customer-instance/variables.tf (+1 -1)
`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.
### CHANGELOG.md
`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).
## Out of scope (filed as follow-ups)
1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
curls the same artifacts from main. Symmetric fix needed there.
Will file as a separate PR against the infra repo.
2. **Self-update inside agnes-auto-upgrade.sh** after a successful
`docker compose pull` of a new digest. Otherwise the running cron
keeps using the OLD baked-in script for one tick after image upgrade.
~10 LOC. Deferred to keep this PR scoped.
3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
bash currently sourced via the infra repo. Should follow the same
bake-into-image pattern.
## Tested
- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
this Dockerfile change. **Recommend reviewer trigger CI build, then
do a single VM-recreate against a dev VM (e.g. foundryai-development)
to confirm the extract path works end-to-end before merge.**
## Compatibility
- Existing VMs running 0.25.0 are unaffected — they have host artifacts
in place from `curl from main` already; this PR doesn't touch them.
They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
in the image) would FAIL the docker cp. Current diff fails-loud (no
fallback). Recommend operators upgrade to a fresh-enough image_tag
alongside the template upgrade — same coupling as any compose-flag bump.
* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances
The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.
Devin ANALYSIS_0004 on PR #149.
* fix(changelog): mark BREAKING + drop private-repo reference
Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.
Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.
Devin BUG_0001 + WARN_0001 on PR #149.
* chore(release): cut 0.26.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted
Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.
## Why an app-side guard
agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.
The boot-time fixes in infra are preventive. This is the runtime backstop.
## Behavior
Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:
1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
- Mount the config disk if /data/state is not a mountpoint.
- Detect mismatch: if /data/state is mounted from the wrong source,
umount and retry.
2. After the loop, assert findmnt source matches the config disk.
- On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
marks the service failed; no docker compose action runs; existing
containers (if any) keep running on stale state, but no new write
lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
on every run. Idempotent. Guards against propagation regressions
sneaking back in via future docker / kernel changes.
VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.
## Tested
Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.
## Rollout
- This file is fetched by infra startup.sh from
raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
boot. Once merged to main, all VMs pick up the new script on their
next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
`scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
/usr/local/bin/agnes-auto-upgrade.sh` (already done on
foundryai-development).
* chore: vendor-agnostic comment + changelog text
Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.
* fix(ops): suppress mount stderr in retry loop
Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.
Devin BUG_0001 on PR #146.
* fix(changelog): move auto-upgrade entry to [Unreleased]
Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.
Devin BUG_0001 on PR #146.
* fix(infra): single-source agnes-auto-upgrade.sh via curl from main
Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).
Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling
The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.
Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.
Devin BUG_0001 + ANALYSIS_0001 on PR #146.
* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing
Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.
Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).
Devin BUG_0001 on PR #146.
* chore(release): cut 0.25.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>