The startup script runs on every boot but the metadata_startup_script
field is in lifecycle.ignore_changes — so a TF apply that changed
image_tag does NOT reach a long-lived VM until someone explicitly
recreates it. Meanwhile, operators commonly hand-edit /opt/agnes/.env
to pin a specific image (custom branch builds, staged rollouts).
Pre-fix, every boot rewrote .env from the baked-in template and
clobbered the operator's choice — concretely, a stop+start triggered
by a machine_type change would reset AGNES_TAG to whatever was in
the template at first provision, regardless of the operator's
intervening edit.
Now the script reads the existing .env (when present) for AGNES_TAG
and AGNES_TEMP_DIR; when those keys are set, the existing values
win over the template-computed ones. Logged on stdout when AGNES_TAG
disagrees with $IMAGE_TAG so an operator audit-trails the boot.
Fresh provisions are unchanged (no .env yet → template values land).
To force a TF-driven reset on an existing VM: rm /opt/agnes/.env
and reboot. Cut as infra-v1.8.0 — additive, downstream consumers
opt in by bumping the module ref.
* refactor(ops): bake all host artifacts into image, drop every curl-from-main
Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.
## Motivation
The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:
docker-compose.yml
docker-compose.prod.yml
docker-compose.host-mount.yml
docker-compose.tls.yml
Caddyfile
scripts/ops/agnes-auto-upgrade.sh (added in 0.25.0)
Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:
1. **Split-brain version pinning.** image_tag pins the docker image to an
immutable digest. The compose files + script bypassed that pinning by
tracking `main` (or the rarely-set compose_ref). A customer pinned to
stable-2026.04.516 could wake up tomorrow with their host artifacts
floating on whatever shipped to main overnight — even though they're
explicitly pinned for stability.
2. **No rollback knob.** Reverting a bad host artifact meant reverting
the upstream PR globally — affects every customer that reboots after
the bad commit. No "rollback for me only" path; tag-pinning gave no
protection.
3. **Public-internet dependency on every boot.** The image is already
pulled from a private registry on the same boot. Reusing that channel
is strictly cheaper than adding a second one. Customers with restricted
egress (no raw.githubusercontent.com reachability) silently broke on
every boot.
## Changes
### Dockerfile (+19 -8)
After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:
agnes-auto-upgrade.sh (mode 0755 — host cron driver)
docker-compose.{yml,prod,host-mount,tls}.yml
Caddyfile (mode 0644)
Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes); /opt/agnes-host is root-owned, mode 0755 across the board,
stable path that won't shift if /app structure refactors.
### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)
Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:
docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
chmod +x /usr/local/bin/agnes-auto-upgrade.sh
The auto-upgrade section (#6) is now a no-op — script is already in place.
### infra/modules/customer-instance/variables.tf (+1 -1)
`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.
### CHANGELOG.md
`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).
## Out of scope (filed as follow-ups)
1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
curls the same artifacts from main. Symmetric fix needed there.
Will file as a separate PR against the infra repo.
2. **Self-update inside agnes-auto-upgrade.sh** after a successful
`docker compose pull` of a new digest. Otherwise the running cron
keeps using the OLD baked-in script for one tick after image upgrade.
~10 LOC. Deferred to keep this PR scoped.
3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
bash currently sourced via the infra repo. Should follow the same
bake-into-image pattern.
## Tested
- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
this Dockerfile change. **Recommend reviewer trigger CI build, then
do a single VM-recreate against a dev VM (e.g. foundryai-development)
to confirm the extract path works end-to-end before merge.**
## Compatibility
- Existing VMs running 0.25.0 are unaffected — they have host artifacts
in place from `curl from main` already; this PR doesn't touch them.
They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
in the image) would FAIL the docker cp. Current diff fails-loud (no
fallback). Recommend operators upgrade to a fresh-enough image_tag
alongside the template upgrade — same coupling as any compose-flag bump.
* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances
The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.
Devin ANALYSIS_0004 on PR #149.
* fix(changelog): mark BREAKING + drop private-repo reference
Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.
Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.
Devin BUG_0001 + WARN_0001 on PR #149.
* chore(release): cut 0.26.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(ops): fail-fast guard in agnes-auto-upgrade — refuse to start containers if config disk not mounted
Companion to keboola/agnes-the-ai-analyst-infra#62. Same incident:
foundryai-development 2026-04-30, marketplaces / DuckDB / session secret
written to /data (sdb) instead of the config disk (sdc), wiped on next
container recreate.
## Why an app-side guard
agnes-auto-upgrade.sh fires every 5 min on every VM. If `/data/state` is
not on the config disk (because of the propagation regression fixed by the
infra PR, or the boot-time udev race fixed by infra #58, or any future
mount-loss path), this script previously ran `docker compose up -d`
anyway — and the app silently wrote state onto the wrong disk. Next
recreate, that state was gone.
The boot-time fixes in infra are preventive. This is the runtime backstop.
## Behavior
Before the existing pull/up logic, when /dev/disk/by-id/google-config-disk
exists on the VM:
1. Up to 3 mount-and-verify attempts with backoff (2s, 4s, 6s).
- Mount the config disk if /data/state is not a mountpoint.
- Detect mismatch: if /data/state is mounted from the wrong source,
umount and retry.
2. After the loop, assert findmnt source matches the config disk.
- On mismatch: `logger -t agnes-auto-upgrade FATAL` + exit 1. systemd
marks the service failed; no docker compose action runs; existing
containers (if any) keep running on stale state, but no new write
lands on the wrong disk.
3. Once verified mounted: re-apply `mount --make-rprivate /data /data/state`
on every run. Idempotent. Guards against propagation regressions
sneaking back in via future docker / kernel changes.
VMs without a config disk (foundryai-poc, single-disk legacy) skip the
whole block — the `if [ -e $CONFIG_DEVICE ]` guard.
## Tested
Patched script installed on foundryai-development as a hotfix; manual run
post-migration was a no-op (digest unchanged); /data/state stayed on sdc
across a full `docker compose down + up -d` cycle.
## Rollout
- This file is fetched by infra startup.sh from
raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main on every
boot. Once merged to main, all VMs pick up the new script on their
next boot — no infra recreate needed.
- For immediate rollout to running VMs without waiting for next boot:
`scp scripts/ops/agnes-auto-upgrade.sh <vm>:/tmp/ &&
ssh <vm> sudo install -m755 -o root -g root /tmp/agnes-auto-upgrade.sh
/usr/local/bin/agnes-auto-upgrade.sh` (already done on
foundryai-development).
* chore: vendor-agnostic comment + changelog text
Drop customer-specific VM names from the script comment and
CHANGELOG entry. The OSS distribution should not name a particular
operator's hosts; the technical description already conveys why
the guard exists.
* fix(ops): suppress mount stderr in retry loop
Match the rest of the script's error-tolerant idiom (2>/dev/null).
Mount failures in the cold-boot udev race the loop is designed
to handle gracefully should not flow to stdout — cron would mail
on every transient retry.
Devin BUG_0001 on PR #146.
* fix(changelog): move auto-upgrade entry to [Unreleased]
Entry landed under v0.20.0 because that section was [Unreleased]
when this branch first opened — releases v0.21–v0.24 cut in the
meantime stranded it inside an already-released section. Move it
back where new entries belong.
Devin BUG_0001 on PR #146.
* fix(infra): single-source agnes-auto-upgrade.sh via curl from main
Replace the inline heredoc copy of the auto-upgrade script in the
customer-instance Terraform startup template with a curl fetch from
raw.githubusercontent.com on every boot. The inline copy had drifted
several iterations behind canonical scripts/ops/agnes-auto-upgrade.sh
(missing TLS overlay detection, array-form COMPOSE_FILES, and now
the config-disk fail-fast guard from this PR).
Devin ANALYSIS_0001 on PR #146.
* fix(infra): fetch docker-compose.tls.yml unconditionally + document coupling
The canonical agnes-auto-upgrade.sh from main detects TLS at runtime
via cert files on disk, regardless of the TLS_MODE Terraform variable.
Certs can appear after boot via agnes-tls-rotate.sh or manual
provisioning, and the cron job would then fail every 5 min under
'set -euo pipefail' because docker-compose.tls.yml was never fetched.
Also document the main-vs-COMPOSE_REF coupling: when the canonical
script references a new compose file, the fetch list above must be
updated to match — pinned-ref VMs would otherwise break.
Devin BUG_0001 + ANALYSIS_0001 on PR #146.
* fix(ops,infra): unconditional Caddyfile + skip tls overlay if missing
Caddyfile fetch now matches docker-compose.tls.yml: unconditional in
startup-script.sh.tpl. Without it, Docker would auto-create an empty
directory at the bind-mount target and Caddy would crash-loop while
the tls overlay has already closed :8000 — making the app
unreachable on any non-caddy VM where certs land via rotate or
manual provisioning.
Defensive layer: agnes-auto-upgrade.sh now also requires Caddyfile
to exist (size > 0) before activating the tls profile, with a
WARN log if it's missing. Belt-and-suspenders so the failure mode
is contained even when the script is deployed by some other path
(not just the customer-instance TF module).
Devin BUG_0001 on PR #146.
* chore(release): cut 0.25.0
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* fix(scheduler): HTTP marketplaces job + SCHEDULER_API_TOKEN shared secret
Two scheduler-reliability bugs surfaced after the v0.12.1 USER-agnes flip:
1. The marketplaces job called src.marketplace.sync_marketplaces() in-process
from the scheduler container, racing the app's long-lived system.duckdb
handle. DuckDB rejects cross-process writers — every cron tick 500-ed on
"Could not set lock on file ... PID 0".
2. The data-refresh + new marketplaces jobs both 401-ed on the API because
SCHEDULER_API_TOKEN was never propagated by the Terraform startup script.
The scheduler had no credential to authenticate with.
Fix:
- New POST /api/marketplaces/sync-all (admin-only) drives the nightly refresh
through the app process so it inherits the existing DB connection.
- Scheduler swaps fn->http for marketplaces; all jobs are now plain HTTP and
the scheduler is reduced to a cron clock.
- New app/auth/scheduler_token.py adds a shared-secret auth path. The
startup script generates a 256-bit secret on first boot, persists it
across reboots, and writes it to /opt/agnes/.env. Both containers source
the same .env. The app validates incoming Bearer tokens against the env
var (constant-time, length-floored) and resolves matches to a synthetic
scheduler@system.local user that's a member of the Admin system group.
Audit-log entries from the scheduler are attributed to this user.
- app/main.py seeds the synthetic user at startup so the first cron tick
has a valid actor; lazy seed in get_scheduler_user covers token rotation
before the next app restart.
Tests: 5 new in tests/test_auth_scheduler_token.py covering empty/short
secret rejection, exact-match comparison, idempotent user seeding, and
lazy provisioning. 142 marketplace + scheduler tests + 96 auth tests
remain green.
Existing VMs with .env from before this change need a one-time
re-provisioning (re-run startup-script or rotate via openssl rand);
documented in CHANGELOG.
* fix(audit): use '_all' sentinel for bulk marketplace sync — Devin review #127
Avoids the literal string 'marketplace:None' in the audit_log resource
column when the bulk sync endpoint writes its summary row.
* fix(scheduler): unblock event loop + per-job timeouts — Devin review #127
Two findings from Devin re-review on commit 5fbad15:
1. BUG: trigger_sync_all was async def, so FastAPI ran it on the asyncio
event loop. sync_marketplaces() does blocking I/O (subprocess git
clones up to GIT_TIMEOUT_SEC=300 each, threading.Lock, DuckDB writes)
and would freeze every concurrent request for the duration of a bulk
sync. Switched to plain def so FastAPI auto-routes to the thread pool.
2. ANALYSIS: scheduler used a fixed 120s httpx timeout for every POST.
Bulk marketplace sync iterates the registry under a single lock with
up to 300s per repo — easily exceeds 120s on 2-3 slow repos. The
scheduler then sees a timeout, doesn't update last_run, and re-fires
on the next 30s tick, queueing redundant work. Per-job timeout
override added to the JOBS tuple; marketplaces gets 900s (15 min),
data-refresh keeps 120s, health-check 30s.
* fix(auth): require_session_token rejects scheduler shared secret — Devin review #127
require_session_token gates /auth/tokens (PAT minting). Pre-fix it only
rejected JWTs with typ=pat — but the scheduler shared secret is an opaque
string, so verify_token() returns None, payload becomes {}, and the
PAT-claim check silently passed. A caller bearing SCHEDULER_API_TOKEN
could mint persistent PATs that survive a secret rotation.
Added explicit is_scheduler_token() check before the PAT-claim check;
new regression test in tests/test_auth_scheduler_token.py.
Devin's other note (pre-existing async def trigger_sync at marketplaces.py:392
also calls blocking sync_one) — Devin flagged it as out-of-scope for this PR
and I agree; tracking separately.
* release(0.17.0): cut + clean up CHANGELOG duplicates
Cuts 0.17.0 (minor: scheduler shared-secret auth + sync-all endpoint
plus the deploy-shape fixes that landed since the last release tag).
Bumps pyproject from 0.15.0 — also corrects the missed bump from PR #120
(v0.16.0 was tagged on GitHub and shipped as :stable, but pyproject
stayed at 0.15.0, so /api/version, /cli/latest, and `da --version` had
been under-reporting the running release).
Removes the long-form duplicate entries for 0.13.0 / 0.14.0 / 0.15.0
above [0.16.0] — the canonical short summaries (with GitHub-release
links) already exist below 0.16.0, the long forms were leftover state
from before those versions were cut and have been silently shadowed
ever since.
* feat(auth): display Google Workspace groups on /profile
- Request cloud-identity.groups.readonly scope in Google OAuth
- Fetch groups via Cloud Identity API after callback; tolerate 4xx
(non-Workspace tenants) and network errors — never break login
- Store result in Starlette session as google_groups
- Replace /profile redirect with a real profile page rendering
account details (email, name, role) and the group list; show a
friendly empty state when no groups are available
- Tests: helper parsing + 403 + exception paths; profile page
smoke test; updated the old redirect test
* test: remove stale /profile redirect tests
Cherry-pick of Zdeněk's 4f7e4cd ("display Google Workspace groups on
/profile") replaces the /profile redirect with a real profile page —
but only updated one of three tests that expected the old behaviour.
These two tests in test_admin_tokens_ui.py and test_pat.py were left
asserting `/profile → 302 /tokens`, which now returns
`/profile → 302 /login?next=%2Fprofile` for unauth users (the standard
auth guard) or `/profile → 200 HTML` for authenticated users.
Removed both rather than patched — coverage for the new behaviour
already exists in tests/test_auth_providers.py (added by the same
commit). The /tokens render assertions in the deleted test_pat.py case
are redundant with test_admin_tokens_ui.py's own /tokens UI tests.
* fix(auth): Google groups search query needs parent + labels predicates
Cloud Identity Groups Search API returns 400 INVALID_ARGUMENT when the
CEL query lacks the required `parent == 'customers/<id>'` predicate AND
a `'<label>' in labels` membership predicate. Zdeněk's original 4f7e4cd
query had only `member_key_id == '<email>'` — every fetch silently
returned [] and the /profile groups list was always empty.
Fix: build the query with all three required pieces:
parent == 'customers/my_customer' (alias = caller's own Workspace
org; no need to look up customer ID)
member_key_id == '<email>' (filter to this user's memberships)
'cloudidentity.googleapis.com/groups.discussion_forum' in labels
(Workspace mailing-list groups —
the common case; security-group
coverage is a follow-up)
Also: log the full error body (not truncated to 200 chars) and the
query string so the next time Google rejects something we can diagnose
in one log line instead of a re-deploy.
Caught when first agnes-dev login completed normally (HTTP 302) but app
log showed `Google groups fetch returned 400 for petr@keboola.com:
{"error":{"code":400,"message":"Request contains an invalid argument."}}`
on the same VM (kids-ai-data-analysis / agnes-dev.keboola.com).
Reference: https://cloud.google.com/identity/docs/reference/rest/v1/groups/search
* feat(web): add Profile link to user dropdown menu
The /profile page (Zdeněk's 4f7e4cd cherry-pick) renders a real profile
view including Google Workspace groups, but had no entry point in the
UI — users could only reach it by typing the URL manually. Add a
"Profile" menu item between the user header (email + role) and
"My tokens" so the page is discoverable.
Side effect: cleaned up the leftover `or _path.startswith('/profile')`
condition on the "My tokens" active class, which dated from the old
/profile → /tokens redirect (removed in c789617). Now each menu item
owns its own active state.
* fix: profile-link tests + .env quoting for CADDY_TLS
Two issues caught by Keboola's first agnes-dev deploy + agnes-auto-upgrade
cron run:
1. tests/test_web_ui.py — two negative assertions ("href=/profile" NOT in
body) date from when /profile was a redirect-only stub. Now /profile
is a real page (groups display) AND has a dropdown menu link, so the
negative assertions flip to positive. Same for ">Profile<" text in
the non-admin nav test.
2. startup-script.sh.tpl — CADDY_TLS line must be QUOTED in .env, because
agnes-auto-upgrade.sh sources .env via `set -a; . .env; set +a` and
bash treats `KEY=value with spaces` as `KEY=value` followed by `with`
and `spaces` exec attempts. Symptom: cron log spam
`/opt/agnes/.env: line 14: petr@keboola.com: command not found`,
the cron exits non-zero, and no auto-upgrade ever happens. Caddy
itself reads the value fine because docker-compose env_file=.env
parses key=value properly without shell-evaluating the rest.
Fix: emit `CADDY_TLS="tls <email>"` instead of `CADDY_TLS=tls <email>`.
Both the cron source and docker-compose env_file accept the quoted
form; cron stops failing.
* fix(auth): use searchTransitiveGroups + security label for non-admin user
Three bugs in the original cherry-pick + my prior fix attempt, all caught
by a stdlib probe script (scripts/debug/probe_google_groups.py) run
locally with a Playground-issued OAuth token:
1. Wrong endpoint. `groups:search` is the admin "find groups in org"
endpoint and 400s for non-admin users regardless of query. Switched
to `groups/-/memberships:searchTransitiveGroups` which is the
user-perspective "what groups am I in" endpoint.
2. Wrong label. Querying with `cloudidentity.googleapis.com/groups.discussion_forum`
returns 403 "Insufficient permissions to retrieve memberships" even
on the new endpoint — Workspace policy denies non-admin reads of
discussion-forum groups. Switching to `groups.security` returns 200
with the actual membership list. Empirically every Workspace group
at Keboola carries BOTH labels, so the security filter sees the full
set anyway. Confirmed with the probe script.
3. Wrong response shape. `searchTransitiveGroups` returns
{"memberships": [...]}, not {"groups": [...]}. Parser updated
accordingly.
Also adds scripts/debug/probe_google_groups.py — stdlib-only standalone
probe that hits 6 candidate endpoints with a user OAuth token. Saved a
deploy cycle (~10 min) per query iteration; future API-syntax debugging
should start there.
Verified end-to-end: petr@keboola.com login on agnes-dev returns 5
groups (LIC-1PASSWORD, ROLE_ATLASSIAN_*, etc.) via the probe; once
deployed, the same will populate session["google_groups"] and render
on /profile.
* test(auth): update Google groups parser fixture to match searchTransitiveGroups shape
Mock payload was `{"groups": [...]}` (the shape `groups:search` returns).
After switching to `groups/-/memberships:searchTransitiveGroups` in the
prior commit, the actual response is `{"memberships": [...]}` and the
parser iterates that key. Test now mirrors the real shape.
The per-item structure (groupKey.id + displayName) is unchanged, so the
expected output dict stays the same: [{"id": "...", "name": "..."}].
* docs(auth): add docs/auth-groups.md — Google Workspace groups runbook
Captures the non-obvious bits: the GCP-side setup checklist (Cloud
Identity API + scope on consent screen + Internal user type), the
`security` vs `discussion_forum` label trap (the latter 403s for
non-admins, the former 200s — one of those is a 4-iteration debug
session and shouldn't have to be repeated), where groups are stored
(session, not DB) and how to refresh (re-login), plus how to use the
probe script for future API-syntax issues.
Deliberately stops short of explaining "what is Cloud Identity" or
"what is OAuth scope" — those belong in Google's own docs, not ours.
* docs(claude): document release workflows + module versioning + recreate trick
New "Release & deploy workflows" section in CLAUDE.md covers what didn't
exist anywhere in the repo before:
- Distinction between release.yml (auto-build per push) vs the new
keboola-deploy.yml (tag-triggered, explicit deploy only) — plus when
to use which (per-developer convenience vs shared dev VM safety)
- Module versioning (infra-vX.Y.Z) and the bump-after-merge dance
- The lifecycle.ignore_changes [metadata_startup_script] gotcha and how
to force a recreate via workflow_dispatch's recreate_targets input
All generic — no customer hostnames, project IDs, IPs. Customer-specific
deploy steps belong in the consuming infra repo's README.
Also: cross-reference docs/auth-groups.md from the Authentication
section so future Claude sessions find the Workspace-groups runbook
without grepping.
---------
Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
* feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support
Three coordinated changes that together unblock Keboola's internal Agnes
deployment from the foot-gun where the dev VM tracks `:dev` (= last push
from anyone in the upstream repo).
1. .github/workflows/keboola-deploy.yml — new workflow
Triggered ONLY on `keboola-deploy-*` git tag pushes (not on every branch
push like release.yml). Builds an image and publishes two GHCR tags:
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-<git-tag-suffix>
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-latest
The Keboola dev VM pins to `keboola-deploy-latest`; an operator deploys
by `git tag keboola-deploy-foo && git push origin keboola-deploy-foo`.
Audit trail lives in git tags (immutable, who-tagged-what-when), no
PR-cycle needed for each deploy.
Doesn't touch Vojta/Minas/David workflow — release.yml still builds
`:dev-<slug>` for every branch push as before.
2. Caddyfile — parametrize TLS directive via $CADDY_TLS env var
PR #51 hardcoded cert-file mode (`tls /certs/fullchain.pem ...`) for
Groupon's corporate CA flow. That broke the Let's Encrypt path the
module previously supported. Now:
CADDY_TLS unset (default) → cert-file mode (Groupon corp PKI)
CADDY_TLS="tls user@x.com" → Let's Encrypt auto-issue
CADDY_TLS="tls internal" → Caddy-managed self-signed (lab/dev)
Single Caddyfile, three regimes, no per-deployment fork. Validated with
`caddy validate` in all three modes.
3. customer-instance module — dev_instances TLS + auto-set CADDY_TLS
- variables.tf: dev_instances object schema gains optional tls_mode +
domain (mirroring prod_instance). Defaults to "none" + "" so existing
callers without those fields keep current behavior.
- startup-script.sh.tpl: when tls_mode="caddy" and DOMAIN is set, write
CADDY_TLS=tls <ACME_EMAIL> (or "tls internal" when ACME_EMAIL empty)
into /opt/agnes/.env. Caddy then picks it up and the Caddyfile
substitution flips the cert source.
For an LE deploy: set tls_mode="caddy", domain="agnes-dev.example.com",
ensure DNS A-record points at the VM, and acme_email is set on the
module (or seed_admin_email is, since acme_email defaults to it).
After this lands, tag as infra-v1.6.0 so downstream infra repos can bump
their module ref without needing the upstream change tracking.
* feat(deploy): fetch optional Google OAuth credentials from Secret Manager
Mirrors the existing keboola-storage-token / agnes-<customer>-jwt-secret
pattern: VM SA reads google-oauth-client-{id,secret} secrets at boot
(if they exist + IAM is wired by caller via runtime_secrets) and writes
them into /opt/agnes/.env. Empty / missing / 403 → silent fallback
to "" so password and email auth keep working untouched.
Pairs with downstream change in agnes-infra-keboola which adds the two
secret names to runtime_secrets, granting the Keboola VM SA secretAccessor
on them. Operator pre-creates the SM containers via gcloud secrets create
google-oauth-client-{id,secret} (one-time, out of band) — values stay
in SM forever; rotation = `gcloud secrets versions add`.
This unblocks the Keboola agnes-dev deploy from PR #3 (infra) — without
GOOGLE_CLIENT_{ID,SECRET} in .env, app/auth/providers/google.is_available()
returns False and the Google sign-in button never even appears.
* dryrun: intentional failing test (will be reverted)
* feat(auth): optional SEED_ADMIN_PASSWORD to pre-hash seed admin (dev helper)
Terraform gains enable_seed_password + seed_admin_password (sensitive) vars
on the customer-instance module; when enabled the password is piped via
startup-script into /opt/agnes/.env as SEED_ADMIN_PASSWORD. On first boot
app/main.py argon2-hashes it onto the seed user so the admin can log in
immediately without going through /auth/bootstrap. Never overwrites an
existing password_hash — safe against accidental reset on terraform apply.
* ci(release): build :dev-<slug> on any branch, not just feature/**
Before: only 'feature/**' branches triggered release.yml, so pushing
'zs/my-edit' or 'fix/bug' did not publish an image. dev_instances entry
pinning image_tag = 'dev-zs-my-edit' then crashed VM startup with
'image not found'.
Now: any branch push (except main, which produces :stable) publishes
:dev-<slug>. Slug strips a leading 'feature/' and replaces non-[a-z0-9-]
with '-', keeping existing feature/** behavior identical.
* Revert "dryrun: intentional failing test (will be reverted)"
This reverts commit cf9cc06a7884bb401ff29fc5cb6d8baf84dc3daa.
Before: startup script wrote AGNES_VERSION=stable (the floating tag name)
into .env, which overrode the image's build-time ENV AGNES_VERSION=2026.04.47.
UI badge showed 'stable-stable' instead of 'stable-2026.04.47'.
After:
- Dockerfile ARG/ENV for AGNES_COMMIT_SHA (alongside existing VERSION + CHANNEL)
- release.yml passes github.sha as AGNES_COMMIT_SHA build-arg
- Startup script no longer writes these three into .env; the app reads them
from the image ENV set at build time.
Result: badge displays 'stable-2026.04.47 · stable · <time> ago' with the
real CalVer, and the commit SHA tooltip points at an actual commit rather
than the floating manifest digest.
UI now shows a small footer badge with:
- release channel + CalVer version (e.g. 'stable-2026.04.47')
- floating image tag (e.g. 'stable')
- time since last container restart (proxy for 'last deployed')
Backend:
- app/api/health.py: /api/health returns image_tag, commit_sha, deployed_at
- app/api/health.py: new /api/version endpoint (lightweight, no DB hit, for
footer badge polling)
Infra:
- startup-script.sh.tpl: resolves image digest from ghcr pull, derives
channel + version from the tag name, and writes AGNES_VERSION /
RELEASE_CHANNEL / AGNES_COMMIT_SHA into .env so the app can surface them
to the UI.
UI:
- app/web/templates/base.html: footer loads /api/version asynchronously and
renders '<channel>-<version> · <tag> · deployed <relative> (<UTC>)'.
Tooltip shows full detail (commit sha, schema version).
Critical fixes:
- C1: VM SA now gets secretmanager.secretAccessor only on specific secrets
(JWT + each entry in runtime_secrets). Previously project-wide.
- C3: chmod 640 on /var/log/agnes-startup.log (defense in depth)
- C4: Remove '|| echo ""' fallback on keboola-storage-token — boot now fails
loudly if the secret is missing instead of starting a broken app.
- C5: Cron auto-upgrade script sources /opt/agnes/.env for AGNES_TAG. If an
operator edits .env to pin a specific stable-YYYY.MM.N, cron picks it up
immediately with no drift. Removed AGNES_TAG from crontab entry.
- C7: explicit depends_on = [IAM bindings, secret_version] on VM — prevents
race where VM boots before IAM propagates.
Important fixes:
- I1: Split firewall into web (80/443 + conditional 8000) and ssh (port 22 with
configurable source_ranges, default IAP range only).
- I4: Fetch docker-compose files from compose_ref (default 'main'), so customers
can pin a specific tag for reproducibility.
- I5+I6: Merge order fixed — user-supplied dev_instances values now override
defaults (was the other way around). Dev tls_mode default flipped to 'none'.
- I7: Remove '|| true' on Caddyfile fetch; surface failures loudly.
- New acme_email variable (falls back to seed_admin_email if empty).
Out-of-module:
- Comments translated from Czech to English where applicable (M1).
The CI smoke test failed because docker-compose.prod.yml forced a bind mount
to /data on the host — which doesn't exist on GitHub runners.
Split the bind mount into docker-compose.host-mount.yml, which is only
composed by the VM startup script (/data exists there, mounted from the
persistent disk). CI continues to use the default named volume.
Module startup script + auto-upgrade cron now compose all three:
-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml
Watchtower container has Docker API mismatch (client 1.25 vs daemon 1.54+)
that can't be worked around without upstream fix. Simple cron job does the
same thing more reliably:
- Every 5 min: docker compose pull + detect digest change + up -d if changed
- Logs to /var/log/agnes-auto-upgrade.log
This removes the watchtower container and a Docker daemon dependency.