Completes the previous commit — bakes the full git SHA into the image ENV
at build time so the UI badge shows a real commit, not a sha256 digest
(which was the floating manifest digest and unhelpful for debugging).
Before: startup script wrote AGNES_VERSION=stable (the floating tag name)
into .env, which overrode the image's build-time ENV AGNES_VERSION=2026.04.47.
UI badge showed 'stable-stable' instead of 'stable-2026.04.47'.
After:
- Dockerfile ARG/ENV for AGNES_COMMIT_SHA (alongside existing VERSION + CHANNEL)
- release.yml passes github.sha as AGNES_COMMIT_SHA build-arg
- Startup script no longer writes these three into .env; the app reads them
from the image ENV set at build time.
Result: badge displays 'stable-2026.04.47 · stable · <time> ago' with the
real CalVer, and the commit SHA tooltip points at an actual commit rather
than the floating manifest digest.
The earlier base.html edit only affected templates that extend base.html
(login.html via base_login.html). Most pages (dashboard, catalog,
admin_tables, admin_permissions, activity_center, corporate_memory, ...)
are standalone templates with their own <body>, so the badge never showed.
Fix: extracted the badge + fetch script into a _version_badge.html partial
and included it before </body> in every full-page template. The badge now
renders consistently across login, dashboard, admin, catalog, etc.
GCP rejected the policy with 'REDUCE_COUNT_FALSE cannot be applied to
metrics with value type DOUBLE'. ALIGN_FRACTION_TRUE already produces a
fraction 0..1 per series, so no additional cross-series reducer is needed.
Simplified: alert when the per-series fraction stays < 1 for 5 min.
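The simplified condition can be modelled offline. The sketch below is not the
alert policy itself; it only mirrors the semantics described above, assuming
one aligned fraction per minute for a single series. The function name and
the one-minute alignment period are hypothetical.

```python
from typing import Sequence


def should_alert(fractions: Sequence[float], window: int = 5) -> bool:
    """Return True if the last `window` aligned points are all below 1.0.

    Each point stands for the fraction (0..1) of passing checks in one
    alignment period of a single series (assumed to be one minute).
    """
    if len(fractions) < window:
        return False  # not enough history to cover the full duration
    return all(f < 1.0 for f in fractions[-window:])
```

A single fully passing minute inside the window resets the condition, which
is why no cross-series count is needed.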
Review M4 predicted this — uptime check filters needed double-checking
against live GCP.
v1.3.0 added google_monitoring_uptime_check_config + alert policies to the
module, but bootstrap-gcp.sh was not updated. Fresh customers (and the
first apply after upgrading existing customers) hit 403 on
monitoring.uptimeCheckConfigs.create.
Fix: enable monitoring.googleapis.com + grant roles/monitoring.editor to
the deploy SA. Idempotent (safe to re-run on existing projects).
UI now shows a small footer badge with:
- release channel + CalVer version (e.g. 'stable-2026.04.47')
- floating image tag (e.g. 'stable')
- time since last container restart (proxy for 'last deployed')
Backend:
- app/api/health.py: /api/health returns image_tag, commit_sha, deployed_at
- app/api/health.py: new /api/version endpoint (lightweight, no DB hit, for
footer badge polling)
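A minimal sketch of what such a no-DB payload could look like, assuming the
values come from environment variables and that the process start time
stands in for 'last deployed'. The function name, the 'version' and
'channel' keys, and the 'unknown' fallback are assumptions; the real handler
in app/api/health.py may differ.

```python
import os
from datetime import datetime, timezone
from typing import Mapping, Optional

# Captured once at import. A container restart resets it, so it works as a
# proxy for "last deployed".
_PROCESS_STARTED_AT = datetime.now(timezone.utc)


def version_payload(env: Optional[Mapping[str, str]] = None,
                    started_at: datetime = _PROCESS_STARTED_AT) -> dict:
    """Build the version response from the environment only (no DB access)."""
    env = os.environ if env is None else env
    return {
        "version": env.get("AGNES_VERSION", "unknown"),
        "channel": env.get("RELEASE_CHANNEL", "unknown"),
        "image_tag": env.get("AGNES_TAG", "unknown"),
        "commit_sha": env.get("AGNES_COMMIT_SHA", "unknown"),
        "deployed_at": started_at.isoformat(),
    }
```

Passing `env` and `started_at` explicitly keeps the function easy to test
without touching the real process environment.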
Infra:
- startup-script.sh.tpl: resolves image digest from ghcr pull, derives
channel + version from the tag name, and writes AGNES_VERSION /
RELEASE_CHANNEL / AGNES_COMMIT_SHA into .env so the app can surface them
to the UI.
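The tag-to-channel/version derivation can be illustrated as follows. The
startup script is a shell template, so this Python version is only a sketch
of the parsing rule, assuming tags of the form '<channel>-<version>'. The
function name is hypothetical.

```python
from typing import Tuple


def split_image_tag(tag: str) -> Tuple[str, str]:
    """Split 'stable-2026.04.47' into ('stable', '2026.04.47').

    A floating tag such as 'stable' carries no version part, so the version
    is returned as an empty string.
    """
    channel, sep, version = tag.partition("-")
    if not sep:
        return tag, ""
    return channel, version
```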
UI:
- app/web/templates/base.html: footer loads /api/version asynchronously and
renders '<channel>-<version> · <tag> · deployed <relative> (<UTC>)'.
Tooltip shows full detail (commit sha, schema version).
- docs/DEPLOYMENT.md: rewritten to pick between Terraform (managed) and
Docker Compose (OSS self-host). Old manual SSH-key-and-git-clone flow
replaced with compose-based instructions pointing at the persistent-disk
overlay and bootstrap endpoint.
- docs/ONBOARDING.md: section 4 now documents the new v1.4.0 variables
(runtime_secrets, firewall_ssh_source_ranges, notification_channel_ids,
compose_ref). Section 6 explains the /auth/bootstrap seed-user fix and
warns that destroy+apply reopens the bootstrap window until /auth/bootstrap
is run again.
- README.md: Documentation list expanded — ONBOARDING.md first (recommended
path), DEPLOYMENT.md as the branching point, plus links to CONFIGURATION,
architecture, and QUICKSTART.
Bug: SEED_ADMIN_EMAIL creates a password-less user at app startup, which made
/auth/bootstrap return 403 '1 users already exist' on a fresh deployment —
leaving the operator no way to log in (the seed user has no password, and
/auth/token requires one).
Fix: bootstrap is now disabled only when at least one user has a
password_hash set. On a fresh deploy with a seed user:
- POST /auth/bootstrap { email: <matches seed>, password: X } → sets the
password on the seed user, promotes to admin, returns token.
- With a non-matching email, a new admin is created alongside the seed user.
Lock semantics: bootstrap self-deactivates as soon as any password is set.
Tests: 8 passing, including new test_bootstrap_activates_seed_user and
test_bootstrap_disabled_when_password_user_exists covering the two halves.
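The lock rule can be sketched with an in-memory user list. This is an
illustration of the decision logic only: password hashing and token issuance
are left out, and the `User` class, `BootstrapDisabled` exception, and
`bootstrap` function are hypothetical names, not the app's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    email: str
    password_hash: Optional[str] = None
    is_admin: bool = False


class BootstrapDisabled(Exception):
    """Raised once any user already has a password set."""


def bootstrap(users: List[User], email: str, password_hash: str) -> User:
    # Locked only when some user has a password; a password-less seed user
    # does not count.
    if any(u.password_hash for u in users):
        raise BootstrapDisabled("bootstrap is locked")
    for user in users:
        if user.email == email:
            # Matching seed user: set the password and promote to admin.
            user.password_hash = password_hash
            user.is_admin = True
            return user
    # Non-matching email: create a new admin alongside the seed user.
    admin = User(email=email, password_hash=password_hash, is_admin=True)
    users.append(admin)
    return admin
```

After the first successful call, every later call raises, which matches the
self-deactivating behaviour described above.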
Critical fixes:
- C1: VM SA now gets secretmanager.secretAccessor only on specific secrets
(JWT + each entry in runtime_secrets). Previously project-wide.
- C3: chmod 640 on /var/log/agnes-startup.log (defense in depth)
- C4: Remove '|| echo ""' fallback on keboola-storage-token — boot now fails
loudly if the secret is missing instead of starting a broken app.
- C5: Cron auto-upgrade script sources /opt/agnes/.env for AGNES_TAG. If an
operator edits .env to pin a specific stable-YYYY.MM.N, cron picks it up
on its next run with no drift. Removed AGNES_TAG from the crontab entry.
- C7: explicit depends_on = [IAM bindings, secret_version] on VM — prevents
race where VM boots before IAM propagates.
Important fixes:
- I1: Split firewall into web (80/443 + conditional 8000) and ssh (port 22 with
configurable source_ranges, default IAP range only).
- I4: Fetch docker-compose files from compose_ref (default 'main'), so customers
can pin a specific tag for reproducibility.
- I5+I6: Merge order fixed — user-supplied dev_instances values now override
defaults (was the other way around). Dev tls_mode default flipped to 'none'.
- I7: Remove '|| true' on Caddyfile fetch; surface failures loudly.
- New acme_email variable (falls back to seed_admin_email if empty).
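The merge-order fix in I5+I6 comes down to which side is applied last. The
module does this in Terraform, where merge() gives precedence to later
arguments; the Python sketch below shows the same rule. The `image_tag`
default is hypothetical; only tls_mode 'none' is taken from the text above.

```python
from typing import Any, Dict

# Hypothetical defaults for a dev instance.
DEV_DEFAULTS: Dict[str, Any] = {"tls_mode": "none", "image_tag": "dev"}


def resolve_dev_instance(user_values: Dict[str, Any]) -> Dict[str, Any]:
    # Later entries win, so user-supplied values override the defaults.
    # The bug was the reverse order: {**user_values, **DEV_DEFAULTS}.
    return {**DEV_DEFAULTS, **user_values}
```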
Out-of-module:
- Comments translated from Czech to English where applicable (M1).
- google_compute_resource_policy.daily_backup: daily snapshot at 02:00,
30-day retention, labels (app=agnes, customer=<name>)
- google_compute_disk_resource_policy_attachment.data_backup: attach policy
to each data disk (prod + dev)
- google_monitoring_uptime_check_config.health: per-VM /api/health uptime
check every 60s, 10s timeout
- google_monitoring_alert_policy.health_failure: alert when uptime check
fails for > 5 min
New opt-out: enable_monitoring = false (default true)
New opt-in: notification_channel_ids = [...] to wire alerts to email/Slack
Module API unchanged; existing customers pick up backups + monitoring on
next module upgrade. TF provider requirement unchanged.
Extracts the branch name from GITHUB_REF, slugifies it, and adds it as an
extra tag on feature branch builds. Main branch is unaffected (no
branch_slug output). This lets dev_instances entries in tfvars pin image_tag
to a specific feature branch.
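A sketch of the slug rule, assuming lowercase alphanumerics separated by
single dashes. The workflow does this in CI, so the exact character set and
the length cap used here are assumptions, and the function name is
hypothetical.

```python
import re
from typing import Optional


def branch_slug(github_ref: str, max_len: int = 63) -> Optional[str]:
    """Return a tag-safe slug for a feature branch, or None for main/non-branch refs."""
    prefix = "refs/heads/"
    if not github_ref.startswith(prefix):
        return None  # tags and pull-request refs get no slug
    branch = github_ref[len(prefix):]
    if branch == "main":
        return None  # main builds emit no branch_slug
    # Collapse every run of other characters into one dash.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return slug[:max_len].strip("-") or None
```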
The CI smoke test failed because docker-compose.prod.yml forced a bind mount
to /data on the host — which doesn't exist on GitHub runners.
Split the bind mount into docker-compose.host-mount.yml, which is only
composed by the VM startup script (/data exists there, mounted from the
persistent disk). CI continues to use the default named volume.
Module startup script + auto-upgrade cron now compose all three:
-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml
The Watchtower container has a Docker API mismatch (client 1.25 vs daemon
1.54+) that can't be worked around without an upstream fix. A simple cron
job does the same thing more reliably:
- Every 5 min: docker compose pull + detect digest change + up -d if changed
- Logs to /var/log/agnes-auto-upgrade.log
This removes the watchtower container and a Docker daemon dependency.
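The pull/compare/restart loop can be sketched as below. The real job is a
shell script run from cron; this Python version only illustrates the control
flow. It compares the local image ID before and after the pull, which is one
possible way to detect a change and is an assumption here, as are the
function names. The -f compose file arguments are omitted for brevity.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command, raise on a non-zero exit, and return its stdout."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


def image_id(image: str, runner: Runner) -> str:
    """Return the local image ID that `image` currently resolves to."""
    out = runner(["docker", "image", "inspect", "--format", "{{.Id}}", image])
    return out.strip()


def auto_upgrade(image: str, runner: Runner = run) -> bool:
    """Pull, then restart the stack only if the image ID changed."""
    before = image_id(image, runner)
    runner(["docker", "compose", "pull"])
    after = image_id(image, runner)
    if before == after:
        return False  # nothing new; leave the running containers alone
    runner(["docker", "compose", "up", "-d"])
    return True
```

The injectable `runner` makes it possible to exercise the logic without a
Docker daemon.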
Without this override, docker-compose creates a named volume 'agnes_data'
on the boot disk, ignoring any persistent disk mounted at /data by the
VM startup script. This override makes the 'data' volume a bind mount
to host /data, so persistent disks work as expected.
Reads JWT_SECRET_KEY and KEBOOLA_STORAGE_TOKEN from Secret Manager,
combines with non-secret config, writes .env with chmod 600.
Run as part of VM startup or manually for rotation.
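The write step can be sketched as follows, assuming the secrets have already
been fetched and merged with the non-secret config into one mapping. The
function name is hypothetical, and the Secret Manager call is deliberately
left out.

```python
import os
from typing import Mapping


def write_env_file(path: str, values: Mapping[str, str]) -> None:
    """Write KEY=value lines to `path` with owner-only permissions."""
    body = "".join(f"{key}={value}\n" for key, value in values.items())
    # Create with 0600 from the start so the secrets are never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # The mode passed to os.open is ignored for a pre-existing file,
    # so tighten it explicitly as well.
    os.chmod(path, 0o600)
```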
Creates agnes-deploy SA with Terraform-scoped roles, GCS tfstate bucket,
and generates a JSON key. Idempotent — safe to re-run.
Expanded .gitignore to block *-key.json files from ever being committed.
- Spec: pure self-deploy model with per-customer GCP project
- Public upstream repo with TF module; private template + per-customer repos
- Branch-aware dev VMs via dev_instances list
- Caddy TLS, Secret Manager for tokens, SA JSON key for CI (WIF follow-up)
- 6-phase implementation plan with bite-sized tasks
- Create empty .env before docker compose up in CI (env_file: .env is required)
- Mock get_jira_service in webhook HMAC test to isolate signature check
from Jira API availability; the test now strictly asserts 200 instead of
also tolerating 500
Replace module-level SECRET_KEY cache with lazy _get_cached_secret_key()
that re-reads env vars in test mode. This fixes 20 test failures caused
by JWT secret mismatch when test modules load in different orders.