* feat(deploy): keboola-deploy tag-triggered workflow + Caddyfile LE/internal modes + dev_instances TLS support
Three coordinated changes that together unblock Keboola's internal Agnes
deployment from the foot-gun where the dev VM tracks `:dev` (= last push
from anyone in the upstream repo).
1. .github/workflows/keboola-deploy.yml — new workflow
Triggered ONLY on `keboola-deploy-*` git tag pushes (not on every branch
push like release.yml). Builds an image and publishes two GHCR tags:
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-<git-tag-suffix>
ghcr.io/keboola/agnes-the-ai-analyst:keboola-deploy-latest
The Keboola dev VM pins to `keboola-deploy-latest`; an operator deploys
by `git tag keboola-deploy-foo && git push origin keboola-deploy-foo`.
Audit trail lives in git tags (immutable, who-tagged-what-when), no
PR-cycle needed for each deploy.
Doesn't touch Vojta/Minas/David workflow — release.yml still builds
`:dev-<slug>` for every branch push as before.
2. Caddyfile — parametrize TLS directive via $CADDY_TLS env var
PR #51 hardcoded cert-file mode (`tls /certs/fullchain.pem ...`) for
Groupon's corporate CA flow. That broke the Let's Encrypt path the
module previously supported. Now:
CADDY_TLS unset (default) → cert-file mode (Groupon corp PKI)
CADDY_TLS="tls user@x.com" → Let's Encrypt auto-issue
CADDY_TLS="tls internal" → Caddy-managed self-signed (lab/dev)
Single Caddyfile, three regimes, no per-deployment fork. Validated with
`caddy validate` in all three modes.
3. customer-instance module — dev_instances TLS + auto-set CADDY_TLS
- variables.tf: dev_instances object schema gains optional tls_mode +
domain (mirroring prod_instance). Defaults to "none" + "" so existing
callers without those fields keep current behavior.
- startup-script.sh.tpl: when tls_mode="caddy" and DOMAIN is set, write
CADDY_TLS=tls <ACME_EMAIL> (or "tls internal" when ACME_EMAIL empty)
into /opt/agnes/.env. Caddy then picks it up and the Caddyfile
substitution flips the cert source.
For an LE deploy: set tls_mode="caddy", domain="agnes-dev.example.com",
ensure DNS A-record points at the VM, and acme_email is set on the
module (or seed_admin_email is, since acme_email defaults to it).
After this lands, tag as infra-v1.6.0 so downstream infra repos can bump
their module ref without needing the upstream change tracking.
* feat(deploy): fetch optional Google OAuth credentials from Secret Manager
Mirrors the existing keboola-storage-token / agnes-<customer>-jwt-secret
pattern: VM SA reads google-oauth-client-{id,secret} secrets at boot
(if they exist + IAM is wired by caller via runtime_secrets) and writes
them into /opt/agnes/.env. Empty / missing / 403 → silent fallback
to "" so password and email auth keep working untouched.
Pairs with downstream change in agnes-infra-keboola which adds the two
secret names to runtime_secrets, granting the Keboola VM SA secretAccessor
on them. Operator pre-creates the SM containers via gcloud secrets create
google-oauth-client-{id,secret} (one-time, out of band) — values stay
in SM forever; rotation = `gcloud secrets versions add`.
This unblocks the Keboola agnes-dev deploy from PR #3 (infra) — without
GOOGLE_CLIENT_{ID,SECRET} in .env, app/auth/providers/google.is_available()
returns False and the Google sign-in button never even appears.
* dryrun: intentional failing test (will be reverted)
* feat(auth): optional SEED_ADMIN_PASSWORD to pre-hash seed admin (dev helper)
Terraform gains enable_seed_password + seed_admin_password (sensitive) vars
on the customer-instance module; when enabled the password is piped via
startup-script into /opt/agnes/.env as SEED_ADMIN_PASSWORD. On first boot
app/main.py argon2-hashes it onto the seed user so the admin can log in
immediately without going through /auth/bootstrap. Never overwrites an
existing password_hash — safe against accidental reset on terraform apply.
* ci(release): build :dev-<slug> on any branch, not just feature/**
Before: only 'feature/**' branches triggered release.yml, so pushing
'zs/my-edit' or 'fix/bug' did not publish an image. dev_instances entry
pinning image_tag = 'dev-zs-my-edit' then crashed VM startup with
'image not found'.
Now: any branch push (except main, which produces :stable) publishes
:dev-<slug>. Slug strips a leading 'feature/' and replaces non-[a-z0-9-]
with '-', keeping existing feature/** behavior identical.
* Revert "dryrun: intentional failing test (will be reverted)"
This reverts commit cf9cc06a7884bb401ff29fc5cb6d8baf84dc3daa.
Before: startup script wrote AGNES_VERSION=stable (the floating tag name)
into .env, which overrode the image's build-time ENV AGNES_VERSION=2026.04.47.
UI badge showed 'stable-stable' instead of 'stable-2026.04.47'.
After:
- Dockerfile ARG/ENV for AGNES_COMMIT_SHA (alongside existing VERSION + CHANNEL)
- release.yml passes github.sha as AGNES_COMMIT_SHA build-arg
- Startup script no longer writes these three into .env; the app reads them
from the image ENV set at build time.
Result: badge displays 'stable-2026.04.47 · stable · <time> ago' with the
real CalVer, and the commit SHA tooltip points at an actual commit rather
than the floating manifest digest.
GCP rejected the policy with 'REDUCE_COUNT_FALSE cannot be applied to
metrics with value type DOUBLE' — because ALIGN_FRACTION_TRUE already
produces a fraction 0..1 per series, no need for an additional cross-series
reducer. Simplified: alert when the per-series fraction < 1 for 5 min.
Review M4 predicted this — uptime check filters needed double-checking
against live GCP.
UI now shows a small footer badge with:
- release channel + CalVer version (e.g. 'stable-2026.04.47')
- floating image tag (e.g. 'stable')
- time since last container restart (proxy for 'last deployed')
Backend:
- app/api/health.py: /api/health returns image_tag, commit_sha, deployed_at
- app/api/health.py: new /api/version endpoint (lightweight, no DB hit, for
footer badge polling)
Infra:
- startup-script.sh.tpl: resolves image digest from ghcr pull, derives
channel + version from the tag name, and writes AGNES_VERSION /
RELEASE_CHANNEL / AGNES_COMMIT_SHA into .env so the app can surface them
to the UI.
UI:
- app/web/templates/base.html: footer loads /api/version asynchronously and
renders '<channel>-<version> · <tag> · deployed <relative> (<UTC>)'.
Tooltip shows full detail (commit sha, schema version).
Critical fixes:
- C1: VM SA now gets secretmanager.secretAccessor only on specific secrets
(JWT + each entry in runtime_secrets). Previously project-wide.
- C3: chmod 640 on /var/log/agnes-startup.log (defense in depth)
- C4: Remove '|| echo ""' fallback on keboola-storage-token — boot now fails
loudly if the secret is missing instead of starting a broken app.
- C5: Cron auto-upgrade script sources /opt/agnes/.env for AGNES_TAG. If an
operator edits .env to pin a specific stable-YYYY.MM.N, cron picks it up
immediately with no drift. Removed AGNES_TAG from crontab entry.
- C7: explicit depends_on = [IAM bindings, secret_version] on VM — prevents
race where VM boots before IAM propagates.
Important fixes:
- I1: Split firewall into web (80/443 + conditional 8000) and ssh (port 22 with
configurable source_ranges, default IAP range only).
- I4: Fetch docker-compose files from compose_ref (default 'main'), so customers
can pin a specific tag for reproducibility.
- I5+I6: Merge order fixed — user-supplied dev_instances values now override
defaults (was the other way around). Dev tls_mode default flipped to 'none'.
- I7: Remove '|| true' on Caddyfile fetch; surface failures loudly.
- New acme_email variable (falls back to seed_admin_email if empty).
Out-of-module:
- Comments translated from Czech to English where applicable (M1).
- google_compute_resource_policy.daily_backup: daily snapshot at 02:00,
30-day retention, labels (app=agnes, customer=<name>)
- google_compute_disk_resource_policy_attachment.data_backup: attach policy
to each data disk (prod + dev)
- google_monitoring_uptime_check_config.health: per-VM /api/health uptime
check every 60s, 10s timeout
- google_monitoring_alert_policy.health_failure: alert when uptime check
fails for > 5 min
New opt-out: enable_monitoring = false (default true)
New opt-in: notification_channel_ids = [...] to wire alerts to email/Slack
Module API unchanged; existing customers pick up backups + monitoring on
next module upgrade. TF provider requirement unchanged.
The CI smoke test failed because docker-compose.prod.yml forced a bind mount
to /data on the host — which doesn't exist on GitHub runners.
Split the bind mount into docker-compose.host-mount.yml, which is only
composed by the VM startup script (/data exists there, mounted from the
persistent disk). CI continues to use the default named volume.
Module startup script + auto-upgrade cron now compose all three:
-f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml
Watchtower container has Docker API mismatch (client 1.25 vs daemon 1.54+)
that can't be worked around without upstream fix. Simple cron job does the
same thing more reliably:
- Every 5 min: docker compose pull + detect digest change + up -d if changed
- Logs to /var/log/agnes-auto-upgrade.log
This removes the watchtower container and a Docker daemon dependency.
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch