Commit graph

383 commits

Author SHA1 Message Date
ZdenekSrotyr
ada9fb75f6
chore: add switch-dev-vm.sh helper for hackathon (#20) 2026-04-21 21:33:02 +02:00
ZdenekSrotyr
2cbffce85f
ci: propagate infra-v* tags to template repo + auto-merge rules (#17)
* dryrun: verify per-branch GHCR tag

* ci: propagate infra-v* tag bumps to template repo

On push of any infra-v* tag, opens a PR in keboola/agnes-infra-template
that bumps the module ref in terraform/main.tf. Auto-merge rules in the
template (Renovate + CI validate + GitHub native auto-merge) land it
without manual work on patch/minor bumps.

Requires repo secret TEMPLATE_REPO_TOKEN (fine-grained PAT with
Contents:write + Pull requests:write on keboola/agnes-infra-template).

Fail-soft: if secret is missing the job is skipped and Renovate on the
template repo picks up the new tag on its next cycle as a fallback.

* docs(onboarding): 'Keeping the template up-to-date' maintainer section

Documents the two mechanisms (upstream release hook + Renovate), the
required repo settings (allow_auto_merge, validate.yml gate), the TOKEN
secret setup, and the one-time setup checklist. Notes the difference
between template repo (auto-merge on) and customer infra repos
(human approval).
2026-04-21 21:32:58 +02:00
ZdenekSrotyr
e4f6910398 Merge: real CalVer + commit SHA in UI badge 2026-04-21 21:00:42 +02:00
ZdenekSrotyr
1c7cc8aa29 fix(image): add AGNES_COMMIT_SHA build-arg to Dockerfile + release.yml
Completes the previous commit — bakes the full git SHA into the image ENV
at build time so the UI badge shows a real commit, not a sha256 digest
(which was the floating manifest digest and unhelpful for debugging).
2026-04-21 21:00:30 +02:00
ZdenekSrotyr
af6761f33e fix(version): bake AGNES_VERSION/CHANNEL/COMMIT_SHA into image ENV
Before: startup script wrote AGNES_VERSION=stable (the floating tag name)
into .env, which overrode the image's build-time ENV AGNES_VERSION=2026.04.47.
UI badge showed 'stable-stable' instead of 'stable-2026.04.47'.

After:
- Dockerfile ARG/ENV for AGNES_COMMIT_SHA (alongside existing VERSION + CHANNEL)
- release.yml passes github.sha as AGNES_COMMIT_SHA build-arg
- Startup script no longer writes these three into .env; the app reads them
  from the image ENV set at build time.

Result: badge displays 'stable-2026.04.47 · stable · <time> ago' with the
real CalVer, and the commit SHA tooltip points at an actual commit rather
than the floating manifest digest.
2026-04-21 21:00:04 +02:00
ZdenekSrotyr
7553f77e55 Merge: version badge partial on all full-page templates 2026-04-21 20:52:04 +02:00
ZdenekSrotyr
432e7695b3 feat(ui): version badge as shared partial, injected into every full-page template
The earlier base.html edit only affected templates that extend base.html
(login.html via base_login.html). Most pages (dashboard, catalog,
admin_tables, admin_permissions, activity_center, corporate_memory, ...)
are standalone templates with their own <body>, so the badge never showed.

Fix: extracted the badge + fetch script into _version_badge.html partial,
included it before </body> in every full-page template. Consistent across
login, dashboard, admin, catalog, etc.
2026-04-21 20:51:55 +02:00
ZdenekSrotyr
dbac3e698c Merge: alert policy reducer fix 2026-04-21 20:36:21 +02:00
ZdenekSrotyr
9a99a82e92 fix(infra): alert policy aggregation — drop cross_series_reducer
GCP rejected the policy with 'REDUCE_COUNT_FALSE cannot be applied to
metrics with value type DOUBLE' — because ALIGN_FRACTION_TRUE already
produces a fraction 0..1 per series, no need for an additional cross-series
reducer. Simplified: alert when the per-series fraction < 1 for 5 min.

Review M4 predicted this — uptime check filters needed double-checking
against live GCP.
2026-04-21 20:36:09 +02:00
ZdenekSrotyr
717f40c218 Merge: bootstrap monitoring role fix 2026-04-21 20:32:59 +02:00
ZdenekSrotyr
4ab0838ba2 fix(bootstrap): grant monitoring.editor + enable monitoring API
v1.3.0 added google_monitoring_uptime_check_config + alert policies to the
module, but bootstrap-gcp.sh was not updated. Fresh customers (and the
first apply after upgrading existing customers) hit 403 on
monitoring.uptimeCheckConfigs.create.

Fix: enable monitoring.googleapis.com + grant roles/monitoring.editor to
the deploy SA. Idempotent (safe to re-run on existing projects).
2026-04-21 20:32:50 +02:00
ZdenekSrotyr
3fb17a13bb Merge: workflow-driven recreate + docs 2026-04-21 20:24:40 +02:00
ZdenekSrotyr
1a55167234 docs: workflow-driven VM recreate for startup-script propagation
- ONBOARDING.md: replace 'propagating module changes' section with two
  explicit options — workflow_dispatch with recreate_targets (recommended,
  CI audit trail), or local terraform apply -replace (emergency). Adds a
  'do not' section banning manual .env edits on VMs.
- deployment-log.md: iteration 4 summary (version badge + module v1.5.0 +
  workflow_dispatch).
2026-04-21 20:24:31 +02:00
ZdenekSrotyr
11c03f7235 Merge: version badge in footer + /api/version 2026-04-21 20:19:51 +02:00
ZdenekSrotyr
b091cf7003 feat(ui): version badge in footer + /api/version endpoint
UI now shows a small footer badge with:
- release channel + CalVer version (e.g. 'stable-2026.04.47')
- floating image tag (e.g. 'stable')
- time since last container restart (proxy for 'last deployed')

Backend:
- app/api/health.py: /api/health returns image_tag, commit_sha, deployed_at
- app/api/health.py: new /api/version endpoint (lightweight, no DB hit, for
  footer badge polling)

Infra:
- startup-script.sh.tpl: resolves image digest from ghcr pull, derives
  channel + version from the tag name, and writes AGNES_VERSION /
  RELEASE_CHANNEL / AGNES_COMMIT_SHA into .env so the app can surface them
  to the UI.

UI:
- app/web/templates/base.html: footer loads /api/version asynchronously and
  renders '<channel>-<version> · <tag> · deployed <relative> (<UTC>)'.
  Tooltip shows full detail (commit sha, schema version).
2026-04-21 20:19:40 +02:00
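A rough sketch of deriving channel and version from a tag name, as the startup script in this commit is described as doing. The actual parsing lives in startup-script.sh.tpl and is not shown here; the function name `parse_image_tag` and the "split on the first hyphen" rule are assumptions for illustration only.

```python
from typing import Optional, Tuple


def parse_image_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split an image tag into (channel, version).

    Assumption: pinned tags look like '<channel>-<version>',
    e.g. 'stable-2026.04.47', while floating tags are just '<channel>'.
    """
    channel, sep, version = tag.partition("-")
    # A floating tag has no hyphen, so there is no version to report.
    return channel, (version if sep else None)
```

Under this assumption a floating tag such as 'stable' carries no version information, so the tag name alone cannot give the badge a CalVer to display.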
ZdenekSrotyr
2743de6114 Merge: deployment log iteration 3 2026-04-21 20:09:27 +02:00
ZdenekSrotyr
cdd959b19f docs(log): add iteration 3 — review, bootstrap fix, docs sweep, infra-v1.4.0 2026-04-21 20:09:13 +02:00
ZdenekSrotyr
c1227df990 Merge: docs sweep — DEPLOYMENT.md rewrite, ONBOARDING v1.4.0, README links 2026-04-21 20:08:23 +02:00
ZdenekSrotyr
0121354596 docs: refresh DEPLOYMENT.md and ONBOARDING.md for infra-v1.4.0
- docs/DEPLOYMENT.md: rewritten to pick between Terraform (managed) and
  Docker Compose (OSS self-host). Old manual SSH-key-and-git-clone flow
  replaced with compose-based instructions pointing at the persistent-disk
  overlay and bootstrap endpoint.
- docs/ONBOARDING.md: section 4 now documents the new v1.4.0 variables
  (runtime_secrets, firewall_ssh_source_ranges, notification_channel_ids,
  compose_ref). Section 6 explains the /auth/bootstrap seed-user fix and
  warns that destroy+apply reopens the bootstrap window until run again.
- README.md: Documentation list expanded — ONBOARDING.md first (recommended
  path), DEPLOYMENT.md as the branching point, plus links to CONFIGURATION,
  architecture, and QUICKSTART.
2026-04-21 20:07:43 +02:00
ZdenekSrotyr
0643437ab8 Merge: /auth/bootstrap seed-user fix 2026-04-21 20:01:38 +02:00
ZdenekSrotyr
2b17973796 fix(auth): /auth/bootstrap activates seed users, disabled only by real password
Bug: SEED_ADMIN_EMAIL creates a password-less user at app startup, which made
/auth/bootstrap return 403 '1 users already exist' on a fresh deployment —
leaving the operator no way to log in (the seed user has no password, and
/auth/token requires one).

Fix: bootstrap is now disabled only when at least one user has a
password_hash set. On a fresh deploy with a seed user:
- POST /auth/bootstrap { email: <matches seed>, password: X } → sets the
  password on the seed user, promotes to admin, returns token.
- With a non-matching email, a new admin is created alongside the seed user.

Lock semantics: bootstrap self-deactivates as soon as any password is set.

Tests: 8 passing, including new test_bootstrap_activates_seed_user and
test_bootstrap_disabled_when_password_user_exists covering the two halves.
2026-04-21 20:01:20 +02:00
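The lock semantics in this commit can be sketched roughly as follows. This is an in-memory illustration, not the application's code: the `User` dataclass, `bootstrap_enabled`, and `bootstrap` are hypothetical names, and password hashing and token issuance are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class User:
    email: str
    password_hash: Optional[str] = None
    is_admin: bool = False


def bootstrap_enabled(users: List[User]) -> bool:
    # Bootstrap stays open until at least one user has a password set.
    return not any(u.password_hash for u in users)


def bootstrap(users: List[User], email: str, password_hash: str) -> User:
    if not bootstrap_enabled(users):
        raise PermissionError("bootstrap disabled: a user with a password exists")
    for user in users:
        if user.email == email:
            # Matching seed user: set the password and promote to admin.
            user.password_hash = password_hash
            user.is_admin = True
            return user
    # Non-matching email: create a new admin alongside the seed user.
    admin = User(email=email, password_hash=password_hash, is_admin=True)
    users.append(admin)
    return admin
```

A password-less seed user therefore no longer blocks the first call, and any successful call closes the window for later ones.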
ZdenekSrotyr
7245eedd23 Merge: code review fixes — scoped SA, fail-fast, firewall split, cron .env 2026-04-21 19:40:07 +02:00
ZdenekSrotyr
921094ae40 feat(infra): address code review — scoped SA, fail-fast secrets, firewall split, cron reads .env, merge fix
Critical fixes:
- C1: VM SA now gets secretmanager.secretAccessor only on specific secrets
  (JWT + each entry in runtime_secrets). Previously project-wide.
- C3: chmod 640 on /var/log/agnes-startup.log (defense in depth)
- C4: Remove '|| echo ""' fallback on keboola-storage-token — boot now fails
  loudly if the secret is missing instead of starting a broken app.
- C5: Cron auto-upgrade script sources /opt/agnes/.env for AGNES_TAG. If an
  operator edits .env to pin a specific stable-YYYY.MM.N, cron picks it up
  immediately with no drift. Removed AGNES_TAG from crontab entry.
- C7: explicit depends_on = [IAM bindings, secret_version] on VM — prevents
  race where VM boots before IAM propagates.

Important fixes:
- I1: Split firewall into web (80/443 + conditional 8000) and ssh (port 22 with
  configurable source_ranges, default IAP range only).
- I4: Fetch docker-compose files from compose_ref (default 'main'), so customers
  can pin a specific tag for reproducibility.
- I5+I6: Merge order fixed — user-supplied dev_instances values now override
  defaults (was the other way around). Dev tls_mode default flipped to 'none'.
- I7: Remove '|| true' on Caddyfile fetch; surface failures loudly.
- New acme_email variable (falls back to seed_admin_email if empty).

Out-of-module:
- Comments translated from Czech to English where applicable (M1).
2026-04-21 19:39:53 +02:00
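The merge-order fix (I5+I6) amounts to putting the defaults first so that user-supplied values win. A small Python analogue of that ordering is below; the module itself is Terraform, and apart from `tls_mode` defaulting to 'none' the keys and values in `DEV_DEFAULTS` are made up for illustration.

```python
from typing import Any, Dict

# Hypothetical defaults; only tls_mode = "none" comes from the commit message.
DEV_DEFAULTS: Dict[str, Any] = {"tls_mode": "none", "image_tag": "dev"}


def effective_dev_instance(user_values: Dict[str, Any]) -> Dict[str, Any]:
    # Later entries win, so user values must be unpacked after the defaults.
    return {**DEV_DEFAULTS, **user_values}
```

Reversing the two unpacking operands reproduces the original bug, where defaults silently overrode whatever the customer set.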
ZdenekSrotyr
9962fc4d40 Merge: final deployment log iteration 2 2026-04-21 19:11:14 +02:00
ZdenekSrotyr
6470e23df3 docs: finalize deployment log — iteration 2 summary 2026-04-21 19:11:07 +02:00
ZdenekSrotyr
1073517969 Merge: onboarding race-condition fix 2026-04-21 19:10:12 +02:00
ZdenekSrotyr
0b4807a836 docs(onboarding): use 'gh repo create --clone' to avoid template-copy race
Separate 'gh repo create --clone=false' + 'git clone' races with GitHub's
template content propagation. '--clone' waits for it in one step.
2026-04-21 19:10:04 +02:00
ZdenekSrotyr
4501840893 Merge: onboarding docs — propagation, restore, monitoring 2026-04-21 19:06:27 +02:00
ZdenekSrotyr
3e9213bfc4 docs(onboarding): add module propagation, backup restore, monitoring setup
- 'Propagating module changes' — explains ignore_changes + -replace workflow
- 'Restoring from backup' — step-by-step disk swap from daily snapshot
- 'Monitoring alerts' — wiring notification channels
2026-04-21 19:06:20 +02:00
ZdenekSrotyr
85bca573a7 Merge: daily backup snapshot + monitoring alerts 2026-04-21 19:02:07 +02:00
ZdenekSrotyr
0842debf8a feat(infra): add daily backup snapshot + monitoring alerts
- google_compute_resource_policy.daily_backup: daily snapshot at 02:00,
  30-day retention, labels (app=agnes, customer=<name>)
- google_compute_disk_resource_policy_attachment.data_backup: attach policy
  to each data disk (prod + dev)
- google_monitoring_uptime_check_config.health: per-VM /api/health uptime
  check every 60s, 10s timeout
- google_monitoring_alert_policy.health_failure: alert when uptime check
  fails for > 5 min

New opt-out: enable_monitoring = false (default true)
New opt-in:  notification_channel_ids = [...] to wire alerts to email/Slack

Module API unchanged; existing customers pick up backups + monitoring on
next module upgrade. TF provider requirement unchanged.
2026-04-21 19:01:56 +02:00
ZdenekSrotyr
0ca8ed2bce Merge: per-branch image tag :dev-<slug> for branch-aware dev deploys 2026-04-21 18:47:16 +02:00
ZdenekSrotyr
5188bd9127 ci: add per-branch image tag :dev-<slug> for branch-aware dev deploys
Extracts branch name from GITHUB_REF, slugifies it, and adds as extra tag
on feature branch builds. Main branch is unaffected (no branch_slug output).

Enables dev_instances tfvar with image_tag pinning specific feature branches.
2026-04-21 18:47:01 +02:00
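A minimal sketch of the slug derivation this commit describes. The workflow's real slugification rules are not shown in the log, so the lowercase-and-hyphenate rule and the names `branch_slug` and `dev_tag` are assumptions.

```python
import re
from typing import Optional


def branch_slug(github_ref: str) -> Optional[str]:
    """Return a tag-safe slug for a feature branch, or None for main and non-branch refs."""
    prefix = "refs/heads/"
    if not github_ref.startswith(prefix):
        return None  # tags and pull-request refs get no branch slug
    branch = github_ref[len(prefix):]
    if branch == "main":
        return None  # main builds are unaffected
    # Assumed rule: lowercase, collapse anything outside [a-z0-9] into '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    return slug or None


def dev_tag(github_ref: str) -> Optional[str]:
    slug = branch_slug(github_ref)
    return f"dev-{slug}" if slug else None
```

With this rule, `refs/heads/fix/ci` would yield the extra tag `dev-fix-ci`, which a dev_instances entry could then pin via image_tag.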
ZdenekSrotyr
1811a408de Merge: fix CI smoke test — split host bind mount to separate overlay 2026-04-21 16:54:27 +02:00
ZdenekSrotyr
1acc89c486 fix(ci): move bind-mount of /data to separate overlay, fix CI smoke test
The CI smoke test failed because docker-compose.prod.yml forced a bind mount
to /data on the host — which doesn't exist on GitHub runners.

Split the bind mount into docker-compose.host-mount.yml, which is only
composed by the VM startup script (/data exists there, mounted from the
persistent disk). CI continues to use the default named volume.

Module startup script + auto-upgrade cron now compose all three:
  -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml
2026-04-21 16:54:18 +02:00
ZdenekSrotyr
a3b4b43e47 Merge: deployment log with final state 2026-04-21 16:51:28 +02:00
ZdenekSrotyr
03dd81c825 docs: update deployment log with final state and onboarding workflow
- Volume fix documented (Docker named volume → bind mount /data)
- Watchtower → cron-based auto-upgrade
- Final state snapshot of VMs, repos, tags, secrets
- Onboarding flow summary for 2nd customer
2026-04-21 16:51:20 +02:00
ZdenekSrotyr
85c6b114b0 Merge: add ONBOARDING.md 2026-04-21 16:49:54 +02:00
ZdenekSrotyr
a44e11a5e2 docs: add ONBOARDING.md — end-to-end per-customer deployment guide 2026-04-21 16:49:45 +02:00
ZdenekSrotyr
3dcdc52faf Merge: replace watchtower with cron, bump infra module to v1.1.0 2026-04-21 16:47:05 +02:00
ZdenekSrotyr
cbd85c52ed fix(infra): replace watchtower with cron for auto-upgrade
The Watchtower container has a Docker API mismatch (client 1.25 vs daemon 1.54+)
that can't be worked around without an upstream fix. A simple cron job does the
same thing more reliably:
- Every 5 min: docker compose pull + detect digest change + up -d if changed
- Logs to /var/log/agnes-auto-upgrade.log

This removes the watchtower container and a Docker daemon dependency.
2026-04-21 16:46:55 +02:00
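The pull, compare, restart cycle the cron job performs can be sketched as below. The real job is a shell script run by cron; this Python version only illustrates the control flow, and the three callables are hypothetical stand-ins for the docker compose commands and the digest lookup.

```python
from typing import Callable


def upgrade_if_changed(
    get_digest: Callable[[], str],
    pull: Callable[[], None],
    restart: Callable[[], None],
) -> bool:
    """Pull the image and restart only when the local digest changed."""
    before = get_digest()
    pull()
    after = get_digest()
    if after != before:
        restart()
        return True
    # Same digest: leave the running containers alone.
    return False
```

Comparing digests rather than tag names is what lets a floating tag such as 'stable' trigger an upgrade when it is repointed.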
ZdenekSrotyr
94b6a8eff2 Merge feature/multi-customer-deployment: multi-customer deployment infra
- infra/modules/customer-instance/ — reusable Terraform module (tag infra-v1.0.0)
- infra/examples/minimal/ — OSS self-host quickstart
- scripts/bootstrap-gcp.sh — per-customer GCP setup
- scripts/fetch-env-from-secrets.sh — VM-side .env from Secret Manager
- docker-compose.prod.yml — bind data volume to host /data for persistent disks
- docs/superpowers/specs/2026-04-21-multi-customer-deployment-spec.md
- docs/superpowers/plans/2026-04-21-multi-customer-deployment.md
- docs/superpowers/plans/2026-04-21-deployment-log.md
2026-04-21 16:43:06 +02:00
ZdenekSrotyr
52d63457ff fix(prod): bind docker data volume to host /data for persistent disk
Without this override, docker-compose creates a named volume 'agnes_data'
on the boot disk, ignoring any persistent disk mounted at /data by the
VM startup script. This override makes the 'data' volume a bind mount
to host /data, so persistent disks work as expected.
2026-04-21 16:42:23 +02:00
ZdenekSrotyr
a2c05a5d97 infra: refactor Terraform into reusable customer-instance module
Breaking changes:
- infra/main.tf, variables.tf, outputs.tf, terraform.tfvars.example removed
- Single-file monolith replaced by reusable module + example

New structure:
- infra/modules/customer-instance/ — the module:
  - main.tf: VMs, disks, firewall, Secret Manager, dedicated VM SA
  - variables.tf: prod_instance + dev_instances flexible schema
  - outputs.tf: IPs, SA email, JWT secret reference
  - startup-script.sh.tpl: bootstraps VM, fetches secrets, runs compose,
    adds Watchtower for auto-upgrade
- infra/examples/minimal/ — OSS self-host quickstart using the module

Supports:
- Per-customer GCP project isolation
- Branch-aware dev VMs via dev_instances list (any image_tag)
- Persistent /data disk (rebuild-safe)
- OS Login (no per-user SSH keys)
- Caddy TLS mode (opt-in via tls_mode="caddy" + domain)
- Watchtower auto-upgrade (opt-in via upgrade_mode="auto")
2026-04-21 16:18:35 +02:00
ZdenekSrotyr
0dd8b13d62 infra: add fetch-env-from-secrets.sh for VM-side .env generation
Reads JWT_SECRET_KEY and KEBOOLA_STORAGE_TOKEN from Secret Manager,
combines with non-secret config, writes .env with chmod 600.
Run as part of VM startup or manually for rotation.
2026-04-21 16:18:35 +02:00
ZdenekSrotyr
5ad96e5f86 infra: add bootstrap-gcp.sh for per-customer GCP setup
Creates agnes-deploy SA with Terraform-scoped roles, GCS tfstate bucket,
and generates a JSON key. Idempotent — safe to re-run.

Expanded .gitignore to block *-key.json files from ever being committed.
2026-04-21 16:18:35 +02:00
ZdenekSrotyr
e514f57267
Merge pull request #6 from keboola/dependabot/uv/python-multipart-0.0.26
chore(deps): bump python-multipart from 0.0.24 to 0.0.26
2026-04-21 15:27:25 +02:00
dependabot[bot]
6e93461918
chore(deps): bump python-multipart from 0.0.24 to 0.0.26
Bumps [python-multipart](https://github.com/Kludex/python-multipart) from 0.0.24 to 0.0.26.
- [Release notes](https://github.com/Kludex/python-multipart/releases)
- [Changelog](https://github.com/Kludex/python-multipart/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Kludex/python-multipart/compare/0.0.24...0.0.26)

---
updated-dependencies:
- dependency-name: python-multipart
  dependency-version: 0.0.26
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-04-21 13:26:19 +00:00
ZdenekSrotyr
e53de59a42 docs: multi-customer deployment spec + implementation plan
- Spec: pure self-deploy model with per-customer GCP project
- Public upstream repo with TF module; private template + per-customer repos
- Branch-aware dev VMs via dev_instances list
- Caddy TLS, Secret Manager for tokens, SA JSON key for CI (WIF follow-up)
- 6-phase implementation plan with bite-sized tasks
2026-04-21 15:25:17 +02:00
ZdenekSrotyr
cf8528b5cf
Merge pull request #7 from keboola/dependabot/uv/authlib-1.6.11
chore(deps): bump authlib from 1.6.9 to 1.6.11
2026-04-21 15:24:57 +02:00