agnes-the-ai-analyst/infra/modules/customer-instance/variables.tf
Vojtech 2447da7bb1
refactor(ops): bake all host artifacts into image, drop every curl-from-main (#149)
* refactor(ops): bake all host artifacts into image, drop every curl-from-main

Replaces the curl-from-main pattern (originally introduced in 0.25.0 for
agnes-auto-upgrade.sh; older for the compose files + Caddyfile) with image-
bundled host artifacts. Same-tag delivery for everything the host runs,
version-pinned by AGNES_TAG, atomically rolled back by reverting the image.

## Motivation

The customer-instance startup template was curling 6 files from
raw.githubusercontent.com on every VM boot:

  docker-compose.yml
  docker-compose.prod.yml
  docker-compose.host-mount.yml
  docker-compose.tls.yml
  Caddyfile
  scripts/ops/agnes-auto-upgrade.sh   (added in 0.25.0)

Every one of them already lives inside the image (`COPY . .` copies the
whole repo to /app/). Curling them from the public internet duplicates
content the image already carries and introduces three problems:

1. **Split-brain version pinning.** image_tag pins the docker image to an
   immutable digest. The compose files + script bypassed that pinning by
   tracking `main` (or the rarely-set compose_ref). A customer pinned to
   stable-2026.04.516 could wake up tomorrow with their host artifacts
   floating on whatever shipped to main overnight — even though they're
   explicitly pinned for stability.

2. **No rollback knob.** Reverting a bad host artifact meant reverting
   the upstream PR globally, which affects every customer that reboots
   after the bad commit. There was no "rollback for me only" path;
   tag-pinning gave no protection.

3. **Public-internet dependency on every boot.** The image is already
   pulled from a private registry on the same boot. Reusing that channel
   is strictly cheaper than adding a second one. Customers with restricted
   egress (no raw.githubusercontent.com reachability) silently broke on
   every boot.

## Changes

### Dockerfile (+19 -8)

After `COPY . .` and before the wheel build, an explicit `cp` lifts every
host-side artifact into a stable contract path /opt/agnes-host/:

  agnes-auto-upgrade.sh                  (mode 0755 — host cron driver)
  docker-compose{,.prod,.host-mount,.tls}.yml
  Caddyfile                              (mode 0644)

Why a copy instead of pointing at /app directly: /app is owned by uid 999
(USER agnes), whereas /opt/agnes-host is root-owned, mode 0755 across the
board, and a stable path that won't shift if the /app layout is refactored.

### infra/modules/customer-instance/startup-script.sh.tpl (+22 -36)

Replaced six curls and the standalone agnes-auto-upgrade.sh extract block
(introduced earlier in this PR) with one extract sequence in section 3:

    docker pull "$${IMAGE_REPO}:$${IMAGE_TAG}"
    EXTRACT_CONTAINER=$(docker create "$${IMAGE_REPO}:$${IMAGE_TAG}")
    trap "docker rm '$EXTRACT_CONTAINER' >/dev/null 2>&1 || true" EXIT
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/." "$APP_DIR/"
    docker cp "$EXTRACT_CONTAINER:/opt/agnes-host/agnes-auto-upgrade.sh" /usr/local/bin/agnes-auto-upgrade.sh
    chmod +x /usr/local/bin/agnes-auto-upgrade.sh

The auto-upgrade section (#6) is now a no-op — script is already in place.

### infra/modules/customer-instance/variables.tf (+1 -1)

`compose_ref` marked DEPRECATED in description. Default unchanged for
one release cycle to avoid breaking existing terraform plans. Will be
removed in a future major bump.
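
For callers of the module, the migration this implies might look like the
sketch below. It is not taken from the PR: the module label `agnes_acme`,
the relative `source` path, the tag name `v0.26.0`, and all input values
are hypothetical placeholders. Only the variable names come from
`variables.tf`.

```hcl
# Hypothetical caller of the customer-instance module. Names and values
# are illustrative assumptions, not part of the PR.
module "agnes_acme" {
  source = "./infra/modules/customer-instance" # assumed relative path

  gcp_project_id   = "acme-analytics" # placeholder project ID
  customer_name    = "acme"
  seed_admin_email = "admin@example.com"

  prod_instance = {
    name   = "acme-prod"
    domain = "agnes.example.com" # placeholder domain

    # Assumed tag name. Any tag whose image contains /opt/agnes-host/
    # works; the compose files, Caddyfile and upgrade script are
    # extracted from this same tag.
    image_tag = "v0.26.0"
  }

  # compose_ref = "main" # deprecated and no longer used; can be dropped
}
```

Because `compose_ref` keeps its default for one release cycle, callers that
still set it should continue to plan cleanly; removing the line is optional
until the variable itself is removed.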

### CHANGELOG.md

`### Changed` entry under [Unreleased] — supersedes the narrower entry
this PR previously had (which only covered the script).

## Out of scope (filed as follow-ups)

1. **agnes-the-ai-analyst-infra/startup.sh (operator deploy)** still
   curls the same artifacts from main. Symmetric fix needed there.
   Will file as a separate PR against the infra repo.

2. **Self-update inside agnes-auto-upgrade.sh** after a successful
   `docker compose pull` of a new digest. Without it, the running cron
   keeps using the OLD baked-in script for one tick after an image upgrade.
   ~10 LOC. Deferred to keep this PR scoped.

3. **scripts/ops/agnes-tls-rotate.sh** has the same shape — host-side
   bash currently sourced via the infra repo. Should follow the same
   bake-into-image pattern.

## Tested

- Local: `docker build .` succeeds with the new RUN block.
- `docker create` + `docker cp /opt/agnes-host/.` round-trips all 6
  artifacts; sha matches each source file.
- Not yet tested on a live VM bring-up — that requires a CI image with
  this Dockerfile change. **Recommend reviewer trigger CI build, then
  do a single VM-recreate against a dev VM (e.g. foundryai-development)
  to confirm the extract path works end-to-end before merge.**

## Compatibility

- Existing VMs running 0.25.0 are unaffected — they have host artifacts
  in place from `curl from main` already; this PR doesn't touch them.
  They pick up the new pattern only on next VM recreate.
- VMs pinned to an image_tag *older* than this PR (no /opt/agnes-host
  in the image) would FAIL the docker cp. Current diff fails-loud (no
  fallback). Recommend operators upgrade to a fresh-enough image_tag
  alongside the template upgrade — same coupling as any compose-flag bump.
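
One way to surface the older-tag failure at plan time rather than at first
boot is a `validation` block on the variable. The following is only a
sketch and is not part of this PR. It assumes release tags have the shape
`vMAJOR.MINOR.PATCH`, and it deliberately lets every other tag (`stable`,
`dev`, date-stamped tags) through unchecked. The object type is trimmed to
the two relevant attributes.

```hcl
variable "prod_instance" {
  type = object({
    name      = string
    image_tag = optional(string, "stable")
  })

  validation {
    # Tags shaped like vMAJOR.MINOR.PATCH must be >= v0.26.0.
    # For any other tag regex() raises an error and try() falls back to
    # true, so floating tags are not checked at all (assumption: they
    # track a recent enough image).
    condition = try(
      tonumber(regex("^v(\\d+)\\.(\\d+)\\.\\d+$", var.prod_instance.image_tag)[0]) > 0 ||
      tonumber(regex("^v(\\d+)\\.(\\d+)\\.\\d+$", var.prod_instance.image_tag)[1]) >= 26,
      true
    )
    error_message = "The image_tag must be v0.26.0 or newer: older images lack /opt/agnes-host/."
  }
}
```

With this in place a pin such as `v0.25.0` would be rejected by
`terraform plan`, while `v0.26.0`, `v1.0.0` and `stable` would pass. The
fail-loud `docker cp` in the startup script remains the real guard for
tags the pattern does not recognise.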

* docs(infra): document image_tag >= v0.26.0 minimum on prod/dev_instances

The new startup script extracts host artifacts from /opt/agnes-host/
inside the image — a directory added in this PR (will ship as v0.26.0).
Pinning image_tag to an older tag would fail-loud at first boot with
'docker cp: No such file or directory'. Existing VMs are unaffected
because the module ignores metadata_startup_script changes.

Devin ANALYSIS_0004 on PR #149.

* fix(changelog): mark BREAKING + drop private-repo reference

Per CLAUDE.md, breaking changes start with **BREAKING** so operators
can grep before bumping the pin. The image_tag minimum constraint
introduced here qualifies — older tags fail-loud at first boot.

Also drop the explicit 'agnes-the-ai-analyst-infra' name from the
entry; the OSS distribution shouldn't reference operator-side
deploy templates by their private-repo names. Generic 'consumer-
side deploy templates' wording instead.

Devin BUG_0001 + WARN_0001 on PR #149.

* chore(release): cut 0.26.0

---------

Co-authored-by: ZdenekSrotyr <zdenek.srotyr@keboola.com>
2026-04-30 21:40:25 +02:00

142 lines
4.9 KiB
HCL

variable "gcp_project_id" {
  description = "GCP project ID where the instance will be deployed."
  type        = string
}

variable "region" {
  description = "GCP region"
  type        = string
  default     = "europe-west1"
}

variable "zone" {
  description = "GCP zone"
  type        = string
  default     = "europe-west1-b"
}

variable "customer_name" {
  description = "Short customer identifier (e.g. acme, example). Used as a prefix for created resources."
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]{1,20}$", var.customer_name))
    error_message = "customer_name must be lowercase, start with a letter, 2-21 chars."
  }
}

variable "prod_instance" {
  description = <<-EOT
    Production VM configuration.
    `image_tag` MUST point to an image that contains `/opt/agnes-host/`
    (this directory was added in v0.26.0). Older tags will fail at first
    boot with `docker cp: No such file or directory` because the startup
    script extracts host artifacts from the image instead of curling
    them. Existing VMs are unaffected by this constraint: the module
    sets `lifecycle { ignore_changes = [metadata_startup_script] }` so
    the new script only runs on freshly-created VMs.
  EOT
  type = object({
    name         = string
    machine_type = optional(string, "e2-small")
    disk_size_gb = optional(number, 30)
    data_disk_gb = optional(number, 50)
    image_tag    = optional(string, "stable")
    upgrade_mode = optional(string, "auto")
    tls_mode     = optional(string, "caddy")
    domain       = optional(string, "")
  })
}

variable "dev_instances" {
  description = <<-EOT
    List of dev VMs. Empty list = no dev VMs.
    tls_mode + domain are optional and default to plain HTTP on :8000. Set
    tls_mode = "caddy" + domain to enable Caddy + Let's Encrypt (or whatever
    CADDY_TLS env var is configured to in the Caddyfile; see Caddyfile docs).
    Same `image_tag >= v0.26.0` constraint as `prod_instance`: older tags
    lack `/opt/agnes-host/` and the startup `docker cp` fails loudly.
  EOT
  type = list(object({
    name         = string
    machine_type = optional(string, "e2-small")
    image_tag    = optional(string, "dev")
    tls_mode     = optional(string, "none")
    domain       = optional(string, "")
  }))
  default = []
}

variable "seed_admin_email" {
  description = "Email of the initial admin user."
  type        = string
}

variable "enable_seed_password" {
  description = "If true, the seed admin user immediately gets a password_hash from seed_admin_password (dev helper). Keep false in prod; the admin sets a password via /auth/bootstrap or Google OAuth."
  type        = bool
  default     = false
}

variable "seed_admin_password" {
  description = "Plain-text password for the seed admin. Only used when enable_seed_password=true. WARNING: stored in Terraform state."
  type        = string
  default     = ""
  sensitive   = true
}

variable "data_source" {
  description = "Data source type: keboola | bigquery | csv."
  type        = string
  default     = "keboola"
}

variable "keboola_stack_url" {
  description = "Keboola Stack URL (used when data_source = keboola)."
  type        = string
  default     = ""
}

variable "image_repo" {
  description = "Docker image repo"
  type        = string
  default     = "ghcr.io/keboola/agnes-the-ai-analyst"
}

variable "compose_ref" {
  description = "DEPRECATED: no longer used. Compose files now ship inside the docker image at /opt/agnes-host/ and are extracted via `docker cp` from the same `image_tag` the operator pinned. Pin `image_tag` instead. Variable retained for one release cycle to avoid breaking existing terraform plans; will be removed in a future major bump."
  type        = string
  default     = "main"
}

variable "enable_monitoring" {
  description = "Create uptime checks + alert policies for each VM. Requires notification_channel_ids to be useful."
  type        = bool
  default     = true
}

variable "notification_channel_ids" {
  description = "Full resource IDs of GCP Monitoring notification channels (create in customer project via gcloud alpha monitoring channels create). Empty list = alerts fire but nothing is notified."
  type        = list(string)
  default     = []
}

variable "runtime_secrets" {
  description = "Names of existing Secret Manager secrets the VM needs to read at runtime (e.g. Keboola Storage token). VM SA gets scoped secretAccessor on each."
  type        = list(string)
  default     = ["keboola-storage-token"]
}

variable "firewall_ssh_source_ranges" {
  description = "CIDR ranges allowed to reach SSH (port 22). Default is IAP tunnel range only (use `gcloud compute ssh --tunnel-through-iap`). Override to `[\"0.0.0.0/0\"]` for unrestricted (not recommended)."
  type        = list(string)
  default     = ["35.235.240.0/20"]
}

variable "acme_email" {
  description = "Email for Let's Encrypt account (used when tls_mode=caddy). Defaults to seed_admin_email if empty."
  type        = string
  default     = ""
}
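
The `data_source` description in the file above lists three accepted values
but the variable itself accepts any string. A possible tightening, shown
here only as a sketch and not as something the module currently does, is a
`validation` block in the same style as the one on `customer_name`. The
error message wording is an assumption.

```hcl
variable "data_source" {
  description = "Data source type: keboola | bigquery | csv."
  type        = string
  default     = "keboola"

  validation {
    # Reject anything outside the three values named in the description.
    condition     = contains(["keboola", "bigquery", "csv"], var.data_source)
    error_message = "The data_source must be one of: keboola, bigquery, csv."
  }
}
```

A typo such as `data_source = "bigqurey"` would then fail at
`terraform plan` instead of reaching the VM.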