Vojtech 0bbbf3e40b feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)		2026-04-25 19:51:25 +00:00

Replaces the implicit Let's Encrypt flow with a general corporate-CA HTTPS path:

- Caddy switches to cert-file mode (`tls /certs/fullchain.pem /certs/privkey.pem`) with HSTS + TLS 1.2/1.3 floor
- New `docker-compose.tls.yml` overlay closes host `:8000` when Caddy fronts (no TLS bypass)
- New `scripts/tls-fetch.sh`: generic URL fetcher for `sm://`, `gs://`, `https://`, `file://` with redirect refusal + PEM validation
- New `scripts/grpn/agnes-tls-rotate.sh`: daily rotation, self-signed fallback against same key (zero key churn), on-VM RSA-2048 + CSR auto-gen, atomic swap, SIGUSR1 reload
- `scripts/grpn/agnes-auto-upgrade.sh` becomes cert-aware (auto-enables tls overlay when certs present)
- Compose profile `production` renamed to `tls` (aligns with DEPLOYMENT.md and infra startup)

Pairs with FoundryAI/agnes-the-ai-analyst-infra#27 (merged) which wires per-VM `local.vm_tls`, writes `TLS_` env vars into `.env`, auto-creates Secret Manager containers for `sm://` privkey URLs, and installs `agnes-tls-rotate.{service,timer}` for daily polling.

Includes hardening + docs follow-ups from code review:

- `TLS_CSR_SUBJECT` env-var parametrisation applied to both CSR and self-signed cert paths
- curl `--max-redirs 0 --proto '=https'` + post-fetch PEM validation in `tls-fetch.sh`
- `ulimit -c 0` + array-form `COMPOSE_FILES` (POSIX-safe, bash 3.2 compatible)
- TLS section added to `config/.env.template`
- Historical-note headers in `docs/superpowers/{plans,specs}/2026-04-09-.md` flagging the profile rename
agnes-auto-upgrade.sh	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
agnes-tls-rotate.sh	feat(tls): corporate-CA HTTPS with URL-driven rotation, on-VM CSR gen, self-signed fallback (#51)	2026-04-25 19:51:25 +00:00
Makefile	User management + PAT + CLI distribution + HTML auth redirect (#9 #10 #11 #12) (#28)	2026-04-22 14:24:28 +02:00
README.md	User management + PAT + CLI distribution + HTML auth redirect (#9 #10 #11 #12) (#28)	2026-04-22 14:24:28 +02:00

README.md

Manual deploy helper — Agnes on an existing VM (GRPN pattern)

A make-based helper for deploying and operating Agnes on an existing GCE VM when the full Terraform flow is blocked — typically by organization policies that forbid SA JSON key creation or by missing IAM delegation. This is the pattern we used on GRPN's foundryai-development during the 2026-04-22 hackathon.

It is not a replacement for the full Terraform module — only a stopgap while the proper flow is being unblocked. See Migration path below.

When to use this

Use this helper when all of the following are true:

- A target VM already exists in the customer's GCP project (we don't create it).
- You (or the deploy SA) do not have roles/resourcemanager.projectIamAdmin on that project, or the org enforces constraints/iam.disableServiceAccountKeyCreation.
- The customer is OK with a single-VM, single-node Agnes (no prod + dev split for now).
- Data persistence on the VM's boot disk is acceptable (no persistent disk is attached, so data is lost on VM recreate).

If any of these is false, take the Terraform route via docs/HACKATHON.md Part 1.

What it does (and doesn't)

| Aspect | Manual helper (this) | Full Terraform flow |
| --- | --- | --- |
| VM provisioning | Reuses existing VM | Creates a dedicated `agnes-prod` + optional `agnes-dev` VMs |
| Docker install | Inline `curl get.docker.com \| sh` on first deploy | Part of the module's startup script |
| Secrets | Plain `.env` on VM (`chmod 600`) | GCP Secret Manager, read by VM SA |
| Service account | Uses the VM's existing SA, whatever that is | Dedicated `agnes-<customer>-vm` with scoped `secretmanager.secretAccessor` only |
| Data persistence | Boot disk, ephemeral across VM recreate | Separate persistent disk (`/data` bind-mount), daily snapshot + 30-day retention |
| Auto-upgrade | `install-cron` target deploys the same cron script the module uses | Built into the startup script |
| Monitoring / alerts | None | Uptime check + alert policy per VM |
| Backup | None | Daily snapshot schedule |
| Branch-aware dev VMs | Not supported (single VM) | `dev_instances` list, one VM per branch/engineer |
| CI/CD | None (manual `make deploy`) | GitHub Actions: PR → plan → apply (dev auto, prod gated) |

The helper covers the runtime aspects (pull image, restart, logs, access) but skips the infra-as-code posture.

One-time setup

These steps were done for GRPN during the 2026-04-22 hackathon. They also serve as a reusable template for any future customer in a similarly constrained environment:

1. Verify access to the VM

gcloud compute ssh $VM --zone=$ZONE --project=$PROJECT --command='whoami'

If this works, you have SSH access via OS Login or your own key. gcloud falls back to an IAP tunnel automatically if the VM has no external IP, so no further auth setup is needed.
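The gcloud commands in steps 1 to 4 reference `$VM`, `$ZONE` and `$PROJECT`, which need to exist in your local shell before you start. A minimal sketch, assuming you target the same VM as the Makefile defaults listed under Configuration (substitute your own values for another customer):

```shell
# Target VM coordinates used by the setup commands in this section.
# These values are the Makefile defaults from the Configuration table.
PROJECT=prj-grp-foundryai-dev-7c37
ZONE=us-central1-a
VM=foundryai-development
```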

2. Install Docker + compose plugin

gcloud compute ssh $VM --zone=$ZONE --project=$PROJECT --command="
  curl -fsSL https://get.docker.com | sudo sh
  sudo apt-get install -y -qq docker-compose-plugin
"

3. Prepare app directory and data root

gcloud compute ssh $VM --zone=$ZONE --project=$PROJECT --command="
  sudo mkdir -p /opt/agnes /data/state /data/analytics /data/extracts
  sudo chown -R \$USER:\$USER /opt/agnes
  cd /opt/agnes
  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.yml -o docker-compose.yml
  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.prod.yml -o docker-compose.prod.yml
  curl -fsSL https://raw.githubusercontent.com/keboola/agnes-the-ai-analyst/main/docker-compose.host-mount.yml -o docker-compose.host-mount.yml
"

4. Write `.env` (plain, chmod 600)

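# Generate a random 256-bit signing key for JWTs
JWT=$(openssl rand -hex 32)
# Write the env file locally. DATA_SOURCE may be csv, bigquery or keboola;
# the note lives here because not every .env parser accepts inline comments.
cat > /tmp/agnes-env <<EOF
JWT_SECRET_KEY=$JWT
DATA_DIR=/data
DATA_SOURCE=csv
SEED_ADMIN_EMAIL=<your@email>
LOG_LEVEL=info
AGNES_TAG=stable
EOF
# Copy the file to the VM, install it as /opt/agnes/.env with mode 600,
# then remove the temporary copies on both sides
gcloud compute scp /tmp/agnes-env $VM:/tmp/.env --zone=$ZONE --project=$PROJECT
gcloud compute ssh $VM --zone=$ZONE --project=$PROJECT --command="
  sudo install -m 600 -o \$USER -g \$USER /tmp/.env /opt/agnes/.env
  rm /tmp/.env
"
rm /tmp/agnes-env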

If DATA_SOURCE=keboola, add KEBOOLA_STORAGE_TOKEN=... and KEBOOLA_STACK_URL=... lines. The same goes for any BigQuery or custom data source credentials: they all live in this one .env.
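Because every credential ends up in one file, a quick check before copying it to the VM can catch a forgotten line. The sketch below is an illustration only: `check_env` is a hypothetical helper, and the list of required keys is an assumption drawn from the template above, not from the application's actual validation.

```shell
# Hypothetical pre-upload check for the .env file written in step 4.
# Returns non-zero if any expected key is missing or has an empty value.
check_env() (
  file=$1
  # Keys taken from the template in this section (assumed to be required).
  required="JWT_SECRET_KEY DATA_DIR DATA_SOURCE SEED_ADMIN_EMAIL"
  # Keboola deployments need the two extra credentials mentioned above.
  if grep -q '^DATA_SOURCE=keboola' "$file"; then
    required="$required KEBOOLA_STORAGE_TOKEN KEBOOLA_STACK_URL"
  fi
  status=0
  for key in $required; do
    # "KEY=" followed by at least one character counts as set.
    if ! grep -q "^${key}=." "$file"; then
      echo "missing or empty: $key" >&2
      status=1
    fi
  done
  exit "$status"
)
```

Running `check_env /tmp/agnes-env` before the `gcloud compute scp` step would stop the upload early if, say, DATA_SOURCE=keboola was set without its token.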

5. First boot

make deploy
make bootstrap-admin PASSWORD=<strong-initial>

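deploy pulls the image and starts the containers. bootstrap-admin calls /auth/bootstrap to activate the seed admin. Run both from scripts/grpn, or from the repo root with make -C scripts/grpn as shown under Everyday operations.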
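What deploy does on the VM can be approximated as follows. This is a sketch rather than the Makefile's actual recipe: it assumes the three compose files downloaded in step 3 are all layered with `-f`, and the `compose` wrapper and `DRY_RUN` switch are hypothetical names introduced here.

```shell
# Hypothetical wrapper that layers the compose files fetched in step 3.
APP_DIR=${APP_DIR:-/opt/agnes}

compose() {
  # With DRY_RUN set, print the command instead of running it.
  ${DRY_RUN:+echo} docker compose \
    -f "$APP_DIR/docker-compose.yml" \
    -f "$APP_DIR/docker-compose.prod.yml" \
    -f "$APP_DIR/docker-compose.host-mount.yml" \
    "$@"
}
```

On the VM, `compose pull && compose up -d` would then fetch the tagged image and recreate the containers, while `DRY_RUN=1 compose pull` only prints the resulting command line.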

6. (Optional) Auto-upgrade

make install-cron

Installs the same 5-minute polling cron used by the Terraform module. After this, every new :stable image digest is picked up within ~5 min without any human action.
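The polling idea can be sketched as below. This is not the content of agnes-auto-upgrade.sh; it is a simplified illustration that compares the local image ID before and after a pull, and the function names are hypothetical. The image name and tag come from the Configuration table and the `.env` template.

```shell
IMAGE=${IMAGE:-ghcr.io/keboola/agnes-the-ai-analyst}
TAG=${AGNES_TAG:-stable}

# Print the local image ID for a reference, or nothing if it is not pulled yet.
image_id() {
  docker image inspect --format '{{.Id}}' "$1" 2>/dev/null || true
}

# Pull the tag and recreate containers only when the image ID changed.
upgrade_if_changed() (
  ref="$IMAGE:$TAG"
  before=$(image_id "$ref")
  docker pull --quiet "$ref" >/dev/null || exit 1
  after=$(image_id "$ref")
  if [ "$before" = "$after" ]; then
    echo "up to date: $ref"
  else
    echo "new image for $ref, recreating containers"
    # A real script would pass the same -f overlays that deploy uses.
    cd "${APP_DIR:-/opt/agnes}" && docker compose up -d
  fi
)
```

A cron entry that calls such a function every five minutes gives the "picked up within ~5 min" behavior described above, and an unchanged tag costs only one registry round trip.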

Everyday operations

From the repo root (tested defaults target GRPN's foundryai-development):

make -C scripts/grpn help           # list all targets
make -C scripts/grpn status         # is it up?
make -C scripts/grpn version        # what's deployed right now
make -C scripts/grpn logs           # tail app logs
make -C scripts/grpn deploy         # pull :stable + recreate
make -C scripts/grpn tunnel         # IAP tunnel → http://localhost:8000
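For readers curious what the tunnel target amounts to, one plausible equivalent is `gcloud compute start-iap-tunnel`. This is an assumption about the approach, since the Makefile may use SSH port forwarding instead; the `tunnel` function and `DRY_RUN` switch are hypothetical, while the port variables mirror LOCAL_PORT and VM_PORT from the Configuration table.

```shell
# Hypothetical stand-in for the tunnel target: forward a local port to the
# app port on the VM through IAP. VM, ZONE and PROJECT must be set.
tunnel() {
  # With DRY_RUN set, print the command instead of opening the tunnel.
  ${DRY_RUN:+echo} gcloud compute start-iap-tunnel \
    "${VM:?set VM}" "${VM_PORT:-8000}" \
    --local-host-port="localhost:${LOCAL_PORT:-8000}" \
    --zone="${ZONE:?set ZONE}" --project="${PROJECT:?set PROJECT}"
}
```

With the tunnel open, the app would be reachable at http://localhost:8000 until the command is interrupted.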

Configuration

All targets read overridable variables at the top of the Makefile. Defaults target GRPN's foundryai-development. For other VMs/projects:

# one-off override
make -C scripts/grpn status \
    PROJECT=other-project \
    ZONE=us-central1-a \
    VM=other-vm

# or fork this Makefile into `scripts/<customer>/Makefile` with different defaults

| Variable | Default | Purpose |
| --- | --- | --- |
| `PROJECT` | `prj-grp-foundryai-dev-7c37` | GCP project ID |
| `ZONE` | `us-central1-a` | VM zone |
| `VM` | `foundryai-development` | Instance name |
| `APP_DIR` | `/opt/agnes` | Where compose files + `.env` live on the VM |
| `LOCAL_PORT` | `8000` | Local port for `tunnel` target |
| `VM_PORT` | `8000` | Port the app listens on inside the VM |
| `IMAGE` | `ghcr.io/keboola/agnes-the-ai-analyst` | GHCR image repo |
| `ADMIN_EMAIL` | `e_zsrotyr@groupon.com` | Default bootstrap email |

Files

scripts/grpn/
├── Makefile                 # the helper itself
├── agnes-auto-upgrade.sh    # deployed by `make install-cron` to /usr/local/bin/
└── README.md                # this file

Plus the deploy log: docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md — lists all the org-policy constraints encountered and their workarounds.

Migration path

Once the blockers are lifted, move to the proper Terraform flow:

1. Get roles/resourcemanager.projectIamAdmin on the customer project (ask the GRPN admin to grant it).
2. Create a WIF pool + provider in the customer project. This doesn't require SA JSON keys, so iam.disableServiceAccountKeyCreation is no longer a blocker. A draft patch to bootstrap-gcp.sh is pending; track it via the GitHub issue tagged wif.
3. Migrate: run the new bootstrap-gcp.sh --wif, create a private infra repo from keboola/agnes-infra-template, then run terraform apply. This creates a new Agnes VM alongside the existing foundryai-development.
4. Optional: move data from the manual VM to the TF VM with a tar snapshot through GCS (see the original migration in docs/superpowers/plans/2026-04-21-deployment-log.md "Data migration" section).
5. Decommission the manual deploy: run make stop, then delete /opt/agnes/ on the VM.
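The optional data move can be pictured as an archive and restore pair. The sketch below is not the procedure from the deployment log; `snapshot_data` and `restore_data` are hypothetical helpers, and the GCS transfer between them (for example with `gcloud storage cp` to a bucket of your choosing) is left out because the bucket is not specified here.

```shell
# Archive a data directory into a gzip-compressed tarball.
# $1 = source directory (e.g. /data), $2 = output tarball path.
# Write the tarball outside the source directory so it does not include itself.
snapshot_data() {
  tar -czf "$2" -C "$1" .
}

# Unpack a tarball produced by snapshot_data into a target directory.
# $1 = tarball path, $2 = target directory (created if missing).
restore_data() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

Stopping the containers on the manual VM before taking the snapshot avoids archiving files that are mid-write; on the VM itself both helpers would likely need sudo, since /data is created as root in step 3.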

Caveats

- Single VM, single point of failure. No dev/prod split.
- No automatic backups. If someone deletes the VM, data is gone (30-day boot-disk retention from GCP default only).
- Plain-text secrets in .env. Acceptable for IAP-only internal VM; not acceptable if the VM ever gets an external IP.
- No drift detection. Anyone with SSH can hand-edit .env or compose files without leaving an audit trail. The Terraform flow's ignore_changes + -replace pattern is the correct version of this.