* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed two concerns: (1) generic ops scripts referenced from mainline OSS infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's hackathon manual-deploy helper with hardcoded GCP project IDs, VM names, and admin emails. This commit splits them by concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md, docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any consumer infra repo that installs these scripts via path-based systemd timers.

This is the first wave of #88. The remaining leaks (test data with prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test fixtures, docstrings in connectors/openmetadata/enricher.py) will be a separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and many leaks survived. This commit closes wave 1 properly AND folds in all wave-2 anonymization in a single pass, which is easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten. The original wording implied an in-repo migration for infra/modules/customer-instance/, which is wrong (the TF module embeds the script inline via heredoc, never sourced from scripts/grpn/). Now flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn` example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py, src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… → my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent, FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements; 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment → generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py: kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*: hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>; GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted: hackathon-era helper hardcoded to a specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data` returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 5 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on 8 more variables. Previous review only flagged line 19; this round audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71, 78, 84 to English. Same review concern: a Terraform module that is the customer-facing API surface in Czech is unfit for OSS distribution.
2. infra/modules/customer-instance/outputs.tf had Czech descriptions on four outputs. Same fix.
3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206) in two places. Replaced with generic 'tracked upstream in the auth-CLI repo' per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).
4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment. Translated.
5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent' but the actual code uses both MyAgent (in docstrings) and Example (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh, Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across .tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits. 157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed (grep was scoped to extensions that did not include .template / .example suffixes; the audit was right, the previous grep was not paranoid enough), plus one trivial wording slip:

1. config/instance.yaml.example:114 had an '(optional - Groupon-specific)' brand leak in a shipping config example. Replaced with '(optional)'.
2. config/.env.template:68 had a stale path 'scripts/grpn/agnes-tls-rotate.sh' in an operator-facing env-template comment. The script lives at scripts/ops/ now (commit 16a85cc); this comment had been pointing operators at a non-existent path.
3. docs/padak-security.md:188 had the phrase duplication 'tracked in tracked upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/Caddyfile/.toml/.template/.example/.env* with the full token set (groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding CHANGELOG.md historical entries.

* fix(oss): #94 round-4: QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted scripts/switch-dev-vm.sh. A Quickstart user following step-by-step would hit a missing-file error at the final step. Replaced with the inline gcloud-ssh equivalent that the Removed bullet documents.
2. docs/padak-security.md filename retains the personal identifier 'padak'. The PR fixed the body content (replaced padak/keboola_agent_cli#206 references with generic wording) but missed the filename. Renamed to docs/security-audit-2026-04.md (date-anchored, vendor-neutral). Updated the historical CHANGELOG link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:

1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded personal-email default (zdenek.srotyr@keboola.com). Replaced with ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set, which is safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy prod/dev IPs) that the round-1 IP-replace pattern missed (it only targeted 34.77.x.x). Generic regex redaction across all five planning docs replaces every public IP with <redacted-ip>, preserving private/loopback/IAP ranges.
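
The review rounds above kept finding leaks because each grep was filtered by file extension. A minimal sketch of an audit that avoids that failure mode follows. The token list is the one quoted in the round-3 commit; the function name `audit_leaks`, the case-insensitive match, and the `.git` exclusion are assumptions made for illustration, not the command the author actually ran.

```shell
#!/bin/bash
# Hypothetical leak audit. Token set copied from the commit message;
# everything else (names, flags chosen) is an illustrative assumption.
LEAK_PATTERN='groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|kids-ai-data|padak/keboola_agent_cli'

audit_leaks() {
  local root="$1" rc=0
  # -r recurse, -I skip binary files, -n print line numbers,
  # -i ignore case, -E extended regex. There is deliberately no
  # --include filter, so extensionless files (Caddyfile) and odd
  # suffixes (.template, .example) are scanned like everything else.
  grep -rIniE --exclude=CHANGELOG.md --exclude-dir=.git \
    -- "$LEAK_PATTERN" "$root" || rc=$?
  case "$rc" in
    0) return 1 ;;  # grep matched: leaks found (hits were printed)
    1) return 0 ;;  # grep matched nothing: tree is clean
    *) return 2 ;;  # grep itself failed (unreadable path, bad pattern)
  esac
}

# Usage (assumed): run from the repository root.
# audit_leaks . || echo "vendor-specific tokens still present" >&2
```

Scanning every text file and excluding only the known historical file inverts the earlier approach: a new file type is covered by default instead of being missed until a reviewer notices it.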
185 lines
8.2 KiB
Bash
Executable file
#!/bin/bash
# Deployed to /usr/local/bin/agnes-tls-rotate.sh on the VM by the infra
# repo startup.sh. A systemd timer fires it daily.
#
# Corp security rotates certs at stable URLs (TLS_FULLCHAIN_URL,
# TLS_PRIVKEY_URL in /opt/agnes/.env). This script refetches, compares
# bytes via cmp, atomically replaces changed files, and sends SIGUSR1 to
# caddy for a zero-downtime reload. No-op when cert has not moved.
#
# TLS_PRIVKEY_URL is optional: leave empty when the key is provisioned
# once per VM (e.g. from Secret Manager at boot) and reused across
# cert rotations.
#
# Self-signed fallback: when TLS_FULLCHAIN_URL returns no data (security
# dept hasn't published the real cert yet) AND no fullchain.pem exists
# on disk, generate a 30-day self-signed cert against the same privkey.
# Because Security signs the eventual real cert against the CSR
# produced from this same key, the key never changes: the rotate tick
# after publication just swaps the fullchain file, SIGUSR1-reloads
# Caddy, and clients start seeing the real chain with zero downtime.
# Browsers see a self-signed warning in the meantime. That is acceptable
# for the bring-up window, and the only way to get Caddy up before the
# real cert exists without splitting into two code paths.
set -euo pipefail
# Disable core dumps for this script. openssl runs with the unencrypted
# privkey in process memory; a SIGSEGV core file would leak it to whoever
# can read /var/lib/systemd/coredump (typically root + adm group). Cheap
# defence in depth: this script is short-lived and has no debug needs.
ulimit -c 0

cd /opt/agnes
# shellcheck disable=SC1091
set -a; . /opt/agnes/.env; set +a

[ -n "${TLS_FULLCHAIN_URL:-}" ] || { echo "TLS_FULLCHAIN_URL empty — nothing to rotate"; exit 0; }

CERT_DIR=/data/state/certs
mkdir -p "$CERT_DIR"
chmod 700 "$CERT_DIR"

CHANGED=0
TMP=$(mktemp); trap 'rm -f "$TMP"' EXIT

refetch() {
  local url="$1" dest="$2" mode="$3" kind="$4"
  # IMPORTANT: tls-fetch.sh may fail (404, empty body, auth error,
  # invalid PEM, redirect attempt). When the caller sits behind
  # `if ! refetch`, bash disables `set -e` for everything inside the
  # condition, so without an explicit exit-code check we would fall
  # through to `install` and overwrite $dest with whatever stale bytes
  # the PREVIOUS refetch call left in $TMP. That turned the "fullchain
  # unavailable → fall back to self-signed" branch into a "fullchain
  # file filled with privkey bytes" bug. Check explicitly and return 1
  # on any fetch failure so the caller's fallback branch fires cleanly.
  if ! /usr/local/bin/tls-fetch.sh "$url" "$TMP" "$mode" "$kind"; then
    return 1
  fi
  if [ ! -f "$dest" ] || ! cmp -s "$TMP" "$dest"; then
    install -m "$mode" "$TMP" "$dest"
    echo "$(date -Is) rotated $(basename "$dest")"
    CHANGED=1
  fi
}

# Private key handling.
#
# Three modes (decided per-VM in the infra repo's local.vm_tls):
#
#   1. TLS_PRIVKEY_URL set (sm://, gs://, https://, file://): fetch it
#      every rotate tick. Used by VMs that keep the key in Secret
#      Manager or similar for VM-replace resilience (legacy pattern).
#
#   2. TLS_PRIVKEY_URL empty AND $CERT_DIR/privkey.pem already on disk:
#      reuse the on-disk key, never fetch. The file survives the VM
#      for the lifetime of /data's persistence.
#
#   3. TLS_PRIVKEY_URL empty AND no on-disk key: generate an RSA-2048
#      key + a CSR against $DOMAIN in place. This is the "fresh VM"
#      bring-up path: the key never leaves the VM, and the CSR is
#      written to $CERT_DIR/cert.csr for the operator to grab via
#      `gcloud compute ssh … sudo cat /data/state/certs/cert.csr` and
#      attach to the SECURITY Jira that requests public-cert signing.
#      Until Security publishes the real fullchain, the self-signed
#      fallback below keeps Caddy serving HTTPS against this same key.
if [ -n "${TLS_PRIVKEY_URL:-}" ]; then
  if ! refetch "$TLS_PRIVKEY_URL" "$CERT_DIR/privkey.pem" 600 key; then
    if [ ! -s "$CERT_DIR/privkey.pem" ]; then
      echo "ERROR: privkey fetch failed and no cached copy exists — aborting" >&2
      exit 1
    fi
    echo "$(date -Is) privkey fetch failed; keeping cached $CERT_DIR/privkey.pem"
  fi
elif [ ! -s "$CERT_DIR/privkey.pem" ]; then
  CN="${DOMAIN:-localhost}"
  # Site-specific CSR subject (C/ST/L/O fields) comes from
  # TLS_CSR_SUBJECT in /opt/agnes/.env; the deployer's infra layer
  # writes it with its PKI conventions. This script stays generic;
  # default to a minimal /CN=<hostname> when the var is unset so the
  # CSR is still syntactically valid but carries no org metadata the
  # deployer didn't choose.
  SUBJECT="${TLS_CSR_SUBJECT:-/CN=$CN}"
  echo "$(date -Is) no privkey — generating RSA-2048 key + CSR (subject: $SUBJECT)"
  CSR_CONF=$(mktemp)
  cat > "$CSR_CONF" <<CFG
[ req ]
prompt = no
distinguished_name = req_distinguished_name
req_extensions = ext

[ req_distinguished_name ]
CN = $CN

[ ext ]
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @subject_alt_names

[ subject_alt_names ]
DNS.1 = $CN
CFG
  umask 077
  openssl req -newkey rsa:2048 \
    -keyout "$CERT_DIR/privkey.pem" \
    -out "$CERT_DIR/cert.csr" \
    -subj "$SUBJECT" \
    -config "$CSR_CONF" -extensions ext -nodes 2>/dev/null
  chmod 600 "$CERT_DIR/privkey.pem"
  chmod 644 "$CERT_DIR/cert.csr"
  rm -f "$CSR_CONF"
  echo "$(date -Is) privkey.pem + cert.csr written to $CERT_DIR"
  echo "$(date -Is) ACTION: send $CERT_DIR/cert.csr to your certificate authority for signing — the CSR is public and safe to transit; the key never leaves this VM."
fi

# Real cert fetch. On failure, fall back to self-signed IFF no
# fullchain exists yet. If one exists (prior real OR prior self-signed)
# keep it: a transient fetch failure should not churn certs.
if ! refetch "$TLS_FULLCHAIN_URL" "$CERT_DIR/fullchain.pem" 644 cert; then
  if [ ! -s "$CERT_DIR/fullchain.pem" ]; then
    echo "$(date -Is) real cert unavailable at $TLS_FULLCHAIN_URL — generating 30-day self-signed"
    if [ ! -s "$CERT_DIR/privkey.pem" ]; then
      echo "ERROR: no privkey available — cannot self-sign" >&2
      exit 1
    fi
    CN="${DOMAIN:-localhost}"
    # Same parametrisation as the CSR branch above: site-specific PKI
    # fields belong in the deployer's .env, not in this script. Keeps
    # the self-signed bring-up cert consistent with whatever the eventual
    # CA-signed cert will say.
    SUBJECT="${TLS_CSR_SUBJECT:-/CN=$CN}"
    openssl req -x509 -new -key "$CERT_DIR/privkey.pem" \
      -out "$CERT_DIR/fullchain.pem" -days 30 \
      -subj "$SUBJECT" \
      -addext "subjectAltName=DNS:$CN" \
      -addext "keyUsage=digitalSignature,keyEncipherment" \
      -addext "extendedKeyUsage=serverAuth" 2>/dev/null
    chmod 644 "$CERT_DIR/fullchain.pem"
    echo "$(date -Is) self-signed fullchain.pem installed (CN=$CN)"
    CHANGED=1
  else
    echo "$(date -Is) fetch failed but cached fullchain.pem exists — keeping it"
  fi
fi

if [ "$CHANGED" -eq 1 ]; then
  # Array form (vs. word-split string): quoted expansion is the
  # modern bash idiom for arg lists, defensive against future filename
  # weirdness. ps --status flag requires Compose v2.6.1+; if your VMs
  # are older, replace with `ps --format '{{.Service}} {{.State}}'`
  # and filter on the State column.
  COMPOSE_FILES=( -f docker-compose.yml -f docker-compose.prod.yml -f docker-compose.host-mount.yml -f docker-compose.tls.yml )
  if docker compose "${COMPOSE_FILES[@]}" --profile tls ps --status=running --format '{{.Service}}' 2>/dev/null | grep -q '^caddy$'; then
    # Caddy running: graceful reload via SIGUSR1 picks up the new
    # cert without dropping connections.
    docker compose "${COMPOSE_FILES[@]}" --profile tls kill -s SIGUSR1 caddy >/dev/null 2>&1 \
      && echo "$(date -Is) caddy reloaded" \
      || echo "$(date -Is) caddy reload signal failed"
  else
    # Caddy not running yet: first time certs land on this VM, or
    # operator hasn't brought up the tls profile yet. Flip the stack
    # in place so this script is self-sufficient: no separate manual
    # `docker compose up` step after seeding certs.
    echo "$(date -Is) caddy not running — bringing tls profile up"
    docker compose "${COMPOSE_FILES[@]}" --profile tls up -d 2>&1 | tail -5
  fi
fi
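
The comment inside `refetch` describes a bash behaviour that is easy to miss: when a function is called as the condition of `if !`, `set -e` is ignored for every command in the function body. The following standalone sketch reproduces that pitfall in isolation. The names `fake_fetch`, `unsafe_refetch`, and `safe_refetch` are hypothetical stand-ins invented for this example; they are not part of the script above.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for a fetch helper that always fails.
fake_fetch() { return 1; }

# No explicit status check: relies on set -e to stop after a failed fetch.
unsafe_refetch() {
  fake_fetch          # fails, but set -e is ignored inside an `if !` condition
  echo "installed"    # so this line still runs
}

# Explicit status check, the pattern the refetch comment recommends.
safe_refetch() {
  if ! fake_fetch; then
    return 1
  fi
  echo "installed"
}

# The unsafe variant "succeeds", so the caller's fallback never fires.
if ! unsafe_refetch; then echo "unsafe: fallback"; fi   # prints "installed"

# The safe variant reports the failure and the fallback branch runs.
if ! safe_refetch; then echo "safe: fallback"; fi       # prints "safe: fallback"
```

Called on its own line outside any condition, `unsafe_refetch` would abort the script at `fake_fetch` as expected. The difference only appears in conditional context, which is why the explicit `return 1` is the more predictable choice.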