fix(tls-rotate): chown CERT_DIR to UID 999 so the app container can read its own certs (#143)

The script's `mkdir -p` left ownership of `/data/state/certs/` to whichever
process won the create race — root when systemd's timer fired before the
app container's first volume init, UID 999 when the container ran first.
With mode 700, a root-owned dir blocks the UID-999 agnes container from
reading its own fullchain.pem; `_read_agnes_ca_pem()` returns None, and
the cross-platform TLS trust block (Step 0 from PR #137) silently
disappears from the /install setup prompt. Operators on the unlucky-race
VMs got a setup prompt that couldn't bootstrap client trust against the
self-signed host. Existing VMs self-heal on next timer tick.
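
The failure mode and the fix can be reproduced on a scratch directory. The following is a minimal sketch, not the script's actual code: `owner_mismatch` and `ensure_owner_mode` are hypothetical helper names, the scratch path stands in for `/data/state/certs`, and GNU coreutils `stat` is assumed.

```shell
#!/usr/bin/env bash
# Sketch only: exercises the pattern on a scratch directory, never on /data/state/certs.
# Assumes GNU coreutils `stat` (-c format flag); BSD stat spells this -f.

# Hypothetical helper: succeed when DIR's owner differs from EXPECTED_UID.
owner_mismatch() {
  local dir=$1 expected_uid=$2
  [ "$(stat -c '%u' "$dir")" != "$expected_uid" ]
}

# Hypothetical helper mirroring the script's sequence:
# create the directory, best-effort chown, then chmod 700.
ensure_owner_mode() {
  local dir=$1 owner=$2
  mkdir -p "$dir"
  chown "$owner" "$dir" || true   # non-root callers fail here and carry on
  chmod 700 "$dir"
}

scratch=$(mktemp -d)
ensure_owner_mode "$scratch/certs" 999:999
ensure_owner_mode "$scratch/certs" 999:999   # a second run leaves the same state
stat -c '%u:%g %a' "$scratch/certs"          # prints "<uid>:<gid> <mode>"
if owner_mismatch "$scratch/certs" 999; then
  echo "owner is not 999; with mode 700 a UID-999 process could not read files here"
fi
rm -rf "$scratch"
```

Run as root, the chown takes effect and the mismatch message is not printed; run as an unprivileged user, the chown fails, `|| true` swallows the error, and the mismatch check reports the wrong owner.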
ZdenekSrotyr 2026-04-30 13:21:59 +02:00 committed by GitHub
parent 70672204fe
commit f3d252f17d
2 changed files with 14 additions and 0 deletions

@@ -10,6 +10,9 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
### Fixed
- **`scripts/ops/agnes-tls-rotate.sh` now chowns `/data/state/certs/` to UID 999 (the `agnes` user inside the app image) on every run.** Previously the script only `mkdir -p`'d and `chmod 700`'d the directory, so ownership depended on a race over who created it first: root when systemd fired the timer before docker-compose-up, or UID 999 when the container's volume init touched it first. When root won, the resulting `drwx------ root:root` directory was unreadable by the UID-999 container, `_read_agnes_ca_pem()` returned `None`, and the `/install` setup prompt silently dropped the cross-platform TLS trust block (Step 0 from #137); operators on those VMs ended up with no client-side cert bootstrap and a broken `claude plugin marketplace add` against the self-signed host. The chown is unconditional and idempotent (`|| true` for hosts where the numeric GID can't be set), so re-running the timer self-heals existing VMs without manual `chown` on the operator's part. Files inside the directory keep their existing modes: `fullchain.pem` is `0644` (world-readable, so root- or 999-owned both work for the agnes container) and `privkey.pem` is `0600` (only Caddy reads it, and Caddy's container runs as root).
## [0.23.0] — 2026-04-30
### Added

@@ -36,6 +36,17 @@ set -a; . /opt/agnes/.env; set +a
CERT_DIR=/data/state/certs
mkdir -p "$CERT_DIR"
# Match the agnes UID baked into the app image (Dockerfile: useradd --uid 999).
# Without this, whoever happens to win the create race (this script as root
# vs. the app container's first volume-init touch as 999) decides ownership;
# when root wins, mode 700 leaves the container unable to read its own certs
# and `_read_agnes_ca_pem()` silently returns None, suppressing the trust-
# bootstrap block in the /install setup prompt. `|| true` keeps the script
# resilient on hosts where the GID is reserved (chgrp on a non-existent
# numeric GID is fine on Linux but pedantically fails on some BSD-derived
# tooling); if the chown itself fails we keep going and surface the
# resulting permission error from the next refetch step instead.
chown 999:999 "$CERT_DIR" || true
chmod 700 "$CERT_DIR"
CHANGED=0