feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135 )

Bundles 4 issues:
- #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators)
- #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation)
- #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS)
- #89 — OpenMetadataClient verify=True default (BREAKING for self-signed)

Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.

2026-04-29 22:06:30 +02:00

11 KiB

Raw Blame History

Deployment Guide

Agnes supports two deployment paths. Pick the one that matches your use case.

1. Terraform — managed, multi-customer (recommended)

For Keboola-operated deployments and anyone running Agnes for multiple customers on GCP.

Follow: ONBOARDING.md

Highlights:

Per-customer GCP project + private infra repo cloned from keboola/agnes-infra-template
Reusable Terraform module infra/modules/customer-instance (versioned — infra-vX.Y.Z tags)
Prod + optional branch-aware dev VMs
Persistent SSD data disk with daily snapshots
Secret Manager for tokens (no plaintext in VM metadata)
OS Login for SSH, dedicated VM service account with scoped secretAccessor
Cron-based auto-upgrade (pulls :stable image digest every 5 min)
Caddy TLS with corporate-CA or self-managed certs mounted from /data/state/certs; daily auto-rotation from a URL (TLS_FULLCHAIN_URL) with zero-downtime SIGUSR1 reload
Uptime check + alert policy per VM (wire a notification channel to be paged)
CI/CD in the private repo: PR → terraform plan, merge to main → apply-dev auto, apply-prod gated by reviewer
First-boot bootstrap via POST /auth/bootstrap

Target onboarding time: < 1 hour per customer.

2. Docker Compose — OSS self-host

For running Agnes on your own VM / bare metal without Terraform. You're responsible for provisioning and maintenance.

Prerequisites

Ubuntu 24.04 (or any Linux with Docker)
2 vCPU, 2 GB RAM, 30 GB SSD minimum
Docker Engine + Compose plugin
Public IP with ports 80/443 (if using Caddy TLS) or 8000 (plain HTTP) open
Data-source credentials (e.g., Keboola Storage token)

Steps

Clone the Agnes repository:

git clone https://github.com/keboola/agnes-the-ai-analyst.git /opt/agnes
cd /opt/agnes

Create .env:

cat > .env <<'EOF'
JWT_SECRET_KEY=$(openssl rand -hex 32)
DATA_DIR=/data
DATA_SOURCE=keboola
KEBOOLA_STORAGE_TOKEN=<your-token>
KEBOOLA_STACK_URL=<your-stack-url>
SEED_ADMIN_EMAIL=<your-email>
LOG_LEVEL=info
AGNES_TAG=stable
EOF
chmod 600 .env

Mount a persistent disk at /data (optional but recommended — survives host rebuild). If you do, use the overlay:

docker compose \
    -f docker-compose.yml \
    -f docker-compose.prod.yml \
    -f docker-compose.host-mount.yml \
    up -d

Without a persistent disk (data on Docker named volume, tied to boot disk):

docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

Bootstrap your admin password via POST /auth/bootstrap:

curl -X POST http://<host>:8000/auth/bootstrap \
    -H "Content-Type: application/json" \
    -d '{"email":"<your-email>","password":"<strong-password>"}'

Open http://<host>:8000/login and sign in.

TLS (optional)

Caddy runs as the TLS terminator. It reads certs from /data/state/certs/{fullchain,privkey}.pem bind-mounted into the container. Two provisioning modes:

A. Public internet (Let's Encrypt) — for this path, override the Caddyfile to drop the tls directive (so Caddy auto-issues) and skip steps below. Not covered here anymore; see git history prior to the feat(tls) change if you need the ACME flow.

B. Corporate CA / self-managed certs (recommended, and what the infra repo ships):

Two bring-up flows, picked by whether TLS_PRIVKEY_URL is set in .env:

On-VM gen (preferred for new deployments): leave TLS_PRIVKEY_URL empty. On first run, agnes-tls-rotate.sh generates an RSA-2048 key + CSR directly into /data/state/certs/ using the subject string from TLS_CSR_SUBJECT. The key never leaves the host; the CSR (/data/state/certs/cert.csr) is what you submit to your corporate PKI. Until the CA signs and publishes, rotate falls back to a 30-day self-signed cert against the same key so Caddy can serve :443.
Pre-provisioned key (legacy / VM-replace-resilient): set TLS_PRIVKEY_URL=sm://<secret> (or any supported scheme). Seed the key out-of-band before first rotate. Same real-cert fetch + self-signed fallback applies.

Both modes converge: once the CA publishes the signed chain at TLS_FULLCHAIN_URL, the daily rotate tick atomically swaps the fullchain in place and SIGUSR1-reloads Caddy. Zero key churn, zero downtime, no reload when the URL content hasn't moved.

Set the required env vars in .env:

DOMAIN=agnes.example.com
TLS_FULLCHAIN_URL=https://your-ca.example.com/agnes/fullchain.pem
TLS_PRIVKEY_URL=            # empty → on-VM gen; or sm://<secret>
TLS_CSR_SUBJECT=/C=…/ST=…/L=…/O=…/CN=agnes.example.com

Start with the tls profile + overlay (docker-compose.tls.yml closes host :8000 so all traffic enters via :443):

docker compose \
    -f docker-compose.yml \
    -f docker-compose.prod.yml \
    -f docker-compose.tls.yml \
    --profile tls up -d

Grab the CSR if you used on-VM gen:
```
sudo cat /data/state/certs/cert.csr
```
Submit to your corporate PKI. While waiting, Caddy is already up on :443 with the self-signed fallback.

Automatic rotation

scripts/ops/agnes-tls-rotate.sh is the single entry point — it handles fetch, self-signed fallback, auto-generation on missing key, atomic cert swap, and Caddy reload. Env vars it reads:

Var	Required	Schemes	Notes
`DOMAIN`	yes	—	The hostname Caddy serves + the CN in auto-generated CSRs.
`TLS_FULLCHAIN_URL`	yes	`https://`, `sm://<secret>`, `gs://<obj>`, `file://`	Polled daily; rotate only reloads Caddy when the bytes change.
`TLS_PRIVKEY_URL`	optional	same	Empty activates on-VM gen. Set to pre-provisioned scheme (e.g. `sm://`) for VM-replace resilience.
`TLS_CSR_SUBJECT`	optional	—	Stamped on auto-generated CSRs. Defaults to `/CN=<DOMAIN>` if unset. Example: `/C=US/ST=Illinois/L=Chicago/O=Your Org/CN=agnes.example.com`.

scripts/tls-fetch.sh at /usr/local/bin/tls-fetch.sh is required (generic URL fetcher used by rotate). On infra-repo-managed VMs, both scripts are installed by startup.sh and fired via a daily systemd timer; for manual compose deployments, copy them under /usr/local/bin/ and wire a systemd timer (OnBootSec=10min, OnUnitActiveSec=24h, Persistent=true).

Upgrades (manual)

cd /opt/agnes
git pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml pull
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

Or set up a cron job — see infra/modules/customer-instance/startup-script.sh.tpl for the reference implementation.

Health checks & external monitoring

Two health endpoints serve different audiences:

Endpoint	Auth	Response	Use for
`GET /api/health`	None	`{"status": "ok"}`	Load balancers, Docker `healthcheck`, uptime pings
`GET /api/health/detailed`	Bearer token	`{"status", "version", "services": {...}}`	Dashboards, alerting rules, `da diagnose`/`da status` CLI

The Docker Compose healthcheck uses the minimal endpoint (curl -sf http://localhost:8000/api/health). For external monitoring tools (Datadog, Prometheus, UptimeRobot, etc.) that need service-level detail (DuckDB status, sync freshness, user count), point them at /api/health/detailed with an Authorization: Bearer <token> header. Any authenticated user can call it; a personal access token (da admin create-pat) works well for service accounts.

Scheduler tuning

The scheduler sidecar (services/scheduler/__main__.py) fires periodic HTTP calls against the main app. Job cadences are configurable via env vars on the scheduler container:

Env var	Default	Purpose
`SCHEDULER_DATA_REFRESH_INTERVAL`	`900`	seconds between `POST /api/sync/trigger`
`SCHEDULER_HEALTH_CHECK_INTERVAL`	`300`	seconds between `GET /api/health`
`SCHEDULER_SCRIPT_RUN_INTERVAL`	`60`	seconds between `POST /api/scripts/run-due`
`SCHEDULER_TICK_SECONDS`	`30`	loop polling cadence; must be ≤ smallest interval above

/api/sync/trigger walks table_registry; tables with a per-row sync_schedule (every Nm / every Nh / daily HH:MM[,...]) are filtered to only those due for sync since their last run. Tables without a schedule continue to run on every tick. The marketplace job runs at daily 03:00 UTC and is not currently env-tunable.

/api/scripts/run-due walks script_registry and runs each deployed script whose schedule says it is due. Scripts in the running state are skipped on subsequent ticks until the previous run writes a terminal status. The endpoint requires admin auth (the sidecar's SCHEDULER_API_TOKEN resolves to a synthetic Admin user).

Caveats

Schedule quantization rounds up. The schedule grammar has minute- level resolution. Non-multiples of 60 seconds round UP to the next minute (SCHEDULER_DATA_REFRESH_INTERVAL=90 → every 2m, not every 1m) so a job never fires more often than configured. Sub-minute values clamp to every 1m. Use multiples of 60 for predictable cadence.
A crashed BackgroundTask can leave a script stuck in last_status='running'. The next sidecar tick will skip the stuck script forever. Recovery is manual: open a DuckDB shell on system.duckdb and run UPDATE script_registry SET last_status = NULL WHERE id = '<id>'; Auto-recovery via max-runtime detection is intentionally out of scope for v0; revisit if it happens in practice.

Which path should I pick?

	Terraform	Docker Compose
Setup time	~45 min first customer, ~15 min each subsequent	~30 min
Infra-as-Code	Full (all resources in git)	Partial (compose.yml only)
Secret storage	GCP Secret Manager	`.env` file on host
Upgrades	Auto via cron, gated prod apply	Manual `docker compose pull`
Backups	Daily GCP snapshots, 30-day retention	You set up yourself
Monitoring / alerts	GCP Uptime Checks + alert policy	You set up yourself
TLS	Caddy + corp cert, auto-rotated from URL	Caddy + corp cert, manual or user-scripted rotation
Best for	Multi-tenant SaaS, production	Single-instance self-host, learning

ONBOARDING.md — end-to-end Terraform onboarding checklist
CONFIGURATION.md — instance.yaml, env vars, per-instance config
architecture.md — internal architecture (orchestrator, extractors, DB layout)
QUICKSTART.md — local development setup
superpowers/specs/2026-04-21-multi-customer-deployment-spec.md — design rationale for the multi-customer model

11 KiB Raw Blame History