* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)
Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.
Coverage:
- FastAPI 500 handler captures unhandled exceptions
- src/orchestrator.py rebuild + rebuild_source failures
- services/scheduler/ HTTP-job failures
- cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
skipped; flushes before re-raise so short-lived CLI invocations don't
drop events)
- connectors/llm/anthropic_provider.py + openai_compat.py emit
$ai_generation events with provider, model, latency, token counts
(prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
because LLM prompts here routinely include customer SQL/data)
- Browser snippet injected into every text/html response by
PosthogInjectionMiddleware — registered inside the GZip layer so it
sees uncompressed HTML before compression. Many templates are
standalone (their own DOCTYPE) and never extend base.html, so a
per-template include would miss them.
- Frontend: $pageview, $pageleave, JS error capture via window.error
and unhandledrejection handlers, masked session replay
(maskAllInputs: true plus CSS-selector mask for known data surfaces),
feature flags (browser posthog.isFeatureEnabled + server-side
feature_enabled with fallback for older SDKs).
Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).
Files:
- src/observability/posthog_client.py — lazy singleton, no network
when disabled, single-process flush on shutdown
- src/observability/llm_tracing.py — trace_generation context manager
- app/middleware/posthog_inject.py — HTML rewrite middleware
- app/web/templates/_posthog.html — browser snippet template
- docs/observability.md — operator guide
- config/.env.template — documented POSTHOG_* knobs
- tests/test_posthog_disabled.py + tests/test_posthog_client.py +
tests/test_llm_tracing.py — 18 tests covering disabled state,
identify-mode payloads, $ai_generation shape, error variant.
CHANGELOG entry under [Unreleased] Added.
* feat(observability): tag every PostHog event with environment + release
Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.
- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
the loaded callback so $pageview, $exception, autocapture, etc. all
carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
can actually call posthog.identify(user_id, {email}) for logged-in
users (previously the user block always resolved to None because
nothing wrote to request.state.user).
4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.
* feat(observability): inline user attrs on every PostHog event + debug throw route
PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.
- Backend capture_exception merges user_id / user_email / user_name into
the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
properties via posthog.register({...}) so every $exception, $pageview,
and custom event coming from posthog.captureException() carries them
inline. Mirrors the backend so cross-referencing client/server events
doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
Runs Depends(get_current_user) first so request.state.user is set when
the unhandled-exception handler captures the event. Lets operators
exercise the full observability path end-to-end without hand-rolling
a TestClient script. Configurable via ?kind=ValueError&msg=...
7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.
* fix(observability): address minasarustamyan PR #231 review
Two bugs caught in review.
1. PosthogInjectionMiddleware dropped Response.background on every
return path. BaseHTTPMiddleware materialises the body and asks
subclasses to return a fresh Response — three paths in dispatch()
omitted background=, silently cancelling any BackgroundTask /
BackgroundTasks the route attached (audit logging, async webhooks,
email sends) with no log line. Fix: route every return through a
_passthrough() helper that forwards background.
Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
can't balloon RSS during buffering. Bigger bodies short-circuit
through with a warning rather than being injected.
Regression tests in tests/test_posthog_inject_middleware.py exercise
four return paths (snippet present, render-fail, double-injection
guard, non-HTML passthrough) plus the streaming-guard short-circuit.
2. $ai_input / $ai_output_choices were emitted without truncation, so
POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
per-event ingest limit — exactly the calls (large prompts with
schemas / sample rows / SQL) an operator would want to inspect.
Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
an explicit "…[truncated N chars]" marker so readers don't mistake
truncated captures for complete ones. Metadata (provider, model,
tokens, latency, error) flows regardless. Three new tests cover
default-cap clipping, env-override, and pass-through under the cap.
37 PostHog tests pass.
238 lines
9.6 KiB
Python
238 lines
9.6 KiB
Python
"""Scheduler service — replaces systemd timers.
|
|
|
|
Lightweight sidecar that fires scheduled jobs over HTTP against the main
|
|
app. Authenticates with ``SCHEDULER_API_TOKEN`` (shared-secret synthetic
|
|
admin — see ``app.auth.scheduler_token``); falls back to no-auth in
|
|
LOCAL_DEV_MODE.
|
|
|
|
Schedules are strings parsed by ``src.scheduler.is_table_due`` — accepts
|
|
"every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".
|
|
|
|
Why every job is HTTP and nothing runs in-process: the scheduler container
|
|
shares ``/data/state/system.duckdb`` with the app container, but DuckDB
|
|
permits only one writer per file across processes. An in-process call
|
|
from the scheduler raced the app's long-lived handle and 500-ed on
|
|
``Could not set lock on file``. Going through HTTP makes the app the sole
|
|
writer; the scheduler is reduced to a pure cron clock.
|
|
|
|
Usage: python -m services.scheduler
|
|
"""
|
|
|
|
import logging
|
|
import os
|
|
import signal
|
|
import time
|
|
from datetime import datetime, timezone
|
|
|
|
import httpx
|
|
|
|
from app.logging_config import setup_logging
|
|
from src.scheduler import is_table_due
|
|
|
|
setup_logging(__name__)
|
|
logger = logging.getLogger(__name__)
|
|
|
|
API_URL = os.environ.get("API_URL", "http://localhost:8000")
|
|
SCHEDULER_API_TOKEN = os.environ.get("SCHEDULER_API_TOKEN", "")
|
|
|
|
_token_warning_emitted = False
|
|
|
|
|
|
def _get_auth_token() -> str:
|
|
"""Return the bearer token for API calls.
|
|
|
|
Production: ``SCHEDULER_API_TOKEN`` is a shared secret generated by the
|
|
Terraform startup script and written to ``/opt/agnes/.env``. Both the
|
|
``app`` and ``scheduler`` containers source the same .env via Docker
|
|
Compose ``env_file:``, so the secret is symmetric. The app validates
|
|
incoming Bearer tokens against this env var (constant-time compare in
|
|
``app.auth.scheduler_token``) and resolves matches to a synthetic
|
|
``scheduler@system.local`` user that is a member of the Admin group.
|
|
|
|
Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
|
|
string and calls the API without an ``Authorization`` header — the
|
|
API's dev-bypass auto-authenticates the request as the dev user.
|
|
"""
|
|
global _token_warning_emitted
|
|
if SCHEDULER_API_TOKEN:
|
|
return SCHEDULER_API_TOKEN
|
|
if not _token_warning_emitted:
|
|
logger.warning(
|
|
"SCHEDULER_API_TOKEN is not set — calling the API without "
|
|
"Authorization. Required in production; in LOCAL_DEV_MODE "
|
|
"the dev-bypass auto-authenticates and this is fine."
|
|
)
|
|
_token_warning_emitted = True
|
|
return ""
|
|
|
|
|
|
# --- Env parsing ------------------------------------------------------------
|
|
|
|
_DEFAULTS = {
|
|
"SCHEDULER_DATA_REFRESH_INTERVAL": 15 * 60, # seconds
|
|
"SCHEDULER_HEALTH_CHECK_INTERVAL": 5 * 60,
|
|
"SCHEDULER_SCRIPT_RUN_INTERVAL": 1 * 60,
|
|
"SCHEDULER_TICK_SECONDS": 30,
|
|
# LLM pipeline cadences (#176, #179 review). Defaults preserve the
|
|
# 10m / 15m / 17m coprime offset so the three jobs don't fire on the
|
|
# same tick and stack their API + DB load. The verification-detector
|
|
# default (900s) is also the source of truth for the health-check
|
|
# staleness grace window in app/api/health.py — single env var drives
|
|
# both, so an operator changing the cadence moves both.
|
|
"SCHEDULER_SESSION_COLLECTOR_INTERVAL": 10 * 60,
|
|
"SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 15 * 60,
|
|
"SCHEDULER_CORPORATE_MEMORY_INTERVAL": 17 * 60,
|
|
}
|
|
|
|
|
|
def _read_positive_int(name: str) -> int:
|
|
"""Read an env var as a positive integer or fall back to the default.
|
|
|
|
Treats unset env (``None``) as "use default". Treats explicitly empty
|
|
string (``""``) as an operator typo and raises — silently defaulting
|
|
on a literal ``FOO=`` in the env_file would mask configuration bugs.
|
|
"""
|
|
raw = os.environ.get(name)
|
|
if raw is None:
|
|
if name not in _DEFAULTS:
|
|
raise ValueError(f"Unknown scheduler env var: {name}")
|
|
return _DEFAULTS[name]
|
|
if raw == "":
|
|
raise ValueError(f"{name}='' must be a positive integer (seconds)")
|
|
try:
|
|
value = int(raw)
|
|
except (TypeError, ValueError):
|
|
raise ValueError(f"{name}={raw!r} must be a positive integer (seconds)")
|
|
if value <= 0:
|
|
raise ValueError(f"{name}={value} must be > 0 (seconds)")
|
|
return value
|
|
|
|
|
|
def _seconds_to_schedule(seconds: int) -> str:
|
|
"""Convert a seconds value to the closest 'every Nm' / 'every Nh' string.
|
|
|
|
Uses ceiling division so a non-multiple-of-60 input never produces a
|
|
schedule that fires MORE often than the operator configured (90s →
|
|
'every 2m', not 'every 1m'). Sub-minute inputs clamp to 'every 1m'
|
|
because the schedule grammar has minute-level resolution.
|
|
"""
|
|
if seconds % 3600 == 0 and seconds >= 3600:
|
|
return f"every {seconds // 3600}h"
|
|
# Ceiling division: -(-x // y) is the standard trick.
|
|
minutes = max(1, -(-seconds // 60))
|
|
return f"every {minutes}m"
|
|
|
|
|
|
def resolved_tick_seconds() -> int:
|
|
"""Read + validate SCHEDULER_TICK_SECONDS in isolation (test helper)."""
|
|
return _read_positive_int("SCHEDULER_TICK_SECONDS")
|
|
|
|
|
|
def build_jobs() -> list[tuple[str, str, str, str, int]]:
|
|
"""Build the JOBS list from env, applying defaults and validation.
|
|
|
|
Tuple shape: (name, schedule_string, endpoint, method, http_timeout_sec).
|
|
Marketplaces stays hardcoded — promoting it to env is out of #77 scope.
|
|
"""
|
|
refresh = _read_positive_int("SCHEDULER_DATA_REFRESH_INTERVAL")
|
|
health = _read_positive_int("SCHEDULER_HEALTH_CHECK_INTERVAL")
|
|
scripts = _read_positive_int("SCHEDULER_SCRIPT_RUN_INTERVAL")
|
|
sess = _read_positive_int("SCHEDULER_SESSION_COLLECTOR_INTERVAL")
|
|
verify = _read_positive_int("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
|
|
corpmem = _read_positive_int("SCHEDULER_CORPORATE_MEMORY_INTERVAL")
|
|
tick = _read_positive_int("SCHEDULER_TICK_SECONDS")
|
|
smallest = min(refresh, health, scripts, sess, verify, corpmem)
|
|
if tick > smallest:
|
|
raise ValueError(
|
|
f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
|
|
f"interval ({smallest}s) so jobs don't consistently miss their "
|
|
f"cadence by up to one tick"
|
|
)
|
|
return [
|
|
("data-refresh", _seconds_to_schedule(refresh), "/api/sync/trigger", "POST", 120),
|
|
("health-check", _seconds_to_schedule(health), "/api/health", "GET", 30),
|
|
("script-runner", _seconds_to_schedule(scripts), "/api/scripts/run-due", "POST", 600),
|
|
("marketplaces", "daily 03:00", "/api/marketplaces/sync-all", "POST", 900),
|
|
# LLM pipeline (#176, #179 review). Cadences are deliberately offset
|
|
# (10m / 15m / 17m by default — all coprime modulo the 30s tick) so
|
|
# the three LLM-driven jobs don't fire on the same tick and stack
|
|
# their API + DB load. Driven by env so an operator can throttle
|
|
# without a code change; the verification-detector cadence is the
|
|
# single source of truth for the health-check staleness grace
|
|
# window in app/api/health.py (which uses 2x the cadence).
|
|
("session-collector", _seconds_to_schedule(sess), "/api/admin/run-session-collector", "POST", 300),
|
|
("verification-detector", _seconds_to_schedule(verify), "/api/admin/run-verification-detector", "POST", 900),
|
|
("corporate-memory", _seconds_to_schedule(corpmem), "/api/admin/run-corporate-memory", "POST", 900),
|
|
]
|
|
|
|
_running = True
|
|
|
|
|
|
def _signal_handler(sig, frame):
|
|
global _running
|
|
logger.info(f"Received signal {sig}, shutting down...")
|
|
_running = False
|
|
|
|
|
|
def _call_api(endpoint: str, method: str, timeout_sec: int) -> bool:
|
|
"""Call the main app API. Returns True on success."""
|
|
url = f"{API_URL}{endpoint}"
|
|
headers = {}
|
|
token = _get_auth_token()
|
|
if token:
|
|
headers["Authorization"] = f"Bearer {token}"
|
|
try:
|
|
if method == "POST":
|
|
resp = httpx.post(url, headers=headers, timeout=timeout_sec)
|
|
else:
|
|
resp = httpx.get(url, headers=headers, timeout=timeout_sec)
|
|
if resp.status_code < 400:
|
|
logger.info(f"Job {endpoint}: {resp.status_code}")
|
|
return True
|
|
else:
|
|
logger.warning(f"Job {endpoint}: HTTP {resp.status_code} - {resp.text[:200]}")
|
|
return False
|
|
except Exception as e:
|
|
logger.error(f"Job {endpoint} failed: {e}")
|
|
try:
|
|
from src.observability import get_posthog
|
|
get_posthog().capture_exception(
|
|
e,
|
|
distinct_id="system",
|
|
properties={"job": endpoint, "method": method, "component": "scheduler"},
|
|
)
|
|
except Exception:
|
|
logger.exception("PostHog capture_exception failed in scheduler")
|
|
return False
|
|
|
|
|
|
def run():
|
|
signal.signal(signal.SIGTERM, _signal_handler)
|
|
signal.signal(signal.SIGINT, _signal_handler)
|
|
|
|
jobs = build_jobs()
|
|
tick = resolved_tick_seconds()
|
|
logger.info(
|
|
"Scheduler started. API_URL=%s, %d jobs, tick=%ds. Schedules: %s",
|
|
API_URL, len(jobs), tick,
|
|
{name: schedule for name, schedule, *_ in jobs},
|
|
)
|
|
|
|
last_run: dict[str, str | None] = {name: None for name, *_ in jobs}
|
|
|
|
while _running:
|
|
now_iso = datetime.now(timezone.utc).isoformat()
|
|
for name, schedule, endpoint, method, timeout_sec in jobs:
|
|
if not is_table_due(schedule, last_run[name]):
|
|
continue
|
|
logger.info("Running job: %s (%s)", name, schedule)
|
|
ok = _call_api(endpoint, method, timeout_sec)
|
|
if ok:
|
|
last_run[name] = now_iso
|
|
time.sleep(tick)
|
|
|
|
logger.info("Scheduler stopped.")
|
|
|
|
|
|
if __name__ == "__main__":
|
|
run()
|