agnes-the-ai-analyst/services/scheduler/__main__.py
Vojtech 107195730d
feat(observability): optional PostHog integration (#231)
* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)

Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.

Coverage:
  - FastAPI 500 handler captures unhandled exceptions
  - src/orchestrator.py rebuild + rebuild_source failures
  - services/scheduler/ HTTP-job failures
  - cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
    skipped; flushes before re-raise so short-lived CLI invocations don't
    drop events)
  - connectors/llm/anthropic_provider.py + openai_compat.py emit
    $ai_generation events with provider, model, latency, token counts
    (prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
    because LLM prompts here routinely include customer SQL/data)
  - Browser snippet injected into every text/html response by
    PosthogInjectionMiddleware — registered inside the GZip layer so it
    sees uncompressed HTML before compression. Many templates are
    standalone (their own DOCTYPE) and never extend base.html, so a
    per-template include would miss them.
  - Frontend: $pageview, $pageleave, JS error capture via window.error
    and unhandledrejection handlers, masked session replay
    (maskAllInputs: true plus CSS-selector mask for known data surfaces),
    feature flags (browser posthog.isFeatureEnabled + server-side
    feature_enabled with fallback for older SDKs).

Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).

Files:
  - src/observability/posthog_client.py — lazy singleton, no network
    when disabled, single-process flush on shutdown
  - src/observability/llm_tracing.py — trace_generation context manager
  - app/middleware/posthog_inject.py — HTML rewrite middleware
  - app/web/templates/_posthog.html — browser snippet template
  - docs/observability.md — operator guide
  - config/.env.template — documented POSTHOG_* knobs
  - tests/test_posthog_disabled.py + tests/test_posthog_client.py +
    tests/test_llm_tracing.py — 18 tests covering disabled state,
    identify-mode payloads, $ai_generation shape, error variant.

CHANGELOG entry under [Unreleased] Added.

* feat(observability): tag every PostHog event with environment + release

Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.

- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
  LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
  else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
  for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
  arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
  the loaded callback so $pageview, $exception, autocapture, etc. all
  carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
  can actually call posthog.identify(user_id, {email}) for logged-in
  users (previously the user block always resolved to None because
  nothing wrote to request.state.user).

4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.

* feat(observability): inline user attrs on every PostHog event + debug throw route

PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.

- Backend capture_exception merges user_id / user_email / user_name into
  the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
  Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
  properties via posthog.register({...}) so every $exception, $pageview,
  and custom event coming from posthog.captureException() carries them
  inline. Mirrors the backend so cross-referencing client/server events
  doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
  Runs Depends(get_current_user) first so request.state.user is set when
  the unhandled-exception handler captures the event. Lets operators
  exercise the full observability path end-to-end without hand-rolling
  a TestClient script. Configurable via ?kind=ValueError&msg=...

7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.

* fix(observability): address minasarustamyan PR #231 review

Two bugs caught in review.

1. PosthogInjectionMiddleware dropped Response.background on every
   return path. BaseHTTPMiddleware materialises the body and asks
   subclasses to return a fresh Response — three paths in dispatch()
   omitted background=, silently cancelling any BackgroundTask /
   BackgroundTasks the route attached (audit logging, async webhooks,
   email sends) with no log line. Fix: route every return through a
   _passthrough() helper that forwards background.

   Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
   can't balloon RSS during buffering. Bigger bodies short-circuit
   through with a warning rather than being injected.

   Regression tests in tests/test_posthog_inject_middleware.py exercise
   four return paths (snippet present, render-fail, double-injection
   guard, non-HTML passthrough) plus the streaming-guard short-circuit.

2. $ai_input / $ai_output_choices were emitted without truncation, so
   POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
   per-event ingest limit — exactly the calls (large prompts with
   schemas / sample rows / SQL) an operator would want to inspect.
   Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
   an explicit "…[truncated N chars]" marker so readers don't mistake
   truncated captures for complete ones. Metadata (provider, model,
   tokens, latency, error) flows regardless. Three new tests cover
   default-cap clipping, env-override, and pass-through under the cap.

37 PostHog tests pass.
2026-05-08 17:57:10 +04:00

238 lines
9.6 KiB
Python

"""Scheduler service — replaces systemd timers.
Lightweight sidecar that fires scheduled jobs over HTTP against the main
app. Authenticates with ``SCHEDULER_API_TOKEN`` (shared-secret synthetic
admin — see ``app.auth.scheduler_token``); falls back to no-auth in
LOCAL_DEV_MODE.
Schedules are strings parsed by ``src.scheduler.is_table_due`` — accepts
"every 15m", "every 1h", "daily 03:00", "daily 07:00,13:00".
Why every job is HTTP and nothing runs in-process: the scheduler container
shares ``/data/state/system.duckdb`` with the app container, but DuckDB
permits only one writer per file across processes. An in-process call
from the scheduler raced the app's long-lived handle and 500-ed on
``Could not set lock on file``. Going through HTTP makes the app the sole
writer; the scheduler is reduced to a pure cron clock.
Usage: python -m services.scheduler
"""
import logging
import os
import signal
import time
from datetime import datetime, timezone
import httpx
from app.logging_config import setup_logging
from src.scheduler import is_table_due
setup_logging(__name__)
logger = logging.getLogger(__name__)
API_URL = os.environ.get("API_URL", "http://localhost:8000")
SCHEDULER_API_TOKEN = os.environ.get("SCHEDULER_API_TOKEN", "")
_token_warning_emitted = False
def _get_auth_token() -> str:
"""Return the bearer token for API calls.
Production: ``SCHEDULER_API_TOKEN`` is a shared secret generated by the
Terraform startup script and written to ``/opt/agnes/.env``. Both the
``app`` and ``scheduler`` containers source the same .env via Docker
Compose ``env_file:``, so the secret is symmetric. The app validates
incoming Bearer tokens against this env var (constant-time compare in
``app.auth.scheduler_token``) and resolves matches to a synthetic
``scheduler@system.local`` user that is a member of the Admin group.
Dev / LOCAL_DEV_MODE: leave it unset. The scheduler returns the empty
string and calls the API without an ``Authorization`` header — the
API's dev-bypass auto-authenticates the request as the dev user.
"""
global _token_warning_emitted
if SCHEDULER_API_TOKEN:
return SCHEDULER_API_TOKEN
if not _token_warning_emitted:
logger.warning(
"SCHEDULER_API_TOKEN is not set — calling the API without "
"Authorization. Required in production; in LOCAL_DEV_MODE "
"the dev-bypass auto-authenticates and this is fine."
)
_token_warning_emitted = True
return ""
# --- Env parsing ------------------------------------------------------------
_DEFAULTS = {
"SCHEDULER_DATA_REFRESH_INTERVAL": 15 * 60, # seconds
"SCHEDULER_HEALTH_CHECK_INTERVAL": 5 * 60,
"SCHEDULER_SCRIPT_RUN_INTERVAL": 1 * 60,
"SCHEDULER_TICK_SECONDS": 30,
# LLM pipeline cadences (#176, #179 review). Defaults preserve the
# 10m / 15m / 17m coprime offset so the three jobs don't fire on the
# same tick and stack their API + DB load. The verification-detector
# default (900s) is also the source of truth for the health-check
# staleness grace window in app/api/health.py — single env var drives
# both, so an operator changing the cadence moves both.
"SCHEDULER_SESSION_COLLECTOR_INTERVAL": 10 * 60,
"SCHEDULER_VERIFICATION_DETECTOR_INTERVAL": 15 * 60,
"SCHEDULER_CORPORATE_MEMORY_INTERVAL": 17 * 60,
}
def _read_positive_int(name: str) -> int:
"""Read an env var as a positive integer or fall back to the default.
Treats unset env (``None``) as "use default". Treats explicitly empty
string (``""``) as an operator typo and raises — silently defaulting
on a literal ``FOO=`` in the env_file would mask configuration bugs.
"""
raw = os.environ.get(name)
if raw is None:
if name not in _DEFAULTS:
raise ValueError(f"Unknown scheduler env var: {name}")
return _DEFAULTS[name]
if raw == "":
raise ValueError(f"{name}='' must be a positive integer (seconds)")
try:
value = int(raw)
except (TypeError, ValueError):
raise ValueError(f"{name}={raw!r} must be a positive integer (seconds)")
if value <= 0:
raise ValueError(f"{name}={value} must be > 0 (seconds)")
return value
def _seconds_to_schedule(seconds: int) -> str:
"""Convert a seconds value to the closest 'every Nm' / 'every Nh' string.
Uses ceiling division so a non-multiple-of-60 input never produces a
schedule that fires MORE often than the operator configured (90s →
'every 2m', not 'every 1m'). Sub-minute inputs clamp to 'every 1m'
because the schedule grammar has minute-level resolution.
"""
if seconds % 3600 == 0 and seconds >= 3600:
return f"every {seconds // 3600}h"
# Ceiling division: -(-x // y) is the standard trick.
minutes = max(1, -(-seconds // 60))
return f"every {minutes}m"
def resolved_tick_seconds() -> int:
"""Read + validate SCHEDULER_TICK_SECONDS in isolation (test helper)."""
return _read_positive_int("SCHEDULER_TICK_SECONDS")
def build_jobs() -> list[tuple[str, str, str, str, int]]:
"""Build the JOBS list from env, applying defaults and validation.
Tuple shape: (name, schedule_string, endpoint, method, http_timeout_sec).
Marketplaces stays hardcoded — promoting it to env is out of #77 scope.
"""
refresh = _read_positive_int("SCHEDULER_DATA_REFRESH_INTERVAL")
health = _read_positive_int("SCHEDULER_HEALTH_CHECK_INTERVAL")
scripts = _read_positive_int("SCHEDULER_SCRIPT_RUN_INTERVAL")
sess = _read_positive_int("SCHEDULER_SESSION_COLLECTOR_INTERVAL")
verify = _read_positive_int("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
corpmem = _read_positive_int("SCHEDULER_CORPORATE_MEMORY_INTERVAL")
tick = _read_positive_int("SCHEDULER_TICK_SECONDS")
smallest = min(refresh, health, scripts, sess, verify, corpmem)
if tick > smallest:
raise ValueError(
f"SCHEDULER_TICK_SECONDS={tick} must be <= the smallest job "
f"interval ({smallest}s) so jobs don't consistently miss their "
f"cadence by up to one tick"
)
return [
("data-refresh", _seconds_to_schedule(refresh), "/api/sync/trigger", "POST", 120),
("health-check", _seconds_to_schedule(health), "/api/health", "GET", 30),
("script-runner", _seconds_to_schedule(scripts), "/api/scripts/run-due", "POST", 600),
("marketplaces", "daily 03:00", "/api/marketplaces/sync-all", "POST", 900),
# LLM pipeline (#176, #179 review). Cadences are deliberately offset
# (10m / 15m / 17m by default — all coprime modulo the 30s tick) so
# the three LLM-driven jobs don't fire on the same tick and stack
# their API + DB load. Driven by env so an operator can throttle
# without a code change; the verification-detector cadence is the
# single source of truth for the health-check staleness grace
# window in app/api/health.py (which uses 2x the cadence).
("session-collector", _seconds_to_schedule(sess), "/api/admin/run-session-collector", "POST", 300),
("verification-detector", _seconds_to_schedule(verify), "/api/admin/run-verification-detector", "POST", 900),
("corporate-memory", _seconds_to_schedule(corpmem), "/api/admin/run-corporate-memory", "POST", 900),
]
_running = True
def _signal_handler(sig, frame):
global _running
logger.info(f"Received signal {sig}, shutting down...")
_running = False
def _call_api(endpoint: str, method: str, timeout_sec: int) -> bool:
"""Call the main app API. Returns True on success."""
url = f"{API_URL}{endpoint}"
headers = {}
token = _get_auth_token()
if token:
headers["Authorization"] = f"Bearer {token}"
try:
if method == "POST":
resp = httpx.post(url, headers=headers, timeout=timeout_sec)
else:
resp = httpx.get(url, headers=headers, timeout=timeout_sec)
if resp.status_code < 400:
logger.info(f"Job {endpoint}: {resp.status_code}")
return True
else:
logger.warning(f"Job {endpoint}: HTTP {resp.status_code} - {resp.text[:200]}")
return False
except Exception as e:
logger.error(f"Job {endpoint} failed: {e}")
try:
from src.observability import get_posthog
get_posthog().capture_exception(
e,
distinct_id="system",
properties={"job": endpoint, "method": method, "component": "scheduler"},
)
except Exception:
logger.exception("PostHog capture_exception failed in scheduler")
return False
def run():
signal.signal(signal.SIGTERM, _signal_handler)
signal.signal(signal.SIGINT, _signal_handler)
jobs = build_jobs()
tick = resolved_tick_seconds()
logger.info(
"Scheduler started. API_URL=%s, %d jobs, tick=%ds. Schedules: %s",
API_URL, len(jobs), tick,
{name: schedule for name, schedule, *_ in jobs},
)
last_run: dict[str, str | None] = {name: None for name, *_ in jobs}
while _running:
now_iso = datetime.now(timezone.utc).isoformat()
for name, schedule, endpoint, method, timeout_sec in jobs:
if not is_table_due(schedule, last_run[name]):
continue
logger.info("Running job: %s (%s)", name, schedule)
ok = _call_api(endpoint, method, timeout_sec)
if ok:
last_run[name] = now_iso
time.sleep(tick)
logger.info("Scheduler stopped.")
if __name__ == "__main__":
run()