agnes-the-ai-analyst/docs/observability.md
Vojtech 107195730d
feat(observability): optional PostHog integration (#231)
* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)

Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.

Coverage:
  - FastAPI 500 handler captures unhandled exceptions
  - src/orchestrator.py rebuild + rebuild_source failures
  - services/scheduler/ HTTP-job failures
  - cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
    skipped; flushes before re-raise so short-lived CLI invocations don't
    drop events)
  - connectors/llm/anthropic_provider.py + openai_compat.py emit
    $ai_generation events with provider, model, latency, token counts
    (prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
    because LLM prompts here routinely include customer SQL/data)
  - Browser snippet injected into every text/html response by
    PosthogInjectionMiddleware — registered inside the GZip layer so it
    sees uncompressed HTML before compression. Many templates are
    standalone (their own DOCTYPE) and never extend base.html, so a
    per-template include would miss them.
  - Frontend: $pageview, $pageleave, JS error capture via window.error
    and unhandledrejection handlers, masked session replay
    (maskAllInputs: true plus CSS-selector mask for known data surfaces),
    feature flags (browser posthog.isFeatureEnabled + server-side
    feature_enabled with fallback for older SDKs).

Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).

Files:
  - src/observability/posthog_client.py — lazy singleton, no network
    when disabled, single-process flush on shutdown
  - src/observability/llm_tracing.py — trace_generation context manager
  - app/middleware/posthog_inject.py — HTML rewrite middleware
  - app/web/templates/_posthog.html — browser snippet template
  - docs/observability.md — operator guide
  - config/.env.template — documented POSTHOG_* knobs
  - tests/test_posthog_disabled.py + tests/test_posthog_client.py +
    tests/test_llm_tracing.py — 18 tests covering disabled state,
    identify-mode payloads, $ai_generation shape, error variant.

CHANGELOG entry under [Unreleased] Added.

* feat(observability): tag every PostHog event with environment + release

Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.

- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
  LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
  else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
  for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
  arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
  the loaded callback so $pageview, $exception, autocapture, etc. all
  carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
  can actually call posthog.identify(user_id, {email}) for logged-in
  users (previously the user block always resolved to None because
  nothing wrote to request.state.user).

4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.

* feat(observability): inline user attrs on every PostHog event + debug throw route

PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.

- Backend capture_exception merges user_id / user_email / user_name into
  the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
  Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
  properties via posthog.register({...}) so every $exception, $pageview,
  and custom event coming from posthog.captureException() carries them
  inline. Mirrors the backend so cross-referencing client/server events
  doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
  Runs Depends(get_current_user) first so request.state.user is set when
  the unhandled-exception handler captures the event. Lets operators
  exercise the full observability path end-to-end without hand-rolling
  a TestClient script. Configurable via ?kind=ValueError&msg=...

7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.

* fix(observability): address minasarustamyan PR #231 review

Two bugs caught in review.

1. PosthogInjectionMiddleware dropped Response.background on every
   return path. BaseHTTPMiddleware materialises the body and asks
   subclasses to return a fresh Response — three paths in dispatch()
   omitted background=, silently cancelling any BackgroundTask /
   BackgroundTasks the route attached (audit logging, async webhooks,
   email sends) with no log line. Fix: route every return through a
   _passthrough() helper that forwards background.

   Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
   can't balloon RSS during buffering. Bigger bodies short-circuit
   through with a warning rather than being injected.

   Regression tests in tests/test_posthog_inject_middleware.py exercise
   four return paths (snippet present, render-fail, double-injection
   guard, non-HTML passthrough) plus the streaming-guard short-circuit.

2. $ai_input / $ai_output_choices were emitted without truncation, so
   POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
   per-event ingest limit — exactly the calls (large prompts with
   schemas / sample rows / SQL) an operator would want to inspect.
   Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
   an explicit "…[truncated N chars]" marker so readers don't mistake
   truncated captures for complete ones. Metadata (provider, model,
   tokens, latency, error) flows regardless. Three new tests cover
   default-cap clipping, env-override, and pass-through under the cap.

37 PostHog tests pass.
2026-05-08 17:57:10 +04:00

Observability — PostHog integration

Optional integration that wires four signals into a single PostHog project:

  1. Backend exceptions — every unhandled FastAPI exception, plus rebuild failures from src/orchestrator.py and HTTP-job failures from services/scheduler/.
  2. LLM tracing — every Anthropic / OpenAI-compat call emits a $ai_generation event with provider, model, latency, and token counts.
  3. Frontend errors + pageviews: window.error / unhandledrejection forwarded via posthog.captureException; automatic $pageview and $pageleave.
  4. Session replay (masked) + feature flags — both gated behind the same single POSTHOG_API_KEY.

The integration ships off by default. Setting one environment variable turns everything on.

Enabling the integration

# Required — the only switch that controls on/off.
# Use a PROJECT key (publishable phc_…), never a personal API key.
POSTHOG_API_KEY=phc_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

That's the entire minimum. Defaults will:

  • Send to https://eu.i.posthog.com (override with POSTHOG_HOST).
  • Identify logged-in users by id + email (override with POSTHOG_IDENTIFY_PII).
  • Record session replay with all inputs and known data surfaces masked (override with POSTHOG_REPLAY=false or POSTHOG_REPLAY_MASK_SELECTOR=…).
  • Skip prompt / completion bodies in LLM events; emit token counts + latency only (override with POSTHOG_LLM_PAYLOADS=1 if you accept the privacy trade-off — LLM prompts in this product routinely include customer SQL and data).

All knobs

| Variable | Default | Notes |
| --- | --- | --- |
| POSTHOG_API_KEY | unset | The on/off switch. Unset = integration is fully off. Project key only. |
| POSTHOG_HOST | https://eu.i.posthog.com | Full URL. Use https://us.i.posthog.com for the US region or your own host. |
| POSTHOG_IDENTIFY_PII | email | none / id / email / full. |
| POSTHOG_REPLAY | true | Set to false to disable replay only, keeping errors / events / flags. |
| POSTHOG_REPLAY_MASK_SELECTOR | empty | CSS selector appended to the default mask list. |
| POSTHOG_LLM_PAYLOADS | 0 | 1 adds $ai_input + $ai_output_choices to LLM events. Off by default. |
| POSTHOG_ENVIRONMENT | auto | Tagged on every event as the environment super-property. Auto-resolves to local when LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV, else unknown. |
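
The table above can be read as a small settings loader. The following is a minimal sketch, not the project's actual code: `PosthogSettings`, `load_settings`, and the set of spellings accepted as "on" are assumptions made for illustration. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PosthogSettings:
    """Hypothetical container for the POSTHOG_* knobs listed in the table."""

    api_key: Optional[str]
    host: str
    identify_pii: str
    replay: bool
    replay_mask_selector: str
    llm_payloads: bool

    @property
    def enabled(self) -> bool:
        # The API key is the only on/off switch.
        return bool(self.api_key)


def _truthy(value: str) -> bool:
    # Assumption: the guide does not say which spellings count as "on".
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> PosthogSettings:
    mode = env.get("POSTHOG_IDENTIFY_PII", "email").strip().lower()
    if mode not in {"none", "id", "email", "full"}:
        # Assumption: unknown values fall back to the documented default.
        mode = "email"
    return PosthogSettings(
        api_key=env.get("POSTHOG_API_KEY") or None,
        host=env.get("POSTHOG_HOST", "https://eu.i.posthog.com"),
        identify_pii=mode,
        replay=_truthy(env.get("POSTHOG_REPLAY", "true")),
        replay_mask_selector=env.get("POSTHOG_REPLAY_MASK_SELECTOR", ""),
        llm_payloads=_truthy(env.get("POSTHOG_LLM_PAYLOADS", "0")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the process environment.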

Splitting traffic by environment

Every captured event — backend exceptions, $ai_generation, browser $pageview, JS errors, custom events — is tagged with two super properties so PostHog dashboards can slice cleanly:

  • environment — resolved at startup (see table above). Operators typically set this to local, staging, or production explicitly, or rely on the auto-resolver.
  • release — the running AGNES_VERSION, falling back to RELEASE_CHANNEL. Useful for "is this error new in this release?" cohorting.

Both apply to backend events via the SDK's super_properties and to browser events via posthog.register({...}) in the loaded callback, so filtering by environment = production in PostHog hides every event generated from a developer laptop, CI, or staging.
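
The resolution order described in this section can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, and treating the literal value `auto` the same as an unset POSTHOG_ENVIRONMENT is a guess based on the table's default column.

```python
from typing import Dict, Mapping, Optional


def resolve_environment(env: Mapping[str, str]) -> str:
    """Explicit override, then LOCAL_DEV_MODE, then channel, then deployment env."""
    explicit = env.get("POSTHOG_ENVIRONMENT", "").strip()
    # Assumption: "auto" means "no explicit override".
    if explicit and explicit.lower() != "auto":
        return explicit
    if env.get("LOCAL_DEV_MODE", "").strip() == "1":
        return "local"
    for name in ("RELEASE_CHANNEL", "AGNES_DEPLOYMENT_ENV"):
        value = env.get(name, "").strip()
        if value:
            return value
    return "unknown"


def resolve_release(env: Mapping[str, str]) -> Optional[str]:
    """AGNES_VERSION wins; RELEASE_CHANNEL is the fallback."""
    for name in ("AGNES_VERSION", "RELEASE_CHANNEL"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def super_properties(env: Mapping[str, str]) -> Dict[str, str]:
    """Properties attached to every event, backend and browser alike."""
    props = {"environment": resolve_environment(env)}
    release = resolve_release(env)
    if release is not None:
        props["release"] = release
    return props
```

Resolving both values once at startup and handing the same dictionary to the backend client and the browser snippet keeps the two sides from drifting apart.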

Privacy posture

  • The PostHog project key is publishable — it's safe in browser HTML. PostHog uses a separate personal API key for admin operations. This integration only ever exposes the project key. Treat the personal key like any other secret and never set it as POSTHOG_API_KEY.
  • Session replay defaults: maskAllInputs: true, plus a CSS-selector mask for known data-bearing classes (.data-cell, .query-result, .sql-output, plain <code> and <pre>, and any element marked data-sensitive). Add your own with POSTHOG_REPLAY_MASK_SELECTOR.
  • LLM payloads are off by default because the prompts and completions in this product include customer SQL, query results, and table samples. Token counts and latency are always sent (no payload contents in them).
  • person_profiles: 'identified_only' — anonymous visits do not create person records.
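
To make the four POSTHOG_IDENTIFY_PII modes concrete, here is a hedged sketch of how user attributes might be selected per mode. The function name and the shape of the `user` mapping are assumptions; the property names user_id / user_email / user_name and the rule that the default `email` mode never ships the name come from the change description for this integration.

```python
from typing import Any, Dict, Mapping, Optional


def user_props_for_event(
    user: Optional[Mapping[str, Any]], mode: str = "email"
) -> Dict[str, Any]:
    """Return the user attributes allowed on an event for the given mode."""
    if user is None or mode == "none":
        # Anonymous request, or identification switched off entirely.
        return {}
    props: Dict[str, Any] = {}
    if mode in {"id", "email", "full"} and user.get("id") is not None:
        props["user_id"] = user["id"]
    if mode in {"email", "full"} and user.get("email"):
        props["user_email"] = user["email"]
    if mode == "full" and user.get("name"):
        props["user_name"] = user["name"]
    # Any unrecognised mode falls through with no properties (assumption).
    return props
```

For example, with `user = {"id": 7, "email": "a@example.com", "name": "Ada"}`, the `id` mode yields only `user_id`, while `full` yields all three keys.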

Where the events come from

| Event | Code path |
| --- | --- |
| $exception (unhandled 500) | app/main.py:_unhandled_exception_handler |
| $exception (orchestrator rebuild) | src/orchestrator.py:_capture_orchestrator_exception |
| $exception (scheduler job) | services/scheduler/__main__.py:_call_api |
| $exception (CLI uncaught) | cli/main.py:main |
| $ai_generation | src/observability/llm_tracing.py:trace_generation, wrapped at connectors/llm/anthropic_provider.py:_attempt_extraction and connectors/llm/openai_compat.py |
| $pageview, $pageleave, JS errors | Browser snippet injected into every text/html response by app/middleware/posthog_inject.py |
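
The last row relies on rewriting HTML responses. A rough sketch of the decision logic follows, written as a pure function so it runs without a web framework. The 4 MB buffer cap, the double-injection guard, and the non-HTML passthrough are taken from the change description; the marker string, the choice to insert before `</head>`, and the function name are assumptions for illustration only.

```python
MAX_BUFFER_BYTES = 4 * 1024 * 1024  # cap from the change description
SNIPPET_MARKER = b"data-posthog-snippet"  # hypothetical marker attribute


def inject_snippet(body: bytes, content_type: str, snippet: bytes) -> bytes:
    """Return the body with the snippet inserted, or unchanged if skipped."""
    if not content_type.lower().startswith("text/html"):
        return body  # non-HTML responses pass through untouched
    if len(body) > MAX_BUFFER_BYTES:
        return body  # oversized bodies are not rewritten
    if SNIPPET_MARKER in body:
        return body  # already injected once; do not inject twice
    idx = body.lower().find(b"</head>")
    if idx == -1:
        # Assumption: pages without a closing head tag are left alone.
        return body
    return body[:idx] + snippet + body[idx:]
```

In a real middleware the rewritten body would also need a corrected Content-Length header, and any background task attached to the original response would have to be forwarded to the new one.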

CLI coverage

The da CLI (cli/main.py:main) catches every uncaught exception from a command, forwards it to PostHog with component=cli and the invoked command name, then flushes the client before re-raising for Typer's default error printer. Normal Typer / Click exits, SystemExit, and KeyboardInterrupt are intentionally skipped.
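
The capture-flush-re-raise pattern described above can be sketched as follows. This is not the actual cli/main.py: `ErrorSink`, `run_with_capture`, and the `command` property name are hypothetical, and the real wrapper additionally skips Typer / Click exit exceptions, which are omitted here to keep the example free of third-party imports.

```python
from typing import Callable, Dict, Protocol, Sequence


class ErrorSink(Protocol):
    """Minimal interface assumed for the observability client."""

    def capture_exception(self, exc: BaseException, properties: Dict[str, str]) -> None: ...

    def flush(self) -> None: ...


def run_with_capture(
    command: Callable[[], None], sink: ErrorSink, argv: Sequence[str]
) -> None:
    """Run a CLI command, reporting uncaught errors before re-raising them."""
    command_name = argv[1] if len(argv) > 1 else "<none>"
    try:
        command()
    except (SystemExit, KeyboardInterrupt):
        # Normal exits and Ctrl-C are not errors. Neither derives from
        # Exception, so this clause only makes the intent explicit.
        raise
    except Exception as exc:
        sink.capture_exception(exc, {"component": "cli", "command": command_name})
        # Flush before re-raising: a short-lived process may exit before a
        # background sender thread gets a chance to deliver the event.
        sink.flush()
        raise
```

The flush is the important design choice: without it, a command that fails within milliseconds could terminate before the event ever leaves the process.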

Operators must surface POSTHOG_API_KEY (and any other POSTHOG_* knob) into the shell that runs da — typically by sourcing the same .env the server uses, or by setting the variable in their shell profile. The CLI respects exactly the same env-var contract as the server.

LLM calls made by CLI commands (da query, da explore, etc.) flow through the provider wrappers in connectors/llm/ and therefore emit $ai_generation events via the same tracing path the server uses.
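
A tracing path of this kind is commonly built as a context manager around the provider call. The sketch below is an assumption-laden illustration, not the contents of llm_tracing.py: the `capture` callback, the `record` dictionary, and every property name other than $ai_input and $ai_output_choices are guesses, and payloads are simplified to plain strings. The 30000-character default and the truncation marker follow the change description.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator


def clip(text: str, max_chars: int = 30000) -> str:
    """Truncate oversized payloads and say so explicitly."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return f"{text[:max_chars]}…[truncated {dropped} chars]"


@contextmanager
def trace_generation(
    capture: Callable[[str, Dict[str, Any]], None],
    provider: str,
    model: str,
    include_payloads: bool = False,
    max_chars: int = 30000,
) -> Iterator[Dict[str, Any]]:
    """Emit one $ai_generation event per call, on success or failure."""
    record: Dict[str, Any] = {}
    started = time.monotonic()
    error = None
    try:
        yield record  # the caller fills in tokens and, optionally, payloads
    except Exception as exc:
        error = exc
        raise
    finally:
        props: Dict[str, Any] = {
            "$ai_provider": provider,
            "$ai_model": model,
            "$ai_latency": time.monotonic() - started,
            "$ai_input_tokens": record.get("input_tokens"),
            "$ai_output_tokens": record.get("output_tokens"),
        }
        if error is not None:
            props["$ai_is_error"] = True
            props["$ai_error"] = str(error)
        if include_payloads:
            # Bodies are opt-in because prompts may contain customer data.
            props["$ai_input"] = clip(str(record.get("input", "")), max_chars)
            props["$ai_output_choices"] = clip(str(record.get("output", "")), max_chars)
        capture("$ai_generation", props)
```

A provider wrapper would enter the context, make the API call, and write token counts into `record` before leaving; metadata is emitted whether or not payload capture is enabled.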

Testing the integration

Boot the app with the key set, load /, then provoke a 500 (for example via the debug-only /api/debug/throw route, which is available only when DEBUG=1 and returns 404 otherwise). One $exception event should arrive in PostHog within seconds, along with one $pageview per page load. Open Session replay and pick the session: every <input> should show as a masked rectangle.
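
For a scripted smoke test, a small helper along these lines may be convenient. It is a sketch only: the helper names and the localhost base URL are assumptions, while the /api/debug/throw path and its kind / msg query parameters come from the change description for this integration. The route resolves the current user first, so an unauthenticated request may be rejected before the exception is thrown.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode


def debug_throw_url(base_url: str, kind: str = "ValueError", msg: str = "smoke test") -> str:
    """Build the URL for the debug-only throw route (requires DEBUG=1)."""
    query = urlencode({"kind": kind, "msg": msg})
    return f"{base_url.rstrip('/')}/api/debug/throw?{query}"


def provoke_500(base_url: str) -> int:
    """Call the debug route and return the HTTP status code that came back."""
    try:
        with urllib.request.urlopen(debug_throw_url(base_url)) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # A 500 means the handler ran; a 404 suggests DEBUG is not enabled.
        return exc.code
```

Calling `provoke_500("http://localhost:8000")` against a running instance (the port is an assumption) should return 500 when the debug route is active, after which the event can be looked up in PostHog.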

The unit tests in tests/test_posthog_*.py cover the disabled and enabled configurations; tests/test_llm_tracing.py exercises the success and error variants of the LLM event.

Self-hosting note

PostHog is itself open source — operators with a self-hosted PostHog instance just point POSTHOG_HOST at their endpoint. No code changes required.