agnes-the-ai-analyst/tests/test_llm_tracing.py
Vojtech 107195730d
feat(observability): optional PostHog integration (#231)
* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)

Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.

Coverage:
  - FastAPI 500 handler captures unhandled exceptions
  - src/orchestrator.py rebuild + rebuild_source failures
  - services/scheduler/ HTTP-job failures
  - cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
    skipped; flushes before re-raise so short-lived CLI invocations don't
    drop events)
  - connectors/llm/anthropic_provider.py + openai_compat.py emit
    $ai_generation events with provider, model, latency, token counts
    (prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
    because LLM prompts here routinely include customer SQL/data)
  - Browser snippet injected into every text/html response by
    PosthogInjectionMiddleware — registered inside the GZip layer so it
    sees uncompressed HTML before compression. Many templates are
    standalone (their own DOCTYPE) and never extend base.html, so a
    per-template include would miss them.
  - Frontend: $pageview, $pageleave, JS error capture via window.error
    and unhandledrejection handlers, masked session replay
    (maskAllInputs: true plus CSS-selector mask for known data surfaces),
    feature flags (browser posthog.isFeatureEnabled + server-side
    feature_enabled with fallback for older SDKs).

Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).

Files:
  - src/observability/posthog_client.py — lazy singleton, no network
    when disabled, single-process flush on shutdown
  - src/observability/llm_tracing.py — trace_generation context manager
  - app/middleware/posthog_inject.py — HTML rewrite middleware
  - app/web/templates/_posthog.html — browser snippet template
  - docs/observability.md — operator guide
  - config/.env.template — documented POSTHOG_* knobs
  - tests/test_posthog_disabled.py + tests/test_posthog_client.py +
    tests/test_llm_tracing.py — 18 tests covering disabled state,
    identify-mode payloads, $ai_generation shape, error variant.

CHANGELOG entry under [Unreleased] Added.

* feat(observability): tag every PostHog event with environment + release

Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.

- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
  LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
  else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
  for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
  arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
  the loaded callback so $pageview, $exception, autocapture, etc. all
  carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
  can actually call posthog.identify(user_id, {email}) for logged-in
  users (previously the user block always resolved to None because
  nothing wrote to request.state.user).

4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.

* feat(observability): inline user attrs on every PostHog event + debug throw route

PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.

- Backend capture_exception merges user_id / user_email / user_name into
  the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
  Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
  properties via posthog.register({...}) so every $exception, $pageview,
  and custom event coming from posthog.captureException() carries them
  inline. Mirrors the backend so cross-referencing client/server events
  doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
  Runs Depends(get_current_user) first so request.state.user is set when
  the unhandled-exception handler captures the event. Lets operators
  exercise the full observability path end-to-end without hand-rolling
  a TestClient script. Configurable via ?kind=ValueError&msg=...

7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.

* fix(observability): address minasarustamyan PR #231 review

Two bugs caught in review.

1. PosthogInjectionMiddleware dropped Response.background on every
   return path. BaseHTTPMiddleware materialises the body and asks
   subclasses to return a fresh Response — three paths in dispatch()
   omitted background=, silently cancelling any BackgroundTask /
   BackgroundTasks the route attached (audit logging, async webhooks,
   email sends) with no log line. Fix: route every return through a
   _passthrough() helper that forwards background.

   Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
   can't balloon RSS during buffering. Bigger bodies short-circuit
   through with a warning rather than being injected.

   Regression tests in tests/test_posthog_inject_middleware.py exercise
   four return paths (snippet present, render-fail, double-injection
   guard, non-HTML passthrough) plus the streaming-guard short-circuit.

2. $ai_input / $ai_output_choices were emitted without truncation, so
   POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
   per-event ingest limit — exactly the calls (large prompts with
   schemas / sample rows / SQL) an operator would want to inspect.
   Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
   an explicit "…[truncated N chars]" marker so readers don't mistake
   truncated captures for complete ones. Metadata (provider, model,
   tokens, latency, error) flows regardless. Three new tests cover
   default-cap clipping, env-override, and pass-through under the cap.

37 PostHog tests pass.
2026-05-08 17:57:10 +04:00

192 lines
7.3 KiB
Python

"""LLM tracing emits well-formed $ai_generation events."""
from __future__ import annotations
from types import SimpleNamespace
from unittest.mock import MagicMock, patch
import pytest
@pytest.fixture
def enabled_posthog(monkeypatch):
monkeypatch.setenv("POSTHOG_API_KEY", "phc_x")
monkeypatch.delenv("POSTHOG_LLM_PAYLOADS", raising=False)
from src.observability import reset_posthog
reset_posthog()
yield
reset_posthog()
def test_success_emits_ai_generation_with_token_counts(enabled_posthog):
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
with trace_generation(provider="anthropic", model="claude-test", distinct_id="u-1") as t:
t.set_input("hello")
t.set_tokens(input_tokens=5, output_tokens=10)
# The wrapper calls sdk.capture exactly once.
sdk.capture.assert_called_once()
kwargs = sdk.capture.call_args.kwargs
assert kwargs["event"] == "$ai_generation"
assert kwargs["distinct_id"] == "u-1"
props = kwargs["properties"]
assert props["$ai_provider"] == "anthropic"
assert props["$ai_model"] == "claude-test"
assert props["$ai_input_tokens"] == 5
assert props["$ai_output_tokens"] == 10
assert "$ai_latency" in props
assert "$ai_trace_id" in props
# Payloads off by default — neither input nor output bodies leak.
assert "$ai_input" not in props
assert "$ai_output_choices" not in props
assert "$ai_is_error" not in props
def test_payloads_flag_enables_prompt_and_completion(enabled_posthog, monkeypatch):
monkeypatch.setenv("POSTHOG_LLM_PAYLOADS", "1")
from src.observability import reset_posthog
reset_posthog()
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
with trace_generation(provider="openai_compat", model="gpt-x") as t:
t.set_input("the prompt")
t.set_output("the completion")
t.set_tokens(input_tokens=1, output_tokens=2)
kwargs = sdk.capture.call_args.kwargs
props = kwargs["properties"]
assert props["$ai_input"] == "the prompt"
assert props["$ai_output_choices"] == "the completion"
def test_exception_emits_error_event_and_reraises(enabled_posthog):
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
with pytest.raises(RuntimeError, match="api down"):
with trace_generation(provider="anthropic", model="claude-test") as t:
t.set_input("x")
raise RuntimeError("api down")
sdk.capture.assert_called_once()
props = sdk.capture.call_args.kwargs["properties"]
assert props["$ai_is_error"] is True
assert "api down" in props["$ai_error"]
assert props["$ai_provider"] == "anthropic"
assert "$ai_latency" in props
def test_set_output_from_anthropic_extracts_tokens(enabled_posthog):
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
# Build a fake Anthropic response object.
block = SimpleNamespace(text="some output text")
response = SimpleNamespace(
usage=SimpleNamespace(input_tokens=11, output_tokens=22),
content=[block],
)
with trace_generation(provider="anthropic", model="claude-test") as t:
t.set_output_from_anthropic(response)
props = sdk.capture.call_args.kwargs["properties"]
assert props["$ai_input_tokens"] == 11
assert props["$ai_output_tokens"] == 22
def test_payload_truncation_under_default_cap(enabled_posthog, monkeypatch):
"""Oversized prompt/output gets clipped so PostHog doesn't drop the event.
Agnes ships LLM prompts containing sample rows / SQL that routinely
exceed PostHog's ~32 KB per-event ingest cap. Without truncation the
interesting events vanish silently. PR #231 review (minasarustamyan).
"""
monkeypatch.setenv("POSTHOG_LLM_PAYLOADS", "1")
from src.observability import reset_posthog
reset_posthog()
big_prompt = "P" * 50_000
big_output = "O" * 50_000
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
with trace_generation(provider="anthropic", model="claude-x") as t:
t.set_input(big_prompt)
t.set_output(big_output)
t.set_tokens(input_tokens=1, output_tokens=2)
props = sdk.capture.call_args.kwargs["properties"]
assert len(props["$ai_input"]) < len(big_prompt)
assert len(props["$ai_output_choices"]) < len(big_output)
# Truncation marker present so reader knows it was clipped.
assert "[truncated " in props["$ai_input"]
assert "[truncated " in props["$ai_output_choices"]
# Cap stays well under PostHog's ~32 KB per-event limit.
assert len(props["$ai_input"]) < 32_000
assert len(props["$ai_output_choices"]) < 32_000
def test_payload_truncation_respects_env_override(enabled_posthog, monkeypatch):
monkeypatch.setenv("POSTHOG_LLM_PAYLOADS", "1")
monkeypatch.setenv("POSTHOG_LLM_PAYLOAD_MAX_CHARS", "100")
from src.observability import reset_posthog
reset_posthog()
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
with trace_generation(provider="anthropic", model="claude-x") as t:
t.set_input("X" * 500)
t.set_output("Y" * 500)
props = sdk.capture.call_args.kwargs["properties"]
# Cap honored — first 100 chars then the marker.
assert props["$ai_input"].startswith("X" * 100)
assert props["$ai_input"].endswith("[truncated 400 chars]")
def test_payload_under_cap_is_passed_through_unchanged(enabled_posthog, monkeypatch):
monkeypatch.setenv("POSTHOG_LLM_PAYLOADS", "1")
from src.observability import reset_posthog
reset_posthog()
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
small = "tiny prompt"
with trace_generation(provider="anthropic", model="claude-x") as t:
t.set_input(small)
t.set_output(small)
props = sdk.capture.call_args.kwargs["properties"]
assert props["$ai_input"] == small
assert props["$ai_output_choices"] == small
assert "[truncated" not in props["$ai_input"]
def test_set_output_from_openai_extracts_tokens(enabled_posthog):
sdk = MagicMock()
with patch("posthog.Posthog", return_value=sdk):
from src.observability import trace_generation
msg = SimpleNamespace(content="hi")
choice = SimpleNamespace(message=msg)
response = SimpleNamespace(
usage=SimpleNamespace(prompt_tokens=3, completion_tokens=7),
choices=[choice],
)
with trace_generation(provider="openai_compat", model="gpt-x") as t:
t.set_output_from_openai(response)
props = sdk.capture.call_args.kwargs["properties"]
assert props["$ai_input_tokens"] == 3
assert props["$ai_output_tokens"] == 7