* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)
Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.
Coverage:
- FastAPI 500 handler captures unhandled exceptions
- src/orchestrator.py rebuild + rebuild_source failures
- services/scheduler/ HTTP-job failures
- cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
skipped; flushes before re-raise so short-lived CLI invocations don't
drop events)
- connectors/llm/anthropic_provider.py + openai_compat.py emit
$ai_generation events with provider, model, latency, token counts
(prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
because LLM prompts here routinely include customer SQL/data)
- Browser snippet injected into every text/html response by
PosthogInjectionMiddleware — registered inside the GZip layer so it
sees uncompressed HTML before compression. Many templates are
standalone (their own DOCTYPE) and never extend base.html, so a
per-template include would miss them.
- Frontend: $pageview, $pageleave, JS error capture via window.error
and unhandledrejection handlers, masked session replay
(maskAllInputs: true plus CSS-selector mask for known data surfaces),
feature flags (browser posthog.isFeatureEnabled + server-side
feature_enabled with fallback for older SDKs).
Identification mode operator-configurable: none / id / email / full.
Default email ships user.id + email but never name. CLI entry point
moves from cli.main:app to cli.main:main (Typer wrapper).
Files:
- src/observability/posthog_client.py — lazy singleton, no network
when disabled, single-process flush on shutdown
- src/observability/llm_tracing.py — trace_generation context manager
- app/middleware/posthog_inject.py — HTML rewrite middleware
- app/web/templates/_posthog.html — browser snippet template
- docs/observability.md — operator guide
- config/.env.template — documented POSTHOG_* knobs
- tests/test_posthog_disabled.py + tests/test_posthog_client.py +
tests/test_llm_tracing.py — 18 tests covering disabled state,
identify-mode payloads, $ai_generation shape, error variant.
CHANGELOG entry under [Unreleased] Added.
* feat(observability): tag every PostHog event with environment + release
Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.
- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
the loaded callback so $pageview, $exception, autocapture, etc. all
carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
can actually call posthog.identify(user_id, {email}) for logged-in
users (previously the user block always resolved to None because
nothing wrote to request.state.user).
4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.
* feat(observability): inline user attrs on every PostHog event + debug throw route
PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.
- Backend capture_exception merges user_id / user_email / user_name into
the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
properties via posthog.register({...}) so every $exception, $pageview,
and custom event coming from posthog.captureException() carries them
inline. Mirrors the backend so cross-referencing client/server events
doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
Runs Depends(get_current_user) first so request.state.user is set when
the unhandled-exception handler captures the event. Lets operators
exercise the full observability path end-to-end without hand-rolling
a TestClient script. Configurable via ?kind=ValueError&msg=...
7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.
* fix(observability): address minasarustamyan PR #231 review
Two bugs caught in review.
1. PosthogInjectionMiddleware dropped Response.background on every
return path. BaseHTTPMiddleware materialises the body and asks
subclasses to return a fresh Response — three paths in dispatch()
omitted background=, silently cancelling any BackgroundTask /
BackgroundTasks the route attached (audit logging, async webhooks,
email sends) with no log line. Fix: route every return through a
_passthrough() helper that forwards background.
Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
can't balloon RSS during buffering. Bigger bodies short-circuit
through with a warning rather than being injected.
Regression tests in tests/test_posthog_inject_middleware.py exercise
four return paths (snippet present, render-fail, double-injection
guard, non-HTML passthrough) plus the streaming-guard short-circuit.
2. $ai_input / $ai_output_choices were emitted without truncation, so
POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
per-event ingest limit — exactly the calls (large prompts with
schemas / sample rows / SQL) an operator would want to inspect.
Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
an explicit "…[truncated N chars]" marker so readers don't mistake
truncated captures for complete ones. Metadata (provider, model,
tokens, latency, error) flows regardless. Three new tests cover
default-cap clipping, env-override, and pass-through under the cap.
37 PostHog tests pass.
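The payload clipping from fix 2 can be pictured with a minimal sketch. The function name `clip_payload` is hypothetical, and two details are assumptions rather than confirmed behaviour: that N in the marker counts the characters removed, and that the marker is appended on top of the cap instead of being counted against it.

```python
import os
from typing import Optional

DEFAULT_MAX_CHARS = 30000  # default cap named in the commit message


def clip_payload(text: str, max_chars: Optional[int] = None) -> str:
    """Clip text to max_chars and append an explicit truncation marker."""
    if max_chars is None:
        # Env override, read at call time so tests can change it.
        max_chars = int(
            os.environ.get("POSTHOG_LLM_PAYLOAD_MAX_CHARS", DEFAULT_MAX_CHARS)
        )
    if len(text) <= max_chars:
        return text  # under the cap: pass through unchanged
    dropped = len(text) - max_chars
    # Assumption: N is the number of characters removed, and the marker
    # is added after the kept prefix rather than counted against the cap.
    return f"{text[:max_chars]}…[truncated {dropped} chars]"
```

Both `$ai_input` and `$ai_output_choices` would pass through a helper like this before capture, while the metadata fields stay untouched.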
Observability — PostHog integration
Optional integration that wires four signals into a single PostHog project:
- Backend exceptions: every unhandled FastAPI exception, plus rebuild failures from src/orchestrator.py and HTTP-job failures from services/scheduler/.
- LLM tracing: every Anthropic / OpenAI-compat call emits a $ai_generation event with provider, model, latency, and token counts.
- Frontend errors + pageviews: window.error / unhandledrejection forwarded via posthog.captureException; automatic $pageview and $pageleave.
- Session replay (masked) + feature flags: both gated behind the same single POSTHOG_API_KEY.
The integration ships off by default. Setting one environment variable turns everything on.
Enabling the integration
# Required — the only switch that controls on/off.
# Use a PROJECT key (publishable phc_…), never a personal API key.
POSTHOG_API_KEY=phc_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
That's the entire minimum. Defaults will:
- Send to https://eu.i.posthog.com (override with POSTHOG_HOST).
- Identify logged-in users by id + email (override with POSTHOG_IDENTIFY_PII).
- Record session replay with all inputs and known data surfaces masked (override with POSTHOG_REPLAY=false or POSTHOG_REPLAY_MASK_SELECTOR=…).
- Skip prompt / completion bodies in LLM events; emit token counts + latency only (override with POSTHOG_LLM_PAYLOADS=1 if you accept the privacy trade-off: LLM prompts in this product routinely include customer SQL and data).
All knobs
| Variable | Default | Notes |
|---|---|---|
| POSTHOG_API_KEY | unset | The on/off switch. Unset = integration is fully off. Project key only. |
| POSTHOG_HOST | https://eu.i.posthog.com | Full URL. Use https://us.i.posthog.com for the US region or your own host. |
| POSTHOG_IDENTIFY_PII | email | none / id / email / full. |
| POSTHOG_REPLAY | true | Disable replay only, keeping errors / events / flags. |
| POSTHOG_REPLAY_MASK_SELECTOR | empty | CSS selector appended to the default mask list. |
| POSTHOG_LLM_PAYLOADS | 0 | 1 adds $ai_input + $ai_output_choices to LLM events. Off by default. |
| POSTHOG_ENVIRONMENT | auto | Tagged on every event as the environment super-property. Auto-resolves to local when LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV, else unknown. |
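As a rough illustration of how the knobs above might be read, the sketch below loads them into one settings object. `PosthogSettings`, `load_settings`, and `_truthy` are hypothetical names, not taken from the codebase. The accepted boolean spellings and the fallback to `email` for an unrecognised identify mode are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PosthogSettings:
    api_key: Optional[str]
    host: str
    identify_pii: str
    replay: bool
    replay_mask_selector: str
    llm_payloads: bool

    @property
    def enabled(self) -> bool:
        # The API key is the only on/off switch.
        return bool(self.api_key)


def _truthy(value: str) -> bool:
    # Assumption: these spellings count as "on"; anything else is "off".
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> PosthogSettings:
    mode = env.get("POSTHOG_IDENTIFY_PII", "email").strip().lower()
    if mode not in {"none", "id", "email", "full"}:
        mode = "email"  # assumption: unknown values fall back to the default
    return PosthogSettings(
        api_key=env.get("POSTHOG_API_KEY") or None,
        host=env.get("POSTHOG_HOST") or "https://eu.i.posthog.com",
        identify_pii=mode,
        replay=_truthy(env.get("POSTHOG_REPLAY", "true")),
        replay_mask_selector=env.get("POSTHOG_REPLAY_MASK_SELECTOR", ""),
        llm_payloads=_truthy(env.get("POSTHOG_LLM_PAYLOADS", "0")),
    )
```

Passing the environment as a mapping keeps the loader easy to test without touching the real process environment.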
Splitting traffic by environment
Every captured event — backend exceptions, $ai_generation, browser
$pageview, JS errors, custom events — is tagged with two super
properties so PostHog dashboards can slice cleanly:
- environment: resolved at startup (see table above). Operators typically set this to local, staging, or production explicitly, or rely on the auto-resolver.
- release: the running AGNES_VERSION, falling back to RELEASE_CHANNEL. Useful for "is this error new in this release?" cohorting.
Both apply to backend events via the SDK's super_properties and to
browser events via posthog.register({...}) in the loaded callback, so
filtering by environment = production in PostHog hides every event
generated from a developer laptop, CI, or staging.
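The resolution order described in this section can be sketched as a small pure function. The function names are hypothetical. Treating an empty variable as unset, and omitting `release` when neither source variable is present, are assumptions.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_environment(env: Mapping[str, str] = os.environ) -> str:
    """explicit override > LOCAL_DEV_MODE > RELEASE_CHANNEL > AGNES_DEPLOYMENT_ENV > unknown."""
    explicit = env.get("POSTHOG_ENVIRONMENT", "").strip()
    if explicit:
        return explicit
    if env.get("LOCAL_DEV_MODE", "").strip() == "1":
        return "local"
    for name in ("RELEASE_CHANNEL", "AGNES_DEPLOYMENT_ENV"):
        value = env.get(name, "").strip()
        if value:
            return value
    return "unknown"


def resolve_release(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """AGNES_VERSION, falling back to RELEASE_CHANNEL."""
    for name in ("AGNES_VERSION", "RELEASE_CHANNEL"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None  # assumption: no release property when neither is set


def super_properties(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    props = {"environment": resolve_environment(env)}
    release = resolve_release(env)
    if release is not None:
        props["release"] = release
    return props
```

The resulting dictionary is the kind of value that would be handed to the backend SDK as super-properties and mirrored in the browser through posthog.register({...}).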
Privacy posture
- The PostHog project key is publishable, so it is safe in browser HTML. PostHog uses a separate personal API key for admin operations. This integration only ever exposes the project key. Treat the personal key like any other secret and never set it as POSTHOG_API_KEY.
- Session replay defaults: maskAllInputs: true, plus a CSS-selector mask for known data-bearing classes (.data-cell, .query-result, .sql-output, plain <code> and <pre>, and any element marked data-sensitive). Add your own with POSTHOG_REPLAY_MASK_SELECTOR.
- LLM payloads are off by default because the prompts and completions in this product include customer SQL, query results, and table samples. Token counts and latency are always sent (no payload contents in them).
- person_profiles: 'identified_only' is set, so anonymous visits do not create person records.
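To make the POSTHOG_IDENTIFY_PII modes concrete, here is a minimal sketch of how a mode could gate which user attributes reach an event. The standalone function and the dict-shaped user are assumptions for illustration. The real helper lives on PosthogClient and may differ.

```python
from typing import Any, Dict, Mapping, Optional


def user_props_for_event(
    mode: str, user: Optional[Mapping[str, Any]]
) -> Dict[str, Any]:
    """Return the user attributes allowed by the identify mode."""
    # Anonymous requests, mode "none", and unrecognised modes send nothing.
    if user is None or mode not in {"id", "email", "full"}:
        return {}
    props: Dict[str, Any] = {"user_id": user.get("id")}
    if mode in {"email", "full"}:
        props["user_email"] = user.get("email")
    if mode == "full":
        props["user_name"] = user.get("name")
    # Drop attributes the user record does not carry.
    return {key: value for key, value in props.items() if value is not None}
```

Under the default mode, email, an event would carry user_id and user_email but never user_name.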
Where the events come from
| Event | Code path |
|---|---|
| $exception (unhandled 500) | app/main.py:_unhandled_exception_handler |
| $exception (orchestrator rebuild) | src/orchestrator.py:_capture_orchestrator_exception |
| $exception (scheduler job) | services/scheduler/__main__.py:_call_api |
| $exception (CLI uncaught) | cli/main.py:main |
| $ai_generation | src/observability/llm_tracing.py:trace_generation wrapped at connectors/llm/anthropic_provider.py:_attempt_extraction and connectors/llm/openai_compat.py |
| $pageview, $pageleave, JS errors | injected into every text/html response by app/middleware/posthog_inject.py |
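The HTML rewrite behind the last row can be pictured as a pure function over the buffered body. This is a sketch under stated assumptions, not the middleware's code: the marker string, the choice to insert just before the closing head tag, and the function name are all hypothetical. The 4 MB cap and the double-injection guard follow the review fix described in the commit log.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

MAX_BUFFER_BYTES = 4 * 1024 * 1024  # 4 MB cap from the review fix
MARKER = b"data-posthog-snippet"  # hypothetical double-injection marker


def inject_snippet(body: bytes, content_type: Optional[str], snippet: bytes) -> bytes:
    """Insert the snippet before </head>, or return the body unchanged."""
    if not content_type or not content_type.lower().startswith("text/html"):
        return body  # non-HTML passthrough
    if len(body) > MAX_BUFFER_BYTES:
        logger.warning("HTML body over %d bytes; skipping injection", MAX_BUFFER_BYTES)
        return body
    if MARKER in body:
        return body  # already injected once
    # bytes.lower() preserves length, so the index is valid for body too.
    idx = body.lower().find(b"</head>")
    if idx == -1:
        return body  # assumption: leave documents without a head untouched
    return body[:idx] + snippet + body[idx:]
```

A real middleware would wrap this in a response object and, as the review fix notes, forward the original response's background task on every return path.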
CLI coverage
The da CLI (cli/main.py:main) catches every uncaught exception from a
command, forwards it to PostHog with component=cli and the invoked
command name, then flushes the client before re-raising for Typer's
default error printer. Normal Typer / Click exits, SystemExit, and
KeyboardInterrupt are intentionally skipped.
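The capture, flush, re-raise sequence described above can be sketched with the standard library alone. `ErrorSink` and `run_cli` are hypothetical names, and the sink's method signatures are assumptions. Typer / Click exits are not modelled separately here, since in Click's default standalone mode they typically surface as SystemExit.

```python
import sys
from typing import Any, Callable, Dict, List, Optional, Protocol


class ErrorSink(Protocol):
    """Hypothetical stand-in for the PostHog client wrapper."""

    def capture_exception(self, exc: BaseException, properties: Dict[str, Any]) -> None: ...

    def flush(self) -> None: ...


def run_cli(
    app: Callable[[], Any], sink: ErrorSink, argv: Optional[List[str]] = None
) -> Any:
    argv = sys.argv[1:] if argv is None else argv
    command = argv[0] if argv else "<none>"
    try:
        return app()
    except (SystemExit, KeyboardInterrupt):
        raise  # normal exits and Ctrl-C are intentionally not reported
    except Exception as exc:
        sink.capture_exception(exc, {"component": "cli", "command": command})
        # Flush before re-raising so a short-lived process does not
        # exit with the event still queued.
        sink.flush()
        raise
```

Flushing inside the handler is the key design choice: a CLI process may exit within milliseconds of the error, well before a background sender would run.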
Operators must surface POSTHOG_API_KEY (and any other POSTHOG_* knob)
into the shell that runs da — typically by sourcing the same .env the
server uses, or by setting the variable in their shell profile. The CLI
respects exactly the same env-var contract as the server.
LLM calls made by CLI commands (da query, da explore, etc.) flow
through the provider wrappers in connectors/llm/ and therefore emit
$ai_generation events via the same tracing path the server uses.
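A context manager of roughly the following shape would produce the $ai_generation event for both the success and error variants. The signature is hypothetical and is not the one in src/observability/llm_tracing.py. The `capture` callback stands in for the client, and the property names are assumptions modelled on PostHog's `$ai_*` naming.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator


@contextmanager
def trace_generation(
    capture: Callable[[str, Dict[str, Any]], None],
    provider: str,
    model: str,
) -> Iterator[Dict[str, Any]]:
    """Time one LLM call and emit a single $ai_generation event."""
    props: Dict[str, Any] = {"$ai_provider": provider, "$ai_model": model}
    start = time.perf_counter()
    try:
        # The caller adds token counts to the yielded dict after the call.
        yield props
    except Exception as exc:
        props["$ai_is_error"] = True
        props["$ai_error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Runs on both paths, so failed calls are still recorded.
        props["$ai_latency"] = time.perf_counter() - start
        capture("$ai_generation", props)
```

A provider wrapper would use it as `with trace_generation(capture, "anthropic", model) as span:` and set `span["$ai_input_tokens"]` and `span["$ai_output_tokens"]` once the response arrives.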
Testing the integration
Boot the app with the key set, hit /, then provoke a 500 (e.g. via the
debug-only /api/debug/throw route, which requires DEBUG=1). One Errors
event should arrive within seconds along with one $pageview per page
load. Open Session replay and pick the session: every <input> should
show as a masked rectangle.
The unit tests in tests/test_posthog_*.py cover the disabled and enabled
configurations; tests/test_llm_tracing.py exercises the success and error
variants of the LLM event.
Self-hosting note
PostHog is itself open source — operators with a self-hosted PostHog instance
just point POSTHOG_HOST at their endpoint. No code changes required.