* feat(observability): optional PostHog integration (errors, LLM traces, replay, flags)
Off by default. Activates when POSTHOG_API_KEY is set in env. Defaults
to PostHog Cloud EU; override host for US Cloud or self-hosted.
Coverage:
- FastAPI 500 handler captures unhandled exceptions
- src/orchestrator.py rebuild + rebuild_source failures
- services/scheduler/ HTTP-job failures
- cli/main.py uncaught CLI errors (Typer.Exit/SystemExit/KeyboardInterrupt
skipped; flushes before re-raise so short-lived CLI invocations don't
drop events)
- connectors/llm/anthropic_provider.py + openai_compat.py emit
$ai_generation events with provider, model, latency, token counts
(prompt/completion bodies stay off unless POSTHOG_LLM_PAYLOADS=1
because LLM prompts here routinely include customer SQL/data)
- Browser snippet injected into every text/html response by
PosthogInjectionMiddleware — registered inside the GZip layer so it
sees uncompressed HTML before compression. Many templates are
standalone (their own DOCTYPE) and never extend base.html, so a
per-template include would miss them.
- Frontend: $pageview, $pageleave, JS error capture via window.error
and unhandledrejection handlers, masked session replay
(maskAllInputs: true plus CSS-selector mask for known data surfaces),
feature flags (browser posthog.isFeatureEnabled + server-side
feature_enabled with fallback for older SDKs).
Identification mode is operator-configurable: none / id / email / full.
The default mode, email, ships user.id + email but never name. The CLI entry
point moves from cli.main:app to cli.main:main (Typer wrapper).
Files:
- src/observability/posthog_client.py — lazy singleton, no network
when disabled, single-process flush on shutdown
- src/observability/llm_tracing.py — trace_generation context manager
- app/middleware/posthog_inject.py — HTML rewrite middleware
- app/web/templates/_posthog.html — browser snippet template
- docs/observability.md — operator guide
- config/.env.template — documented POSTHOG_* knobs
- tests/test_posthog_disabled.py + tests/test_posthog_client.py +
tests/test_llm_tracing.py — 18 tests covering disabled state,
identify-mode payloads, $ai_generation shape, error variant.
CHANGELOG entry under [Unreleased] Added.
* feat(observability): tag every PostHog event with environment + release
Splits PostHog dashboards cleanly between localhost / dev / staging /
production without manual tagging on every capture call.
- POSTHOG_ENVIRONMENT explicit override; auto-resolves to "local" when
LOCAL_DEV_MODE=1, else RELEASE_CHANNEL, else AGNES_DEPLOYMENT_ENV,
else "unknown".
- AGNES_VERSION → RELEASE_CHANNEL fallback feeds the `release` property
for "is this error new in this release?" cohorting.
- Backend gets both via the PostHog SDK's super_properties constructor
arg (every captured event picks them up automatically).
- Browser snippet calls posthog.register({environment, release}) inside
the loaded callback so $pageview, $exception, autocapture, etc. all
carry the same labels.
- request.state.user now populated by auth dependencies so the snippet
can actually call posthog.identify(user_id, {email}) for logged-in
users (previously the user block always resolved to None because
nothing wrote to request.state.user).
4 new tests cover env resolution: explicit > LOCAL_DEV_MODE > channel
> unknown, plus super-properties forwarding into the SDK constructor.
* feat(observability): inline user attrs on every PostHog event + debug throw route
PostHog's UI shows person properties on the Person profile page, not
inline on each event — so a reviewer triaging an exception couldn't tell
which user hit the bug without clicking through. Fix it on both sides.
- Backend capture_exception merges user_id / user_email / user_name into
the event properties (gated by POSTHOG_IDENTIFY_PII: none/id/email/full).
Backed by a new _user_props_for_event helper on PosthogClient.
- Browser snippet registers user_id + user_email + user_name as super-
properties via posthog.register({...}) so every $exception, $pageview,
and custom event coming from posthog.captureException() carries them
inline. Mirrors the backend so cross-referencing client/server events
doesn't require a person-profile lookup.
- /api/debug/throw — debug-only endpoint gated by DEBUG=1 (404 in prod).
Runs Depends(get_current_user) first so request.state.user is set when
the unhandled-exception handler captures the event. Lets operators
exercise the full observability path end-to-end without hand-rolling
a TestClient script. Configurable via ?kind=ValueError&msg=...
7 new tests cover: backend user-attr merge across identify modes,
anonymous request fall-through, browser snippet super-prop emission for
logged-in / anonymous / id-only / full-name cases.
* fix(observability): address minasarustamyan PR #231 review
Two bugs caught in review.
1. PosthogInjectionMiddleware dropped Response.background on every
return path. BaseHTTPMiddleware materialises the body and asks
subclasses to return a fresh Response — three paths in dispatch()
omitted background=, silently cancelling any BackgroundTask /
BackgroundTasks the route attached (audit logging, async webhooks,
email sends) with no log line. Fix: route every return through a
_passthrough() helper that forwards background.
Also adds a _MAX_BUFFER_BYTES (4 MB) cap so a streamed-HTML response
can't balloon RSS during buffering. Bigger bodies short-circuit
through with a warning rather than being injected.
Regression tests in tests/test_posthog_inject_middleware.py exercise
four return paths (snippet present, render-fail, double-injection
guard, non-HTML passthrough) plus the streaming-guard short-circuit.
2. $ai_input / $ai_output_choices were emitted without truncation, so
POSTHOG_LLM_PAYLOADS=1 silently dropped events past PostHog's ~32 KB
per-event ingest limit — exactly the calls (large prompts with
schemas / sample rows / SQL) an operator would want to inspect.
Fix: clip both at POSTHOG_LLM_PAYLOAD_MAX_CHARS (default 30000) with
an explicit "…[truncated N chars]" marker so readers don't mistake
truncated captures for complete ones. Metadata (provider, model,
tokens, latency, error) flows regardless. Three new tests cover
default-cap clipping, env-override, and pass-through under the cap.
37 PostHog tests pass.
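For reference, the clipping rule from item 2 can be sketched roughly as below. This is an illustration, not the code from the PR: clip_payload and DEFAULT_MAX_CHARS are hypothetical names. Only the env var, the 30000 default and the marker text come from the description above.

```python
import os
from typing import Optional

DEFAULT_MAX_CHARS = 30000  # default stated in the commit message


def clip_payload(text: str, max_chars: Optional[int] = None) -> str:
    """Clip text to max_chars and append an explicit truncation marker."""
    if max_chars is None:
        # Env override; the variable name is taken from the commit message.
        max_chars = int(os.environ.get("POSTHOG_LLM_PAYLOAD_MAX_CHARS", DEFAULT_MAX_CHARS))
    if len(text) <= max_chars:
        return text  # under the cap: pass through unchanged
    dropped = len(text) - max_chars
    return f"{text[:max_chars]}…[truncated {dropped} chars]"
```

Note that a character cap only approximates a byte-based ingest limit: multi-byte text and the marker itself add to the final size, which is presumably why the default sits below the limit.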
127 lines
6.2 KiB
Markdown
# Observability: PostHog integration

Optional integration that wires four signals into a single PostHog project:

1. **Backend exceptions**: every unhandled FastAPI exception, plus rebuild
   failures from `src/orchestrator.py` and HTTP-job failures from
   `services/scheduler/`.
2. **LLM tracing**: every Anthropic / OpenAI-compat call emits a
   `$ai_generation` event with provider, model, latency, and token counts.
3. **Frontend errors + pageviews**: `window.error` /
   `unhandledrejection` forwarded via `posthog.captureException`; automatic
   `$pageview` and `$pageleave`.
4. **Session replay (masked) + feature flags**: both gated behind the same
   single `POSTHOG_API_KEY`.

The integration ships **off by default**. Setting one environment variable
turns everything on.

## Enabling the integration

```bash
# Required: the only switch that controls on/off.
# Use a PROJECT key (publishable phc_…), never a personal API key.
POSTHOG_API_KEY=phc_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

That's the entire minimum. Defaults will:

- Send to `https://eu.i.posthog.com` (override with `POSTHOG_HOST`).
- Identify logged-in users by id + email (override with `POSTHOG_IDENTIFY_PII`).
- Record session replay with all inputs and known data surfaces masked
  (override with `POSTHOG_REPLAY=false` or
  `POSTHOG_REPLAY_MASK_SELECTOR=…`).
- Skip prompt / completion bodies in LLM events; emit token counts + latency
  only (override with `POSTHOG_LLM_PAYLOADS=1` if you accept the privacy
  trade-off: LLM prompts in this product routinely include customer SQL
  and data).

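To make the identification modes concrete, here is a minimal sketch of how `POSTHOG_IDENTIFY_PII` might gate which user attributes reach an event. The function name, the shape of the `user` mapping, and the exact mode-to-field mapping are assumptions for illustration; the property names `user_id`, `user_email`, and `user_name` follow the commit history.

```python
from typing import Dict, Mapping, Optional


def user_props_for_event(user: Optional[Mapping[str, str]], mode: str) -> Dict[str, str]:
    """Return the user attributes allowed by the identify mode (hypothetical helper)."""
    # Anonymous requests, "none", and unrecognised modes all send nothing.
    if user is None or mode not in ("id", "email", "full"):
        return {}
    props = {"user_id": user["id"]}
    if mode in ("email", "full") and user.get("email"):
        props["user_email"] = user["email"]
    if mode == "full" and user.get("name"):
        props["user_name"] = user["name"]  # only "full" ever ships the name
    return props
```

Treating unknown modes the same as `none` is a conservative choice in this sketch: a typo in the variable then sends less data rather than more.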
## All knobs

| Variable | Default | Notes |
|---|---|---|
| `POSTHOG_API_KEY` | unset | **The on/off switch.** Unset = integration is fully off. Project key only. |
| `POSTHOG_HOST` | `https://eu.i.posthog.com` | Full URL. Use `https://us.i.posthog.com` for the US region or your own host. |
| `POSTHOG_IDENTIFY_PII` | `email` | `none` / `id` / `email` / `full`. |
| `POSTHOG_REPLAY` | `true` | Disable replay only, keeping errors / events / flags. |
| `POSTHOG_REPLAY_MASK_SELECTOR` | empty | CSS selector appended to the default mask list. |
| `POSTHOG_LLM_PAYLOADS` | `0` | `1` adds `$ai_input` + `$ai_output_choices` to LLM events. Off by default. |
| `POSTHOG_ENVIRONMENT` | auto | Tagged on every event as the `environment` super-property. Auto-resolves to `local` when `LOCAL_DEV_MODE=1`, else `RELEASE_CHANNEL`, else `AGNES_DEPLOYMENT_ENV`, else `unknown`. |

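As a rough illustration of how these knobs could be read in one place, the sketch below loads them into a frozen dataclass. `PosthogSettings`, `load_settings`, the set of accepted truthy strings, and the fallback for an unrecognised identify mode are all assumptions, not the project's actual client code; the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings of "on"
_IDENTIFY_MODES = {"none", "id", "email", "full"}


@dataclass(frozen=True)
class PosthogSettings:
    api_key: Optional[str]
    host: str
    identify_pii: str
    replay: bool
    replay_mask_selector: str
    llm_payloads: bool

    @property
    def enabled(self) -> bool:
        # The API key is the only on/off switch.
        return bool(self.api_key)


def load_settings(env: Mapping[str, str] = os.environ) -> PosthogSettings:
    mode = env.get("POSTHOG_IDENTIFY_PII", "email").strip().lower()
    if mode not in _IDENTIFY_MODES:
        mode = "email"  # assumption: unknown values fall back to the documented default
    return PosthogSettings(
        api_key=env.get("POSTHOG_API_KEY") or None,
        host=env.get("POSTHOG_HOST", "https://eu.i.posthog.com").rstrip("/"),
        identify_pii=mode,
        replay=env.get("POSTHOG_REPLAY", "true").strip().lower() in _TRUTHY,
        replay_mask_selector=env.get("POSTHOG_REPLAY_MASK_SELECTOR", ""),
        llm_payloads=env.get("POSTHOG_LLM_PAYLOADS", "0").strip().lower() in _TRUTHY,
    )
```

Passing the environment as a mapping keeps the loader easy to test: `load_settings({})` describes the fully disabled state without touching the real process environment.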
## Splitting traffic by environment

Every captured event (backend exceptions, `$ai_generation`, browser
`$pageview`, JS errors, custom events) is tagged with two super
properties so PostHog dashboards can slice cleanly:

- `environment`: resolved at startup (see table above). Operators
  typically set this to `local`, `staging`, or `production` explicitly,
  or rely on the auto-resolver.
- `release`: the running `AGNES_VERSION`, falling back to
  `RELEASE_CHANNEL`. Useful for "is this error new in this release?"
  cohorting.

Both apply to backend events via the SDK's `super_properties` and to
browser events via `posthog.register({...})` in the loaded callback, so
filtering by `environment = production` in PostHog hides every event
generated from a developer laptop, CI, or staging.

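The resolution order can be sketched as a pair of small functions. The function names are hypothetical, and returning `None` for `release` when neither variable is set is an assumption; the precedence itself follows the description in this guide.

```python
from typing import Dict, Mapping, Optional


def resolve_environment(env: Mapping[str, str]) -> str:
    """explicit override > LOCAL_DEV_MODE > RELEASE_CHANNEL > AGNES_DEPLOYMENT_ENV > unknown."""
    explicit = env.get("POSTHOG_ENVIRONMENT")
    if explicit:
        return explicit
    if env.get("LOCAL_DEV_MODE") == "1":
        return "local"
    return env.get("RELEASE_CHANNEL") or env.get("AGNES_DEPLOYMENT_ENV") or "unknown"


def resolve_release(env: Mapping[str, str]) -> Optional[str]:
    """AGNES_VERSION, falling back to RELEASE_CHANNEL (None if neither is set)."""
    return env.get("AGNES_VERSION") or env.get("RELEASE_CHANNEL") or None


def super_properties(env: Mapping[str, str]) -> Dict[str, str]:
    """Properties attached to every event; release is omitted when unknown."""
    props = {"environment": resolve_environment(env)}
    release = resolve_release(env)
    if release:
        props["release"] = release
    return props
```

For example, `super_properties({"LOCAL_DEV_MODE": "1", "RELEASE_CHANNEL": "staging"})` tags events as `local`, because the local-dev flag outranks the release channel, while `release` still falls back to `staging`.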
## Privacy posture

- The PostHog **project key** is publishable, so it's safe in browser HTML.
  PostHog uses a separate **personal API key** for admin operations. This
  integration only ever exposes the project key. Treat the personal key like
  any other secret and never set it as `POSTHOG_API_KEY`.
- Session replay defaults: `maskAllInputs: true`, plus a CSS-selector mask
  for known data-bearing classes (`.data-cell`, `.query-result`,
  `.sql-output`, plain `<code>` and `<pre>`, and any element marked
  `data-sensitive`). Add your own with `POSTHOG_REPLAY_MASK_SELECTOR`.
- LLM payloads are **off by default** because the prompts and completions
  in this product include customer SQL, query results, and table samples.
  Token counts and latency are always sent (no payload contents in them).
- `person_profiles: 'identified_only'`: anonymous visits do not create
  person records.

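One way to picture how `POSTHOG_REPLAY_MASK_SELECTOR` extends the default mask list is the sketch below. `build_mask_selector` is a hypothetical helper, and writing "any element marked `data-sensitive`" as the attribute selector `[data-sensitive]` is an assumption about how that marker is expressed.

```python
# Default mask list from the bullet above; "[data-sensitive]" is an assumed spelling.
DEFAULT_MASK_SELECTORS = (
    ".data-cell",
    ".query-result",
    ".sql-output",
    "code",
    "pre",
    "[data-sensitive]",
)


def build_mask_selector(extra: str = "") -> str:
    """Append operator-supplied selectors to the defaults, skipping blanks and duplicates."""
    parts = list(DEFAULT_MASK_SELECTORS)
    # Naive comma split: fine for simple selectors, not for ones such as :is(a, b).
    for piece in extra.split(","):
        piece = piece.strip()
        if piece and piece not in parts:
            parts.append(piece)
    return ", ".join(parts)
```

The key property is that the operator's selector is additive: setting the variable can widen the mask but never removes a default entry.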
## Where the events come from

| Event | Code path |
|---|---|
| `$exception` (unhandled 500) | `app/main.py:_unhandled_exception_handler` |
| `$exception` (orchestrator rebuild) | `src/orchestrator.py:_capture_orchestrator_exception` |
| `$exception` (scheduler job) | `services/scheduler/__main__.py:_call_api` |
| `$exception` (CLI uncaught) | `cli/main.py:main` |
| `$ai_generation` | `src/observability/llm_tracing.py:trace_generation` wrapped at `connectors/llm/anthropic_provider.py:_attempt_extraction` and `connectors/llm/openai_compat.py` |
| `$pageview`, `$pageleave`, JS errors | injected into every `text/html` response by `app/middleware/posthog_inject.py` |

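To show the general shape of a tracing context manager like `trace_generation`, here is a minimal sketch. It is not the code in `src/observability/llm_tracing.py`: the signature, the injected `capture` callable, and every property name other than the `$ai_generation` event name are assumptions modelled on PostHog's `$ai_*` naming.

```python
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator


@contextmanager
def trace_generation(
    capture: Callable[[str, Dict[str, Any]], None],
    provider: str,
    model: str,
) -> Iterator[Dict[str, Any]]:
    """Time an LLM call and emit one $ai_generation event, on success or failure."""
    props: Dict[str, Any] = {"$ai_provider": provider, "$ai_model": model}
    started = time.perf_counter()
    try:
        yield props  # the caller adds token counts once the response arrives
    except Exception as exc:
        props["$ai_is_error"] = True
        props["$ai_error"] = repr(exc)
        raise  # the provider's error still propagates to the caller
    finally:
        props["$ai_latency"] = time.perf_counter() - started
        capture("$ai_generation", props)
```

A provider wrapper would use it as `with trace_generation(capture, "anthropic", model) as props:` and fill in the token counts inside the block. Emitting from `finally` means metadata flows even when the call raises.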
## CLI coverage

The `da` CLI (`cli/main.py:main`) catches every uncaught exception from a
command, forwards it to PostHog with `component=cli` and the invoked
command name, then flushes the client before re-raising for Typer's
default error printer. Normal Typer / Click exits, `SystemExit`, and
`KeyboardInterrupt` are intentionally skipped.

Operators must surface `POSTHOG_API_KEY` (and any other `POSTHOG_*` knob)
into the shell that runs `da`, typically by sourcing the same `.env` the
server uses, or by setting the variable in their shell profile. The CLI
respects exactly the same env-var contract as the server.

LLM calls made by CLI commands (`da query`, `da explore`, etc.) flow
through the provider wrappers in `connectors/llm/` and therefore emit
`$ai_generation` events via the same tracing path the server uses.

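The capture-flush-re-raise pattern described at the top of this section can be sketched without any CLI framework. `run_with_capture`, the `capture` and `flush` callables, the `command` property key, and the `ExitRequested` stand-in for a framework's normal-exit exception are all hypothetical.

```python
from typing import Any, Callable, Dict, Tuple, Type


class ExitRequested(RuntimeError):
    """Hypothetical stand-in for a CLI framework's normal-exit exception."""


def run_with_capture(
    command: Callable[[], None],
    command_name: str,
    capture: Callable[[BaseException, Dict[str, Any]], None],
    flush: Callable[[], None],
    skip: Tuple[Type[BaseException], ...] = (ExitRequested,),
) -> None:
    try:
        command()
    except skip:
        raise  # normal exits are not errors; never report them
    except Exception as exc:
        # SystemExit and KeyboardInterrupt derive from BaseException,
        # so they bypass this branch without being listed in `skip`.
        capture(exc, {"component": "cli", "command": command_name})
        flush()  # a short-lived process must flush before it dies
        raise
```

Flushing before the re-raise matters because the process usually exits moments later, and an event still sitting in an in-memory queue at that point would be lost.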
## Testing the integration

Boot the app with the key set, hit `/`, then provoke a 500 (e.g. via the
debug-only `/api/debug/throw` route, which is available when `DEBUG=1`). One
**Errors** event should arrive within seconds along with one `$pageview` per
page load. Open **Session replay** and pick the session: every `<input>`
should show as a masked rectangle.

The unit tests in `tests/test_posthog_*.py` cover the disabled and enabled
configurations; `tests/test_llm_tracing.py` exercises the success and error
variants of the LLM event.

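The commit history describes a debug route, `/api/debug/throw`, that is gated by `DEBUG=1` and configurable via `?kind=ValueError&msg=...`. A minimal sketch of how such query parameters might be turned into an exception safely is shown below; `build_exception` is a hypothetical helper, and the fallback to `RuntimeError` is an assumption.

```python
import builtins


def build_exception(kind: str = "RuntimeError", msg: str = "debug throw") -> Exception:
    """Map a requested exception name to a built-in Exception instance."""
    candidate = getattr(builtins, kind, None)
    # Only accept real Exception subclasses: no functions, no SystemExit.
    if isinstance(candidate, type) and issubclass(candidate, Exception):
        try:
            return candidate(msg)
        except TypeError:
            pass  # some built-ins need extra constructor arguments
    return RuntimeError(f"{kind}: {msg}")
```

A route handler would then simply `raise build_exception(kind, msg)`. Restricting the lookup to built-in `Exception` subclasses keeps the endpoint from instantiating arbitrary names taken from the query string.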
## Self-hosting note

PostHog is itself open source, so operators with a self-hosted PostHog instance
just point `POSTHOG_HOST` at their endpoint. No code changes required.