agnes-the-ai-analyst/app/api/health.py
ZdenekSrotyr 506a378c3a
release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217)
## Summary

Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):

- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`** — server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Force the SDK path; reject `incremental + where_filters` combination at API layer (changedSince already filters temporally).

## Architecture

- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.

## Test plan

- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)

## Bugs caught + fixed during E2E

The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:

1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field** — pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so admin couldn't switch a partitioned row back to full_refresh and have stale `partition_by` clear. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.

## Operator notes

- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time — a typo'd `{{lasst_week}}` is accepted at register and surfaces only when the next sync runs. By design (rolling windows need late-binding).

## Spec source

The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/217" target="_blank">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
    <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review">
  </picture>
</a>
<!-- devin-review-badge-end -->
2026-05-07 19:01:27 +02:00

374 lines
15 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

"""Health check endpoint — structured diagnostics for AI agents.
## Severity vocabulary
Per-check `status` values, in order of escalation:
- `ok` — nothing to surface.
- `info` — non-trivial observation worth showing the operator, but the
situation isn't broken. **Does not** promote the overall
status to `degraded` (issue #178).
- `unknown`— check couldn't run (missing dependency, FS error). Surfaced
but doesn't promote overall.
- `warning`— real issue, operator should look. Promotes overall to
`degraded`.
- `error` — critical. Promotes overall to `unhealthy`.
Add an `info`-tier check by returning `{"status": "info", ...}` from the
check function. The aggregator at the bottom of `health_check_detailed`
treats `info` as non-promoting.
"""
import os
from datetime import datetime, timezone
from pathlib import Path
from fastapi import APIRouter, Depends, Query
import duckdb
from app.auth.dependencies import _get_db, get_current_user
from src.db import SCHEMA_VERSION, get_system_db
from src.repositories.sync_state import SyncStateRepository
router = APIRouter(tags=["health"])
# Captured at module import (i.e., app process start) — proxy for "deployed at".
# When the cron auto-upgrade pulls a new digest and recreates the container,
# this resets. Accurate enough for a UI "last updated" badge.
_DEPLOYED_AT = datetime.now(timezone.utc).isoformat()
def _check_bq_billing_project() -> dict | None:
"""Surface the USER_PROJECT_DENIED footgun when a BQ instance has
`billing_project` falling back to (or explicitly equal to) `project`.
Background: connectors/bigquery/access.py:339-342 lets `billing` default
to `data` when `billing_project` is unset. A service account with
`roles/bigquery.dataViewer` on the data project but no
`serviceusage.services.use` on it then 403s on every BQ call with
USER_PROJECT_DENIED. The config is technically valid, so we warn rather
than error — the operator's billable project must be set distinctly.
Returns:
None when the check doesn't apply (non-BQ instance, or BQ deps missing).
A service-entry dict otherwise: {"status": "ok"} or
{"status": "warning", "detail": ..., "hint": ..., "billing_project": ...,
"data_project": ...}.
"""
try:
from app.instance_config import get_data_source_type
except Exception:
return None
if (get_data_source_type() or "").lower() != "bigquery":
return None
try:
from connectors.bigquery.access import get_bq_access
bq = get_bq_access()
billing = bq.projects.billing
data = bq.projects.data
except Exception as e:
# Resolution failure (missing google-cloud-bigquery, auth error,
# malformed config) is itself a problem worth surfacing. Returning
# status='ok' would mask the failure from automated alerting that
# keys on `status != 'ok'`. Use 'unknown' so the entry shows as
# non-green in operator dashboards but doesn't promote the overall
# check to 'degraded' (which 'warning' does). Devin finding
# 2026-05-01: ANALYSIS_pr-review-job-642ff90f_0007.
return {
"status": "unknown",
"detail": f"could not resolve BQ projects: {e}",
}
if not data:
# not_configured sentinel — surfaced elsewhere; nothing to warn about here.
return {"status": "ok", "detail": "BigQuery project not configured"}
if billing == data:
# Issue #178: this is informational, not a fault. Many valid
# single-project dev instances run with billing == data and the SA
# has `serviceusage.services.use`. Keep the message visible but
# don't promote the overall status to `degraded` for it.
return {
"status": "info",
"detail": "BigQuery billing project equals data project",
"hint": (
"If the SA hits USER_PROJECT_DENIED 403, set "
"data_source.bigquery.billing_project in instance.yaml to a "
"project the SA can bill against (typically your dev/billable "
"project, distinct from a shared read-only data project). "
"Configurable via /admin/server-config UI."
),
"billing_project": billing,
"data_project": data,
}
return {
"status": "ok",
"billing_project": billing,
"data_project": data,
}
def _check_session_pipeline(conn: duckdb.DuckDBPyConnection) -> dict:
"""Detect a stuck session pipeline: jsonls land but never get processed.
Heuristic (#176):
max(mtime of /data/user_sessions/**/*.jsonl) <=
max(processed_at in session_extraction_state) + grace_seconds
grace_seconds = 2 × the verification-detector cadence (default 15m → 30m).
Operators with a custom SCHEDULER_VERIFICATION_DETECTOR_INTERVAL can
extend the grace by setting that env var.
Returns ``warning`` (never ``error``) — the LLM may be down for
maintenance, not a hard failure. Returns ``ok`` when no session
files exist (cold-start case).
"""
# Resolve user_sessions dir from the same DATA_DIR conftest sets up.
data_dir = Path(os.environ.get("DATA_DIR", "/data"))
user_sessions = data_dir / "user_sessions"
try:
session_files = list(user_sessions.glob("**/*.jsonl"))
except OSError:
# Permission / FS error — surface as 'unknown' rather than ok/warning.
return {"status": "unknown", "detail": "could not scan user_sessions"}
if not session_files:
return {"status": "ok", "detail": "no session files yet"}
try:
latest_session_mtime = max(f.stat().st_mtime for f in session_files)
except OSError:
return {"status": "unknown", "detail": "could not stat session files"}
# Look up the most recent processed_at.
try:
row = conn.execute(
"SELECT MAX(processed_at) FROM session_extraction_state"
).fetchone()
except Exception as e:
return {"status": "unknown", "detail": f"could not query session_extraction_state: {e}"}
last_processed = row[0] if row else None
grace_seconds = _verification_detector_grace_seconds()
if last_processed is None:
# Files exist but state table is empty — pipeline never ran here.
if (datetime.now(timezone.utc).timestamp() - latest_session_mtime) > grace_seconds:
return {
"status": "warning",
"detail": (
"session_extraction_state is empty but jsonl files exist. "
"Check the verification-detector scheduler job."
),
"session_files": len(session_files),
}
return {"status": "ok", "session_files": len(session_files)}
# Both available — compare. session_extraction_state.processed_at is
# stored as DuckDB TIMESTAMP (naive). DuckDB converts tz-aware writes
# to local time before storing, so the only safe interpretation is
# local-naive on read. Compute the lag against `datetime.now()` (also
# local-naive) and only convert to epoch via the OS's local timezone
# mapping at the comparison boundary.
now_local_naive = datetime.now()
if hasattr(last_processed, "tzinfo") and last_processed.tzinfo is not None:
last_processed = last_processed.replace(tzinfo=None)
proc_age_seconds = (now_local_naive - last_processed).total_seconds()
file_age_seconds = time_now() - latest_session_mtime
# File is newer than the last processed_at by more than grace_seconds.
if proc_age_seconds - file_age_seconds > grace_seconds:
lag_seconds = int(proc_age_seconds - file_age_seconds)
return {
"status": "warning",
"detail": (
f"session jsonls newer than session_extraction_state by ~{lag_seconds}s "
f"(grace={grace_seconds}s). Check the verification-detector scheduler "
f"job — uploads are not being processed."
),
"lag_seconds": lag_seconds,
"session_files": len(session_files),
}
return {"status": "ok", "session_files": len(session_files)}
def time_now() -> float:
"""Wall-clock seconds since epoch — separated out for test seam parity."""
import time as _t
return _t.time()
def _verification_detector_grace_seconds() -> int:
"""Compute the staleness grace window for the session pipeline check."""
cadence_seconds_default = 15 * 60
raw = os.environ.get("SCHEDULER_VERIFICATION_DETECTOR_INTERVAL")
if raw:
try:
cadence_seconds = int(raw)
if cadence_seconds > 0:
return 2 * cadence_seconds
except ValueError:
pass
return 2 * cadence_seconds_default
def _check_db_schema() -> dict:
"""Check DB schema version against expected SCHEMA_VERSION.
Returns a dict with 'db_schema' key and optional 'detail' key.
"""
try:
conn = get_system_db()
row = conn.execute(
"SELECT version FROM schema_version ORDER BY applied_at DESC LIMIT 1"
).fetchone()
if row is None:
return {"db_schema": "mismatch", "detail": "no schema_version row found"}
current_version = row[0]
if current_version == SCHEMA_VERSION:
return {"db_schema": "ok", "current": current_version, "expected": SCHEMA_VERSION}
else:
return {"db_schema": "mismatch", "current": current_version, "expected": SCHEMA_VERSION}
except Exception as e:
return {"db_schema": "unreachable", "detail": str(e)}
@router.get("/api/health")
async def health_check():
"""Minimal health check for load balancers / compose healthcheck. No auth required."""
schema_check = _check_db_schema()
status = "ok"
if schema_check["db_schema"] != "ok":
status = "unhealthy"
return {"status": status, **schema_check}
@router.get("/api/health/detailed")
async def health_check_detailed(
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
_user: dict = Depends(get_current_user),
include: str = Query(
"",
description=(
"Comma-separated list of optional checks to include. "
"Recognised values: `schema` (DB schema version against the "
"expected migration). The default response omits these because "
"they're rarely actionable on a healthy instance and add noise "
"to `agnes diagnose` output (issue #204). Pass `?include=schema` "
"to get the legacy behavior."
),
),
):
"""Structured health check with deployment metadata. Requires authentication."""
checks = {}
include_set = {p.strip() for p in include.split(",") if p.strip()}
# DuckDB state
try:
conn.execute("SELECT 1").fetchone()
checks["duckdb_state"] = {"status": "ok"}
except Exception as e:
checks["duckdb_state"] = {"status": "error", "detail": str(e)}
# DB schema version check — opt-in (issue #204). Operators who run a
# fresh release pinned to the same image as the running schema rarely
# care about this number; analysts hitting the endpoint via
# `agnes diagnose` see it as noise. Surface it on demand via
# `?include=schema` (the dashboard / admin UI passes this; default
# CLI does not).
if "schema" in include_set:
checks["db_schema"] = _check_db_schema()
# Sync state summary
try:
repo = SyncStateRepository(conn)
all_states = repo.get_all_states()
total_tables = len(all_states)
total_rows = sum(s.get("rows", 0) or 0 for s in all_states)
stale = []
now = datetime.now(timezone.utc)
for s in all_states:
last = s.get("last_sync")
if last:
try:
# Handle both tz-aware and tz-naive datetimes from DuckDB
if hasattr(last, 'tzinfo') and last.tzinfo is None:
from datetime import timezone as tz
last = last.replace(tzinfo=tz.utc)
if (now - last).total_seconds() > 86400:
stale.append(s["table_id"])
except (TypeError, AttributeError):
pass # skip if timestamp comparison fails
checks["data"] = {
"status": "ok" if not stale else "warning",
"tables": total_tables,
"total_rows": total_rows,
"stale_tables": stale,
}
except Exception as e:
checks["data"] = {"status": "error", "detail": str(e)}
# User count
try:
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
checks["users"] = {"status": "ok", "count": user_count}
except Exception as e:
checks["users"] = {"status": "error", "detail": str(e)}
# BigQuery billing-project sanity check (USER_PROJECT_DENIED footgun).
bq_cfg = _check_bq_billing_project()
if bq_cfg is not None:
checks["bq_config"] = bq_cfg
# Session pipeline (#176): warn when uploaded jsonls aren't getting
# processed by the verification-detector cadence.
try:
checks["session_pipeline"] = _check_session_pipeline(conn)
except Exception as e:
checks["session_pipeline"] = {"status": "unknown", "detail": str(e)}
# Aggregate to overall status. `info` and `unknown` surface in the
# response but never escalate the headline (issue #178). `warning`
# promotes to `degraded`; `error` (or a schema mismatch when the
# caller asked for it) promotes to `unhealthy`.
overall = "healthy"
for check in checks.values():
if check.get("status") == "error":
overall = "unhealthy"
break
if check.get("status") == "warning":
overall = "degraded"
# Schema mismatch only escalates when the caller asked for the check
# — otherwise the absent key is treated as "not asserted".
if "db_schema" in checks and checks["db_schema"].get("db_schema") != "ok":
overall = "unhealthy"
return {
"status": overall,
"version": os.environ.get("AGNES_VERSION", "dev"),
"channel": os.environ.get("RELEASE_CHANNEL", "dev"),
"image_tag": os.environ.get("AGNES_TAG", "unknown"),
"commit_sha": os.environ.get("AGNES_COMMIT_SHA", "unknown"),
"schema_version": SCHEMA_VERSION,
"deployed_at": _DEPLOYED_AT,
"timestamp": datetime.now(timezone.utc).isoformat(),
"services": checks,
}
@router.get("/api/version")
async def version_info():
"""Lightweight version info — cacheable, no DB touch. Used by UI footer badge."""
return {
"version": os.environ.get("AGNES_VERSION", "dev"),
"channel": os.environ.get("RELEASE_CHANNEL", "dev"),
"image_tag": os.environ.get("AGNES_TAG", "unknown"),
"commit_sha": os.environ.get("AGNES_COMMIT_SHA", "unknown"),
"schema_version": SCHEMA_VERSION,
"deployed_at": _DEPLOYED_AT,
}