* Extract session pipeline framework, refactor verification, add UsageProcessor skeleton Pluggable framework under services/session_pipeline/ (contract + lib + per-processor runner) so multiple processors can read /data/user_sessions/<key>/*.jsonl on their own cadence with full failure isolation. Verification flow becomes the first plugin; a no-op UsageProcessor reserves the second slot pending a separate brainstorm on extraction logic + storage shape. Schema v28→v29: rename session_extraction_state → session_processor_state with composite PK (processor_name, session_file). Existing rows copied over with processor_name='verification'; legacy table dropped. Migration is idempotent and no-ops the copy step on fresh installs that came up at the new schema. Endpoint: /api/admin/run-verification-detector replaced by parametrized /api/admin/run-session-processor?processor=<name>. Audit action format follows. Scheduler JOBS: verification-detector entry split into session-processor:verification + session-processor:usage. SCHEDULER_VERIFICATION_DETECTOR_INTERVAL retained for operator compatibility (drives both cadence and health-check grace window); SCHEDULER_USAGE_PROCESSOR_INTERVAL added. * Address PR #232 review: scan dead branch + per-processor lock - `SessionProcessorStateRepository.scan_unprocessed_for` dead else: both branches surfaced every jsonl, the SELECT was unused, runner MD5-rehashed every stable session per tick. Replaced with an mtime precheck — stable sessions (mtime <= processed_at) are filtered at scan; modified files still surface for the runner's authoritative `file_hash` invalidation. Naive-local comparison matches the existing health-check idiom (DuckDB TIMESTAMP strips tz on storage). - Per-processor advisory lock around `_run_processor` in `/api/admin/run-session-processor`. Scheduler tick + manual admin POST could otherwise both run, both call create_evidence on overlapping detections, and accumulate duplicate verification_evidence rows (the dedup short-circuit only covers create+contradiction, not evidence per ADR Decision 3). Non-blocking acquire → 409 Conflict on concurrent invocation; release in finally so a runner exception doesn't wedge the processor. Tests: two new scan unit tests (mtime filter + post-mark mtime bump), 409 endpoint test, lock-released-on-exception test. Two existing tests updated for the new "filtered at scan" stat shape (previously asserted skipped == 1, now scanned == 0). * Address PR #232 review #2: parallel scheduler tick + last_run on terminal state Two pre-existing scaffold bugs in services/scheduler/__main__.py amplified by adding more session-pipeline jobs: 1. Serial for-loop over jobs with synchronous httpx.post(timeout=900) — a 10-minute verification run blocked every other job (data-refresh, health-check, usage, corporate-memory) for the whole window. The PR's stated isolation guarantee held inside the runner but broke at the scheduler dispatch layer. 2. last_run advanced only when _call_api returned True. Permanent-failure jobs hot-looped on every tick (30s) instead of cadence (15min). Fix: ThreadPoolExecutor.submit per due job + per-job in_flight set so a long-running job can't be re-launched on subsequent ticks. last_run advances unconditionally in finally; errors still surface via _call_api logging + audit_log on the receiving side. _run_job extracted to module-level for unit testing. New tests: - TestRunJobBookkeeping: advances on success / failure / unhandled raise - TestRunLoopParallelism: in_flight protection prevents duplicate launches across ticks for a single slow job --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
81 lines
2.8 KiB
Python
81 lines
2.8 KiB
Python
"""LLM-side helpers for the verification detector.
|
|
|
|
After the session-pipeline refactor, the orchestration loop (scan unprocessed
|
|
→ parse jsonl → mark processed) lives in services/session_pipeline/, and the
|
|
per-session persistence flow lives in services/session_processors/verification.py
|
|
(VerificationProcessor). This module retains only the pieces specific to LLM
|
|
extraction — prompt formatting, the structured-output call, and the
|
|
deterministic-id helper — which both the new processor and the legacy
|
|
__main__.py CLI shim still import.
|
|
"""
|
|
|
|
import hashlib
|
|
import logging
|
|
|
|
from connectors.llm import StructuredExtractor
|
|
from connectors.llm.exceptions import LLMError
|
|
|
|
from .prompts import VERIFICATION_EXTRACT_PROMPT
|
|
from .schemas import VERIFICATION_SCHEMA
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
MAX_TURNS_PER_SESSION = 100
|
|
|
|
|
|
def _generate_id(title: str, content: str) -> str:
|
|
"""Generate deterministic ID from title + content (same pattern as corporate memory collector)."""
|
|
raw = f"{title}:{content}"
|
|
return "kv_" + hashlib.sha256(raw.encode()).hexdigest()[:12]
|
|
|
|
|
|
def _format_turns(turns: list[dict]) -> str:
|
|
"""Format conversation turns as a parseable, prompt-injection-hardened block.
|
|
|
|
Session transcripts are heavily user-influenced (anything the analyst typed
|
|
lands here). Each turn is wrapped in `<turn role="…">` tags with `</turn>`
|
|
neutralized inside the content so a crafted message cannot break out of
|
|
the wrapper. The trust-boundary instruction in VERIFICATION_EXTRACT_PROMPT
|
|
tells the LLM to treat content inside `<turn>` as data, not directives.
|
|
"""
|
|
lines: list[str] = []
|
|
for turn in turns:
|
|
role = turn.get("role", "unknown")
|
|
content = (turn.get("content") or "").replace("</turn>", "</turn>")
|
|
lines.append(f'<turn role="{role}">{content}</turn>')
|
|
return "\n".join(lines)
|
|
|
|
|
|
def extract_verifications(
|
|
extractor: StructuredExtractor,
|
|
username: str,
|
|
session_id: str,
|
|
turns: list[dict],
|
|
max_turns: int = MAX_TURNS_PER_SESSION,
|
|
) -> list[dict]:
|
|
"""Send conversation turns to LLM for verification detection."""
|
|
if not turns:
|
|
return []
|
|
|
|
# Truncate to last N turns if too long
|
|
if len(turns) > max_turns:
|
|
turns = turns[-max_turns:]
|
|
|
|
conversation_text = _format_turns(turns)
|
|
prompt = VERIFICATION_EXTRACT_PROMPT.format(
|
|
username=username,
|
|
session_id=session_id,
|
|
conversation=conversation_text,
|
|
)
|
|
|
|
try:
|
|
result = extractor.extract_json(
|
|
prompt=prompt,
|
|
max_tokens=4096,
|
|
json_schema=VERIFICATION_SCHEMA,
|
|
schema_name="verification_extract",
|
|
)
|
|
return result.get("verifications", [])
|
|
except LLMError as e:
|
|
logger.error("LLM extraction failed for session %s: %s", session_id, e)
|
|
return []
|