* Extract session pipeline framework, refactor verification, add UsageProcessor skeleton

Pluggable framework under services/session_pipeline/ (contract + lib + per-processor runner) so multiple processors can read /data/user_sessions/<key>/*.jsonl on their own cadence with full failure isolation. Verification flow becomes the first plugin; a no-op UsageProcessor reserves the second slot pending a separate brainstorm on extraction logic + storage shape.

Schema v28→v29: rename session_extraction_state → session_processor_state with composite PK (processor_name, session_file). Existing rows copied over with processor_name='verification'; legacy table dropped. Migration is idempotent and no-ops the copy step on fresh installs that came up at the new schema.

Endpoint: /api/admin/run-verification-detector replaced by parametrized /api/admin/run-session-processor?processor=<name>. Audit action format follows.

Scheduler JOBS: verification-detector entry split into session-processor:verification + session-processor:usage. SCHEDULER_VERIFICATION_DETECTOR_INTERVAL retained for operator compatibility (drives both cadence and health-check grace window); SCHEDULER_USAGE_PROCESSOR_INTERVAL added.

* Address PR #232 review: scan dead branch + per-processor lock

- `SessionProcessorStateRepository.scan_unprocessed_for` dead else: both branches surfaced every jsonl, the SELECT was unused, and the runner MD5-rehashed every stable session per tick. Replaced with an mtime precheck: stable sessions (mtime <= processed_at) are filtered at scan; modified files still surface for the runner's authoritative `file_hash` invalidation. Naive-local comparison matches the existing health-check idiom (DuckDB TIMESTAMP strips tz on storage).
- Per-processor advisory lock around `_run_processor` in `/api/admin/run-session-processor`. Scheduler tick + manual admin POST could otherwise both run, both call create_evidence on overlapping detections, and accumulate duplicate verification_evidence rows (the dedup short-circuit only covers create+contradiction, not evidence per ADR Decision 3). Non-blocking acquire → 409 Conflict on concurrent invocation; release in finally so a runner exception doesn't wedge the processor.

Tests: two new scan unit tests (mtime filter + post-mark mtime bump), 409 endpoint test, lock-released-on-exception test. Two existing tests updated for the new "filtered at scan" stat shape (previously asserted skipped == 1, now scanned == 0).

* Address PR #232 review #2: parallel scheduler tick + last_run on terminal state

Two pre-existing scaffold bugs in services/scheduler/__main__.py amplified by adding more session-pipeline jobs:

1. Serial for-loop over jobs with synchronous httpx.post(timeout=900): a 10-minute verification run blocked every other job (data-refresh, health-check, usage, corporate-memory) for the whole window. The PR's stated isolation guarantee held inside the runner but broke at the scheduler dispatch layer.
2. last_run advanced only when _call_api returned True. Permanent-failure jobs hot-looped on every tick (30s) instead of cadence (15min).

Fix: ThreadPoolExecutor.submit per due job + per-job in_flight set so a long-running job can't be re-launched on subsequent ticks. last_run advances unconditionally in finally; errors still surface via _call_api logging + audit_log on the receiving side. _run_job extracted to module-level for unit testing.

New tests:

- TestRunJobBookkeeping: advances on success / failure / unhandled raise
- TestRunLoopParallelism: in_flight protection prevents duplicate launches across ticks for a single slow job

---------

Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
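The per-processor lock from the first review round can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it assumes an in-process `threading.Lock` per processor name, and `run_exclusively`, `_processor_locks`, and the `(status, body)` return shape are hypothetical names chosen for the example. The real advisory lock may live elsewhere (for instance at the database or file level).

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, Tuple

# Hypothetical registry: one lock per processor name, created on first use.
_registry_guard = threading.Lock()
_processor_locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)


def run_exclusively(name: str, run: Callable[[], dict]) -> Tuple[int, dict]:
    """Return (http_status, body); 409 if this processor is already running."""
    with _registry_guard:
        lock = _processor_locks[name]
    # Non-blocking acquire: a concurrent caller is rejected instead of queued.
    if not lock.acquire(blocking=False):
        return 409, {"detail": f"processor '{name}' is already running"}
    try:
        return 200, run()
    finally:
        # Released even if run() raises, so a failure cannot wedge the processor.
        lock.release()
```

Rejecting the second caller rather than queueing it keeps a manual admin POST from piling up behind a long scheduler-driven run; locks are keyed by name, so `verification` and `usage` never block each other.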
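The scheduler fix from the second review round (one `ThreadPoolExecutor.submit` per due job, an `in_flight` guard, and `last_run` advanced in `finally`) might look something like the sketch below. It is an assumption-laden illustration rather than the contents of services/scheduler/__main__.py: `run_job`, `tick`, the module-level dictionaries, and the `call_api` callable are hypothetical stand-ins for `_run_job`, the run loop, and `_call_api`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Set

# Hypothetical in-memory bookkeeping; the real scheduler's structures may differ.
last_run: Dict[str, float] = {}
in_flight: Set[str] = set()
state_lock = threading.Lock()


def run_job(name: str, call_api: Callable[[str], bool]) -> None:
    """Run one job; advance last_run on success, failure, or exception."""
    try:
        call_api(name)  # the callee is expected to log its own errors
    except Exception:
        pass  # sketch only: a real scheduler would log the exception here
    finally:
        with state_lock:
            last_run[name] = time.monotonic()
            in_flight.discard(name)


def tick(jobs: Dict[str, float], call_api: Callable[[str], bool],
         pool: ThreadPoolExecutor) -> List:
    """Submit every due job that is not already running; return the futures."""
    futures = []
    now = time.monotonic()
    with state_lock:
        for name, interval in jobs.items():
            due = now - last_run.get(name, float("-inf")) >= interval
            if due and name not in in_flight:
                in_flight.add(name)  # marked before submit, so later ticks skip it
                futures.append(pool.submit(run_job, name, call_api))
    return futures
```

Because `last_run` is written in `finally`, a permanently failing job waits out its full interval before the next attempt instead of retrying on every tick, and the `in_flight` check keeps a slow job from being launched twice while the rest of the jobs proceed in parallel.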
80 lines
2.5 KiB
Python
"""CLI entry point for ad-hoc local runs of the verification processor.

Usage:
    python -m services.verification_detector [--verbose] [--reset]

After the session-pipeline refactor the canonical execution path is the
admin endpoint POST /api/admin/run-session-processor?processor=verification
driven by the scheduler. This CLI shim is kept as a developer convenience
for running the verification flow against a local instance without going
through HTTP; it constructs the VerificationProcessor and runs it through
the shared runner.
"""

import argparse
import logging
import sys

from app.logging_config import setup_logging
from services.session_pipeline.runner import run_processor
from services.session_processors.verification import build_verification_processor
from src.db import get_system_db

logger = logging.getLogger(__name__)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Extract verified organizational knowledge from analyst session transcripts."
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Enable debug-level logging.",
    )
    parser.add_argument(
        "--reset",
        action="store_true",
        help="Reset the verification processor's session-processed state before running.",
    )
    args = parser.parse_args()

    setup_logging(__name__, level="DEBUG" if args.verbose else "INFO")

    try:
        processor = build_verification_processor()
    except (ValueError, FileNotFoundError) as e:
        logger.error(
            "Failed to initialize verification processor: %s. "
            "Configure ai: in instance.yaml or set ANTHROPIC_API_KEY / LLM_API_KEY.",
            e,
        )
        sys.exit(1)

    conn = get_system_db()

    if args.reset:
        logger.info("Resetting verification processor state...")
        conn.execute(
            "DELETE FROM session_processor_state WHERE processor_name = ?",
            [processor.name],
        )

    stats = run_processor(conn, processor)

    print("\n--- Verification Processor Summary ---")
    print(f"Sessions scanned: {stats['scanned']}")
    print(f"Sessions processed: {stats['processed']}")
    print(f"Sessions skipped: {stats['skipped']}")
    print(f"Items created: {stats['items_extracted']}")
    if stats["errors"]:
        print(f"Errors: {stats['errors']}")
        for err in stats["errors_detail"]:
            print(f"  - {err}")

    if stats["errors"]:
        sys.exit(1)


if __name__ == "__main__":
    main()