* Extract session pipeline framework, refactor verification, add UsageProcessor skeleton Pluggable framework under services/session_pipeline/ (contract + lib + per-processor runner) so multiple processors can read /data/user_sessions/<key>/*.jsonl on their own cadence with full failure isolation. Verification flow becomes the first plugin; a no-op UsageProcessor reserves the second slot pending a separate brainstorm on extraction logic + storage shape. Schema v28→v29: rename session_extraction_state → session_processor_state with composite PK (processor_name, session_file). Existing rows copied over with processor_name='verification'; legacy table dropped. Migration is idempotent and no-ops the copy step on fresh installs that came up at the new schema. Endpoint: /api/admin/run-verification-detector replaced by parametrized /api/admin/run-session-processor?processor=<name>. Audit action format follows. Scheduler JOBS: verification-detector entry split into session-processor:verification + session-processor:usage. SCHEDULER_VERIFICATION_DETECTOR_INTERVAL retained for operator compatibility (drives both cadence and health-check grace window); SCHEDULER_USAGE_PROCESSOR_INTERVAL added. * Address PR #232 review: scan dead branch + per-processor lock - `SessionProcessorStateRepository.scan_unprocessed_for` dead else: both branches surfaced every jsonl, the SELECT was unused, runner MD5-rehashed every stable session per tick. Replaced with an mtime precheck — stable sessions (mtime <= processed_at) are filtered at scan; modified files still surface for the runner's authoritative `file_hash` invalidation. Naive-local comparison matches the existing health-check idiom (DuckDB TIMESTAMP strips tz on storage). - Per-processor advisory lock around `_run_processor` in `/api/admin/run-session-processor`. Scheduler tick + manual admin POST could otherwise both run, both call create_evidence on overlapping detections, and accumulate duplicate verification_evidence rows (the dedup short-circuit only covers create+contradiction, not evidence per ADR Decision 3). Non-blocking acquire → 409 Conflict on concurrent invocation; release in finally so a runner exception doesn't wedge the processor. Tests: two new scan unit tests (mtime filter + post-mark mtime bump), 409 endpoint test, lock-released-on-exception test. Two existing tests updated for the new "filtered at scan" stat shape (previously asserted skipped == 1, now scanned == 0). * Address PR #232 review #2: parallel scheduler tick + last_run on terminal state Two pre-existing scaffold bugs in services/scheduler/__main__.py amplified by adding more session-pipeline jobs: 1. Serial for-loop over jobs with synchronous httpx.post(timeout=900) — a 10-minute verification run blocked every other job (data-refresh, health-check, usage, corporate-memory) for the whole window. The PR's stated isolation guarantee held inside the runner but broke at the scheduler dispatch layer. 2. last_run advanced only when _call_api returned True. Permanent-failure jobs hot-looped on every tick (30s) instead of cadence (15min). Fix: ThreadPoolExecutor.submit per due job + per-job in_flight set so a long-running job can't be re-launched on subsequent ticks. last_run advances unconditionally in finally; errors still surface via _call_api logging + audit_log on the receiving side. _run_job extracted to module-level for unit testing. New tests: - TestRunJobBookkeeping: advances on success / failure / unhandled raise - TestRunLoopParallelism: in_flight protection prevents duplicate launches across ticks for a single slow job --------- Co-authored-by: Minas Arustamyan <arustamyan.minas@gmail.com>
127 lines
5.1 KiB
Python
127 lines
5.1 KiB
Python
"""v20 adds source_query column to table_registry.
|
|
|
|
Backs query_mode='materialized' for BigQuery: admin registers a SQL body
|
|
that the scheduler runs through the DuckDB BQ extension and writes as a
|
|
parquet to /data/extracts/bigquery/data/<id>.parquet.
|
|
|
|
The v19 step (#150) drops dataset_permissions, access_requests tables and
|
|
users.role, table_registry.is_public columns; v20 then ALTERs the post-v19
|
|
table_registry to add the source_query column.
|
|
"""
|
|
import duckdb
|
|
|
|
from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version
|
|
|
|
|
|
def test_schema_version_is_31():
|
|
# v27 → v28: explicit-install (Model B) for curated marketplace plugins.
|
|
# user_plugin_optouts row presence flips meaning from "excluded" to
|
|
# "subscribed"; migration wipes existing rows so the inverted reading
|
|
# starts from a clean baseline. Also adds marketplace_plugins.created_at
|
|
# (per-plugin "newest first" sort on /marketplace), backfilled from
|
|
# parent marketplace_registry.registered_at.
|
|
# v28 → v29: /home page rollout — instance_templates singleton
|
|
# consolidation (welcome_template + claude_md_template merged) + new
|
|
# users.onboarded column. See tests/test_v29_home_migration.py for
|
|
# the exhaustive coverage of that step.
|
|
# v29 → v30: news_template — single versioned table for the /home
|
|
# news perex + /news permalink page. See
|
|
# tests/test_news_template_repository.py.
|
|
# v30 → v31: session-pipeline framework. Renames session_extraction_state
|
|
# → session_processor_state with composite PK (processor_name,
|
|
# session_file) so multiple processors can track their own
|
|
# processed-set independently. Existing rows are copied across with
|
|
# processor_name='verification'; the old table is dropped.
|
|
assert SCHEMA_VERSION == 31
|
|
|
|
|
|
def test_v20_adds_source_query(tmp_path):
|
|
db_path = tmp_path / "system.duckdb"
|
|
conn = duckdb.connect(str(db_path))
|
|
_ensure_schema(conn)
|
|
|
|
cols = {
|
|
r[0] for r in conn.execute(
|
|
"SELECT column_name FROM information_schema.columns "
|
|
"WHERE table_name = 'table_registry'"
|
|
).fetchall()
|
|
}
|
|
assert "source_query" in cols, f"source_query missing from {cols}"
|
|
assert get_schema_version(conn) == SCHEMA_VERSION
|
|
conn.close()
|
|
|
|
|
|
def test_claude_md_template_seeded_in_instance_templates(tmp_path):
|
|
"""v23 introduced claude_md_template as a singleton table; v28 consolidates
|
|
it into instance_templates keyed 'claude_md'. Post-v28 the legacy table is
|
|
dropped — the canonical lookup is `instance_templates WHERE key='claude_md'`.
|
|
|
|
See tests/test_v28_migration.py for the migration path coverage. This test
|
|
just verifies the seeded row is present on a fresh install.
|
|
"""
|
|
db_path = tmp_path / "system.duckdb"
|
|
conn = duckdb.connect(str(db_path))
|
|
_ensure_schema(conn)
|
|
|
|
tables = {
|
|
r[0] for r in conn.execute(
|
|
"SELECT table_name FROM information_schema.tables "
|
|
"WHERE table_schema = 'main'"
|
|
).fetchall()
|
|
}
|
|
assert "instance_templates" in tables
|
|
assert "claude_md_template" not in tables, (
|
|
"claude_md_template should be consolidated away post-v28"
|
|
)
|
|
|
|
row = conn.execute(
|
|
"SELECT key, content FROM instance_templates WHERE key = 'claude_md'"
|
|
).fetchone()
|
|
assert row is not None
|
|
assert row[0] == "claude_md"
|
|
assert row[1] is None # default = no override
|
|
conn.close()
|
|
|
|
|
|
def test_v19_db_migrates_to_v20(tmp_path):
|
|
"""Pre-existing v19 DB (post-RBAC-drop) without source_query upgrades
|
|
cleanly without losing data."""
|
|
db_path = tmp_path / "system.duckdb"
|
|
conn = duckdb.connect(str(db_path))
|
|
|
|
# Simulate a v19 DB at minimal but realistic shape: schema_version row +
|
|
# a table_registry row in the post-v19 column shape (no is_public column,
|
|
# since v19 finalize dropped it via the table-rebuild idiom).
|
|
conn.execute(
|
|
"CREATE TABLE schema_version (version INTEGER, "
|
|
"applied_at TIMESTAMP DEFAULT current_timestamp)"
|
|
)
|
|
conn.execute("INSERT INTO schema_version (version) VALUES (19)")
|
|
conn.execute("""CREATE TABLE table_registry (
|
|
id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL,
|
|
source_type VARCHAR, bucket VARCHAR, source_table VARCHAR,
|
|
sync_strategy VARCHAR DEFAULT 'full_refresh',
|
|
query_mode VARCHAR DEFAULT 'local',
|
|
sync_schedule VARCHAR, profile_after_sync BOOLEAN DEFAULT true,
|
|
primary_key VARCHAR, folder VARCHAR, description TEXT,
|
|
registered_by VARCHAR,
|
|
registered_at TIMESTAMP DEFAULT current_timestamp
|
|
)""")
|
|
conn.execute("INSERT INTO table_registry (id, name) VALUES ('foo', 'foo')")
|
|
|
|
_ensure_schema(conn)
|
|
|
|
assert get_schema_version(conn) == SCHEMA_VERSION # bumped 19→28 forward
|
|
cols = {
|
|
r[0] for r in conn.execute(
|
|
"SELECT column_name FROM information_schema.columns "
|
|
"WHERE table_name = 'table_registry'"
|
|
).fetchall()
|
|
}
|
|
assert "source_query" in cols
|
|
# Existing row preserved, new column NULL
|
|
row = conn.execute(
|
|
"SELECT id, source_query FROM table_registry WHERE id='foo'"
|
|
).fetchone()
|
|
assert row == ("foo", None)
|
|
conn.close()
|