fix(security): RBAC filter uses stable user_id instead of mutable email local-part (#293) (#299)

* fix(security): RBAC filter for agnes_sessions matches both email local-part and user_id

The upload API (POST /api/upload/sessions) stores session files under
user_sessions/{user_id}/ (UUID), while the session collector uses the
OS username (email local-part). The session pipeline writes the directory
name verbatim into usage_session_summary.username, so the column can
contain either value depending on the ingestion path.

The RBAC filter in build_filter_clause previously only matched the email
local-part, missing sessions uploaded via the API. The fix adds an OR
condition so non-admin users see rows where username matches either their
email local-part or their user_id.

Closes #293

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(security): RBAC filter uses stable user_id instead of mutable email local-part

Closes #293

Previous fix used OR condition matching both email local-part and user_id
in the username column. This was fragile: email changes would break
filtering. This commit introduces a dedicated user_id column populated
by the session pipeline via resolve_user_id(), and switches the RBAC
filter to use it exclusively.

Changes:
- Schema v45: add user_id column to usage_session_summary and usage_events
- UsageProcessor: accept and store user_id in both tables
- runner.py: resolve_user_id() maps directory name to users.id UUID
  (exact match for UUID dirs, email LIKE for local-part dirs)
- INTERNAL_TABLES: agnes_sessions/agnes_telemetry filter on user_id column
- build_filter_clause: simplified to WHERE user_id = '<uuid>' (no OR)
- me.py/admin_user_sessions.py: query by user_id OR username for
  backward compatibility during transition
- USAGE_PROCESSOR_VERSION bumped 2→3 to trigger reprocessing/backfill
- Tests updated: 27 pass including new email-change resilience test

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(tests): bump schema version assertions 44→45

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(docs): correct resolve_user_id docstring, add TypeError comment

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(security): address review — backward-compat OR, LIKE escaping, narrower TypeError

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(security): address code review — eliminate TypeError hack, add resolve_user_id tests

Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>

* fix(db): create user_id indexes in _v44_to_v45, not _SYSTEM_SCHEMA

_SYSTEM_SCHEMA runs before the migration ladder. On an upgrade from
v42/v43/v44, usage_events / usage_session_summary already exist without
the user_id column (CREATE TABLE IF NOT EXISTS is a no-op), so the
CREATE INDEX ... (user_id) lines in _SYSTEM_SCHEMA failed to bind and
aborted _ensure_schema — the app would not start post-upgrade. Move the
index creation to _v44_to_v45, which ADDs the column first. Same pattern
as the v41 audit_log indices.

* fix(usage): bump USAGE_PROCESSOR_VERSION 3→4 for user_id backfill

#303 shipped USAGE_PROCESSOR_VERSION=3 (release 0.54.12) for its
<command-name> slash extraction. This PR's 2→3 bump collided with it
on rebase, so the reprocess loop would not re-trigger to backfill the
new user_id column on deployments already running v3. Bump to 4.

* release: 0.54.13 — RBAC filter uses stable user_id (#293)

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
This commit is contained in:
ZdenekSrotyr 2026-05-14 16:12:54 +02:00 committed by GitHub
parent f53e98d5a3
commit 3e19caa975
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
16 changed files with 729 additions and 494 deletions

View file

@ -10,6 +10,20 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased] ## [Unreleased]
## [0.54.13] — 2026-05-14
### Security
- **RBAC filter uses stable `user_id` (UUID) instead of mutable email
local-part (#293).** Non-admin users querying `agnes_sessions` /
`agnes_telemetry` are now filtered by `user_id` (immutable UUID)
rather than `username` (email local-part, which changes on rename).
Schema v45 adds a `user_id` column to `usage_session_summary` and
`usage_events`; the session pipeline's `resolve_user_id()` populates
it on every (re)process run. `USAGE_PROCESSOR_VERSION` bumps 3→4 to
trigger backfill. During the transition period, RBAC queries include
an OR fallback on `username` so pre-backfill rows remain visible.
## [0.54.12] — 2026-05-14 ## [0.54.12] — 2026-05-14
### Fixed ### Fixed

View file

@ -97,6 +97,8 @@ def list_user_sessions(
# ------------------------------------------------------------------ # ------------------------------------------------------------------
# Pull processed rows from usage_session_summary # Pull processed rows from usage_session_summary
# ------------------------------------------------------------------ # ------------------------------------------------------------------
# Match on both user_id (stable, v45+) and username (legacy) so the
# admin view shows sessions from both ingestion paths and pre-v45 rows.
try: try:
rows_db = conn.execute( rows_db = conn.execute(
""" """
@ -105,19 +107,27 @@ def list_user_sessions(
active_seconds, wall_seconds, active_seconds, wall_seconds,
tool_calls, tool_errors, primary_model tool_calls, tool_errors, primary_model
FROM usage_session_summary FROM usage_session_summary
WHERE username = ? WHERE user_id = ? OR username = ?
ORDER BY started_at DESC NULLS LAST ORDER BY started_at DESC NULLS LAST
""", """,
[username], [user_id, username],
).fetchall() ).fetchall()
except Exception: except Exception:
rows_db = [] rows_db = []
processed_files: dict[str, dict] = {} processed_files: dict[str, dict] = {}
if rows_db: if rows_db:
cols = ["session_file", "session_id", "started_at", "ended_at", cols = [
"active_seconds", "wall_seconds", "tool_calls", "tool_errors", "session_file",
"primary_model"] "session_id",
"started_at",
"ended_at",
"active_seconds",
"wall_seconds",
"tool_calls",
"tool_errors",
"primary_model",
]
for r in rows_db: for r in rows_db:
d = dict(zip(cols, r)) d = dict(zip(cols, r))
# Normalise timestamps to ISO strings # Normalise timestamps to ISO strings
@ -144,18 +154,20 @@ def list_user_sessions(
# Try to extract a session_id from the filename: the collector # Try to extract a session_id from the filename: the collector
# names files like "<session_id>.jsonl" or "sess-<id>.jsonl". # names files like "<session_id>.jsonl" or "sess-<id>.jsonl".
sid = p.stem sid = p.stem
all_rows.append({ all_rows.append(
"session_file": fname, {
"session_id": sid, "session_file": fname,
"started_at": mtime.isoformat(), "session_id": sid,
"ended_at": None, "started_at": mtime.isoformat(),
"active_seconds": None, "ended_at": None,
"wall_seconds": None, "active_seconds": None,
"tool_calls": 0, "wall_seconds": None,
"tool_errors": 0, "tool_calls": 0,
"primary_model": None, "tool_errors": 0,
"processed": False, "primary_model": None,
}) "processed": False,
}
)
# Sort: processed (have started_at) first then unprocessed, both newest-first # Sort: processed (have started_at) first then unprocessed, both newest-first
def _sort_key(r: dict): def _sort_key(r: dict):
@ -165,7 +177,7 @@ def list_user_sessions(
all_rows.sort(key=_sort_key, reverse=True) all_rows.sort(key=_sort_key, reverse=True)
total = len(all_rows) total = len(all_rows)
page = all_rows[offset: offset + limit] page = all_rows[offset : offset + limit]
return { return {
"rows": page, "rows": page,
@ -351,7 +363,7 @@ def list_user_activity(
audit_repo = AuditRepository(conn) audit_repo = AuditRepository(conn)
rows, _ = audit_repo.query(user_id=user_id, limit=limit + offset) rows, _ = audit_repo.query(user_id=user_id, limit=limit + offset)
# Apply offset via slicing — cursor-based pagination is per-page only # Apply offset via slicing — cursor-based pagination is per-page only
rows = rows[offset: offset + limit] rows = rows[offset : offset + limit]
# Normalise timestamps to ISO strings and decode JSON params # Normalise timestamps to ISO strings and decode JSON params
for r in rows: for r in rows:
@ -366,9 +378,7 @@ def list_user_activity(
except (ValueError, TypeError): except (ValueError, TypeError):
pass pass
total = conn.execute( total = conn.execute("SELECT COUNT(*) FROM audit_log WHERE user_id = ?", [user_id]).fetchone()[0]
"SELECT COUNT(*) FROM audit_log WHERE user_id = ?", [user_id]
).fetchone()[0]
try: try:
AuditRepository(conn).log( AuditRepository(conn).log(

View file

@ -84,9 +84,7 @@ def _username_for_stats(user: dict) -> str:
return email.split("@")[0] if "@" in email else email return email.split("@")[0] if "@" in email else email
def compute_home_stats( def compute_home_stats(conn: duckdb.DuckDBPyConnection, user: dict, window: str = "24h") -> dict:
conn: duckdb.DuckDBPyConnection, user: dict, window: str = "24h"
) -> dict:
"""Pure helper that returns the home-stats payload for the given user. """Pure helper that returns the home-stats payload for the given user.
Shared by the HTTP endpoint and the /home Jinja handler (server-side Shared by the HTTP endpoint and the /home Jinja handler (server-side
@ -101,9 +99,13 @@ def compute_home_stats(
interval = _WINDOW_INTERVALS["24h"] interval = _WINDOW_INTERVALS["24h"]
username = _username_for_stats(user) username = _username_for_stats(user)
uid = user.get("id") or ""
# f-string interpolates only the validated interval literal above; # f-string interpolates only the validated interval literal above;
# all user-controlled input flows through bound parameters. # all user-controlled input flows through bound parameters.
# Match on both user_id (stable, populated by v45 pipeline) and
# username (legacy rows before v45 backfill) so stats are complete
# during the transition period.
sql = f""" sql = f"""
WITH win AS ( WITH win AS (
SELECT current_timestamp - {interval} AS since SELECT current_timestamp - {interval} AS since
@ -117,12 +119,13 @@ def compute_home_stats(
COALESCE(SUM(cache_read_tokens), 0) AS cache_read, COALESCE(SUM(cache_read_tokens), 0) AS cache_read,
COALESCE(SUM(cache_creation_tokens), 0) AS cache_creation COALESCE(SUM(cache_creation_tokens), 0) AS cache_creation
FROM usage_session_summary, win FROM usage_session_summary, win
WHERE username = ? AND started_at >= win.since WHERE (user_id = ? OR username = ?)
AND started_at >= win.since
), ),
proj AS ( proj AS (
SELECT COUNT(DISTINCT cwd) AS projects SELECT COUNT(DISTINCT cwd) AS projects
FROM usage_events, win FROM usage_events, win
WHERE username = ? WHERE (user_id = ? OR username = ?)
AND cwd IS NOT NULL AND cwd IS NOT NULL
AND occurred_at >= win.since AND occurred_at >= win.since
), ),
@ -137,7 +140,7 @@ def compute_home_stats(
proj.projects proj.projects
FROM u, sess, proj FROM u, sess, proj
""" """
row = conn.execute(sql, [username, username, user["id"]]).fetchone() row = conn.execute(sql, [uid, username, uid, username, uid]).fetchone()
if row is None: if row is None:
return { return {
@ -146,15 +149,16 @@ def compute_home_stats(
"sessions": 0, "sessions": 0,
"prompts": 0, "prompts": 0,
"tokens": { "tokens": {
"input": 0, "output": 0, "input": 0,
"cache_read": 0, "cache_creation": 0, "output": 0,
"cache_read": 0,
"cache_creation": 0,
"total": 0, "total": 0,
}, },
"projects": 0, "projects": 0,
} }
(last_pull_at, sessions, prompts, (last_pull_at, sessions, prompts, input_t, output_t, cache_read, cache_creation, projects) = row
input_t, output_t, cache_read, cache_creation, projects) = row
return { return {
"window": window, "window": window,
"last_pull_at": last_pull_at.isoformat() if last_pull_at else None, "last_pull_at": last_pull_at.isoformat() if last_pull_at else None,
@ -165,8 +169,7 @@ def compute_home_stats(
"output": int(output_t or 0), "output": int(output_t or 0),
"cache_read": int(cache_read or 0), "cache_read": int(cache_read or 0),
"cache_creation": int(cache_creation or 0), "cache_creation": int(cache_creation or 0),
"total": int((input_t or 0) + (output_t or 0) "total": int((input_t or 0) + (output_t or 0) + (cache_read or 0) + (cache_creation or 0)),
+ (cache_read or 0) + (cache_creation or 0)),
}, },
"projects": int(projects or 0), "projects": int(projects or 0),
} }

View file

@ -27,9 +27,8 @@ from __future__ import annotations
import logging import logging
import re import re
from dataclasses import dataclass from dataclasses import dataclass
from typing import Any, Optional from typing import Any
import duckdb
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -38,6 +37,7 @@ logger = logging.getLogger(__name__)
# Internal-table registry — single source of truth # Internal-table registry — single source of truth
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@dataclass(frozen=True) @dataclass(frozen=True)
class InternalTable: class InternalTable:
"""One internal table mapping. """One internal table mapping.
@ -54,30 +54,34 @@ class InternalTable:
display_name: human-readable name (also goes into ``table_registry.name``) display_name: human-readable name (also goes into ``table_registry.name``)
description: short blurb (catalog UI + ``agnes catalog`` output) description: short blurb (catalog UI + ``agnes catalog`` output)
""" """
registry_id: str registry_id: str
source_table: str source_table: str
filter_column: str filter_column: str
filter_kind: str # 'username' | 'user_id' filter_kind: str # 'username' | 'user_id'
display_name: str display_name: str
description: str description: str
legacy_username_column: str | None = None # backward-compat OR fallback
INTERNAL_TABLES: tuple[InternalTable, ...] = ( INTERNAL_TABLES: tuple[InternalTable, ...] = (
InternalTable( InternalTable(
registry_id="agnes_sessions", registry_id="agnes_sessions",
source_table="usage_session_summary", source_table="usage_session_summary",
filter_column="username", filter_column="user_id",
filter_kind="username", filter_kind="user_id",
display_name="Agnes sessions", display_name="Agnes sessions",
description="Claude Code sessions. Also available locally for analysis.", description="Claude Code sessions. Also available locally for analysis.",
legacy_username_column="username",
), ),
InternalTable( InternalTable(
registry_id="agnes_telemetry", registry_id="agnes_telemetry",
source_table="usage_events", source_table="usage_events",
filter_column="username", filter_column="user_id",
filter_kind="username", filter_kind="user_id",
display_name="Agnes telemetry events", display_name="Agnes telemetry events",
description="Tool and skill invocations from Claude Code. Also available locally for analysis.", description="Tool and skill invocations from Claude Code. Also available locally for analysis.",
legacy_username_column="username",
), ),
InternalTable( InternalTable(
registry_id="agnes_audit", registry_id="agnes_audit",
@ -89,9 +93,7 @@ INTERNAL_TABLES: tuple[InternalTable, ...] = (
), ),
) )
INTERNAL_TABLES_BY_ID: dict[str, InternalTable] = { INTERNAL_TABLES_BY_ID: dict[str, InternalTable] = {t.registry_id: t for t in INTERNAL_TABLES}
t.registry_id: t for t in INTERNAL_TABLES
}
def is_internal_table(table_id: str) -> bool: def is_internal_table(table_id: str) -> bool:
@ -106,7 +108,7 @@ def is_internal_table(table_id: str) -> bool:
# resolve to filesystem usernames with a `+`, and the session-data-dir # resolve to filesystem usernames with a `+`, and the session-data-dir
# layout already supports the same character class. # layout already supports the same character class.
_USERNAME_RE = re.compile(r"^[A-Za-z0-9._+-]{1,200}$") _USERNAME_RE = re.compile(r"^[A-Za-z0-9._+-]{1,200}$")
_USER_ID_RE = re.compile(r"^[A-Za-z0-9._@:+-]{1,200}$") _USER_ID_RE = re.compile(r"^[A-Za-z0-9._@:+-]{1,200}$")
class InternalAccessError(Exception): class InternalAccessError(Exception):
@ -128,16 +130,12 @@ def _filter_value(user: dict[str, Any], kind: str) -> str:
email = (user or {}).get("email", "") or "" email = (user or {}).get("email", "") or ""
username = email.split("@")[0] if "@" in email else email username = email.split("@")[0] if "@" in email else email
if not _USERNAME_RE.match(username): if not _USERNAME_RE.match(username):
raise InternalAccessError( raise InternalAccessError(f"user email {email!r} does not yield a safe username for scoping")
f"user email {email!r} does not yield a safe username for scoping"
)
return username return username
if kind == "user_id": if kind == "user_id":
uid = (user or {}).get("id", "") or "" uid = (user or {}).get("id", "") or ""
if not _USER_ID_RE.match(uid): if not _USER_ID_RE.match(uid):
raise InternalAccessError( raise InternalAccessError(f"user_id {uid!r} fails the safe-identifier check")
f"user_id {uid!r} fails the safe-identifier check"
)
return uid return uid
raise InternalAccessError(f"unknown filter_kind: {kind!r}") raise InternalAccessError(f"unknown filter_kind: {kind!r}")
@ -145,15 +143,24 @@ def _filter_value(user: dict[str, Any], kind: str) -> str:
def build_filter_clause(table: InternalTable, user: dict[str, Any], is_admin: bool) -> str: def build_filter_clause(table: InternalTable, user: dict[str, Any], is_admin: bool) -> str:
"""Return the WHERE clause for one internal table. """Return the WHERE clause for one internal table.
Admins get an empty string (unscoped view). Everyone else gets Admins get an empty string (unscoped view). Everyone else get
``WHERE <col> = '<value>'`` where value has been regex-validated. ``WHERE <col> = '<value>'`` where value has been regex-validated.
``agnes_sessions`` and ``agnes_telemetry`` filter primarily on
``user_id`` (stable UUID) but include an OR fallback on
``username`` (email local-part) for rows that pre-date the v45
backfill. Once all rows carry a non-NULL ``user_id`` the
``legacy_username_column`` field can be removed.
""" """
if is_admin: if is_admin:
return "" return ""
value = _filter_value(user, table.filter_kind) value = _filter_value(user, table.filter_kind)
# value is regex-validated to ``[A-Za-z0-9._@:-]`` so single-quote
# escape is unnecessary; double single-quote anyway for defense-in-depth.
safe = value.replace("'", "''") safe = value.replace("'", "''")
if table.legacy_username_column:
legacy = _filter_value(user, "username").replace("'", "''")
return f"WHERE ({table.filter_column} = '{safe}' OR {table.legacy_username_column} = '{legacy}')"
return f"WHERE {table.filter_column} = '{safe}'" return f"WHERE {table.filter_column} = '{safe}'"
@ -162,7 +169,7 @@ def build_filter_clause(table: InternalTable, user: dict[str, Any], is_admin: bo
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
_TABLE_REF_RE = re.compile( _TABLE_REF_RE = re.compile(
r'\b(' + '|'.join(re.escape(t.registry_id) for t in INTERNAL_TABLES) + r')\b', r"\b(" + "|".join(re.escape(t.registry_id) for t in INTERNAL_TABLES) + r")\b",
re.IGNORECASE, re.IGNORECASE,
) )
@ -175,7 +182,7 @@ _SQL_STRING_LITERAL_RE = re.compile(r"'(?:''|[^'])*'")
# comment-wrapped table name (`/**/users/**/`) can't slip past the # comment-wrapped table name (`/**/users/**/`) can't slip past the
# identifier scan downstream. # identifier scan downstream.
_SQL_BLOCK_COMMENT_RE = re.compile(r"/\*[\s\S]*?\*/") _SQL_BLOCK_COMMENT_RE = re.compile(r"/\*[\s\S]*?\*/")
_SQL_LINE_COMMENT_RE = re.compile(r"--[^\n]*") _SQL_LINE_COMMENT_RE = re.compile(r"--[^\n]*")
def _strip_sql_noise(sql: str) -> str: def _strip_sql_noise(sql: str) -> str:
@ -190,9 +197,7 @@ def _strip_sql_noise(sql: str) -> str:
return s return s
_INTERNAL_ALIAS_NAMES: frozenset[str] = frozenset( _INTERNAL_ALIAS_NAMES: frozenset[str] = frozenset(t.registry_id.lower() for t in INTERNAL_TABLES)
t.registry_id.lower() for t in INTERNAL_TABLES
)
def _sensitive_table_reference(stripped_sql: str, conn) -> str | None: def _sensitive_table_reference(stripped_sql: str, conn) -> str | None:
@ -211,10 +216,7 @@ def _sensitive_table_reference(stripped_sql: str, conn) -> str | None:
double-quoted (`"users"`) forms both match because the bare name double-quoted (`"users"`) forms both match because the bare name
still sits between word boundaries. still sits between word boundaries.
""" """
rows = conn.execute( rows = conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'main'").fetchall()
"SELECT table_name FROM information_schema.tables "
"WHERE table_schema = 'main'"
).fetchall()
for (name,) in rows: for (name,) in rows:
if name is None: if name is None:
continue continue
@ -300,17 +302,13 @@ def execute_internal_query(
sensitive = _sensitive_table_reference(stripped, get_system_db()) sensitive = _sensitive_table_reference(stripped, get_system_db())
if sensitive is not None: if sensitive is not None:
raise InternalAccessError( raise InternalAccessError(
f"non-admin SQL cannot reference table {sensitive!r}; " f"non-admin SQL cannot reference table {sensitive!r}; query one of the agnes_* aliases instead"
"query one of the agnes_* aliases instead"
) )
cte_parts = [] cte_parts = []
for table_id in refs: for table_id in refs:
table = INTERNAL_TABLES_BY_ID[table_id] table = INTERNAL_TABLES_BY_ID[table_id]
where_clause = build_filter_clause(table, user, is_admin) where_clause = build_filter_clause(table, user, is_admin)
cte_parts.append( cte_parts.append(f"{table.registry_id} AS (SELECT * FROM {table.source_table} {where_clause})")
f"{table.registry_id} AS "
f"(SELECT * FROM {table.source_table} {where_clause})"
)
cte_prefix = "WITH " + ", ".join(cte_parts) cte_prefix = "WITH " + ", ".join(cte_parts)
wrapped = f"{cte_prefix} SELECT * FROM ({sql}) AS _agnes_user_query" wrapped = f"{cte_prefix} SELECT * FROM ({sql}) AS _agnes_user_query"
@ -332,6 +330,7 @@ def execute_internal_query(
# Schema introspection — feeds /api/v2/schema/{id} for internal tables # Schema introspection — feeds /api/v2/schema/{id} for internal tables
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def get_schema(system_db_path: str, table_id: str) -> list[dict]: def get_schema(system_db_path: str, table_id: str) -> list[dict]:
"""Return the underlying physical schema for an internal table. """Return the underlying physical schema for an internal table.
@ -349,6 +348,7 @@ def get_schema(system_db_path: str, table_id: str) -> list[dict]:
return [] return []
table = INTERNAL_TABLES_BY_ID[table_id] table = INTERNAL_TABLES_BY_ID[table_id]
from src.db import get_system_db from src.db import get_system_db
cursor = get_system_db().cursor() cursor = get_system_db().cursor()
try: try:
rows = cursor.execute( rows = cursor.execute(
@ -357,10 +357,7 @@ def get_schema(system_db_path: str, table_id: str) -> list[dict]:
"WHERE table_name = ? ORDER BY ordinal_position", "WHERE table_name = ? ORDER BY ordinal_position",
[table.source_table], [table.source_table],
).fetchall() ).fetchall()
return [ return [{"name": r[0], "type": r[1], "nullable": r[2] == "YES"} for r in rows]
{"name": r[0], "type": r[1], "nullable": r[2] == "YES"}
for r in rows
]
finally: finally:
try: try:
cursor.close() cursor.close()

View file

@ -1,6 +1,6 @@
[project] [project]
name = "agnes-the-ai-analyst" name = "agnes-the-ai-analyst"
version = "0.54.12" version = "0.54.13"
description = "Agnes — AI Data Analyst platform for AI analytical systems" description = "Agnes — AI Data Analyst platform for AI analytical systems"
requires-python = ">=3.11,<3.14" requires-python = ">=3.11,<3.14"
license = "MIT" license = "MIT"

View file

@ -25,6 +25,43 @@ from src.repositories.session_processor_state import SessionProcessorStateReposi
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
def resolve_user_id(
conn: duckdb.DuckDBPyConnection,
username: str,
) -> str | None:
"""Map a session-directory name to the stable ``users.id`` UUID.
Two conventions exist for the directory name under
``/data/user_sessions/``:
* **Session collector** writes under the OS username, which in
current deployments equals the email local-part (e.g. ``alice``).
* **Upload API** writes under ``user["id"]`` a UUID.
Resolution order:
1. Exact match on ``users.id`` (covers the UUID path).
2. Email local-part match: ``users.email LIKE '<username>@%'``.
If multiple users share the same local-part (different domains),
we pick the one most recently updated.
3. Fallback: return ``None`` (orphaned / deleted user).
"""
row = conn.execute(
"SELECT id FROM users WHERE id = ?",
[username],
).fetchone()
if row:
return row[0]
escaped = username.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
row = conn.execute(
"SELECT id FROM users WHERE email LIKE ? || '@%' ESCAPE '\\' ORDER BY updated_at DESC NULLS LAST LIMIT 1",
[escaped],
).fetchone()
if row:
return row[0]
return None
DEFAULT_SESSION_DATA_DIR = Path(os.environ.get("SESSION_DATA_DIR", "/data/user_sessions")) DEFAULT_SESSION_DATA_DIR = Path(os.environ.get("SESSION_DATA_DIR", "/data/user_sessions"))
@ -60,6 +97,11 @@ def run_processor(
logger.info("No sessions to process for processor=%s", processor.name) logger.info("No sessions to process for processor=%s", processor.name)
return stats return stats
# Pre-resolve user_id per directory name so each processor can
# store the stable identity. Cache avoids repeated DB lookups when
# one user has many sessions.
_uid_cache: dict[str, str | None] = {}
for username, jsonl_path in candidates: for username, jsonl_path in candidates:
session_key = f"{username}/{jsonl_path.name}" session_key = f"{username}/{jsonl_path.name}"
try: try:
@ -67,7 +109,9 @@ def run_processor(
except Exception as e: except Exception as e:
logger.warning( logger.warning(
"Cannot hash %s for processor=%s: %s", "Cannot hash %s for processor=%s: %s",
session_key, processor.name, e, session_key,
processor.name,
e,
) )
stats["errors"] += 1 stats["errors"] += 1
stats["errors_detail"].append({"session": session_key, "error": str(e)}) stats["errors_detail"].append({"session": session_key, "error": str(e)})
@ -82,12 +126,23 @@ def run_processor(
stats["skipped"] += 1 stats["skipped"] += 1
continue continue
if username not in _uid_cache:
_uid_cache[username] = resolve_user_id(conn, username)
resolved_uid = _uid_cache[username]
try: try:
result = processor.process_session(jsonl_path, username, session_key, conn) result = processor.process_session(
jsonl_path,
username,
session_key,
conn,
user_id=resolved_uid,
)
except Exception as e: except Exception as e:
logger.exception( logger.exception(
"Processor %s failed on %s — leaving state unwritten for retry", "Processor %s failed on %s — leaving state unwritten for retry",
processor.name, session_key, processor.name,
session_key,
) )
stats["errors"] += 1 stats["errors"] += 1
stats["errors_detail"].append({"session": session_key, "error": str(e)}) stats["errors_detail"].append({"session": session_key, "error": str(e)})
@ -101,7 +156,8 @@ def run_processor(
# cause the same session to be retried forever. # cause the same session to be retried forever.
logger.warning( logger.warning(
"Processor %s returned non-ProcessorResult on %s; coercing to empty result", "Processor %s returned non-ProcessorResult on %s; coercing to empty result",
processor.name, session_key, processor.name,
session_key,
) )
result = ProcessorResult(items_count=0) result = ProcessorResult(items_count=0)

View file

@ -32,6 +32,8 @@ class UsageProcessor:
username: str, username: str,
session_key: str, session_key: str,
conn: duckdb.DuckDBPyConnection, conn: duckdb.DuckDBPyConnection,
*,
user_id: str | None = None,
) -> ProcessorResult: ) -> ProcessorResult:
turns = parse_jsonl(session_path) turns = parse_jsonl(session_path)
events = list(iter_events(turns)) events = list(iter_events(turns))
@ -63,6 +65,7 @@ class UsageProcessor:
"session_id": session_id, "session_id": session_id,
"session_file": session_key, "session_file": session_key,
"username": username, "username": username,
"user_id": user_id,
"event_uuid": e.event_uuid, "event_uuid": e.event_uuid,
"parent_uuid": e.parent_uuid, "parent_uuid": e.parent_uuid,
"event_type": e.event_type, "event_type": e.event_type,
@ -83,6 +86,7 @@ class UsageProcessor:
summary = compute_summary(turns, rows) summary = compute_summary(turns, rows)
summary["session_file"] = session_key summary["session_file"] = session_key
summary["username"] = username summary["username"] = username
summary["user_id"] = user_id
# Override session_id with the resolved one # Override session_id with the resolved one
if not summary.get("session_id"): if not summary.get("session_id"):
summary["session_id"] = session_id summary["session_id"] = session_id

View file

@ -40,13 +40,29 @@ from dataclasses import dataclass
from datetime import datetime, timezone, timedelta from datetime import datetime, timezone, timedelta
from typing import Iterator from typing import Iterator
USAGE_PROCESSOR_VERSION = 3 # v4: #293 added the user_id column to usage tables — bump forces the
# session-pipeline reprocess loop to backfill user_id on existing rows.
# (v3 was #303's <command-name> slash extraction.)
USAGE_PROCESSOR_VERSION = 4
BUILTIN_TOOLS = frozenset({ BUILTIN_TOOLS = frozenset(
"Bash", "Read", "Edit", "Write", "Grep", "Glob", "TodoWrite", {
"Task", "Agent", "NotebookEdit", "WebFetch", "WebSearch", "ExitPlanMode", "Bash",
"LS", # also built-in "Read",
}) "Edit",
"Write",
"Grep",
"Glob",
"TodoWrite",
"Task",
"Agent",
"NotebookEdit",
"WebFetch",
"WebSearch",
"ExitPlanMode",
"LS", # also built-in
}
)
# Claude Code wraps user-typed slash invocations as # Claude Code wraps user-typed slash invocations as
# <command-name>/<name></command-name> inside the user message content # <command-name>/<name></command-name> inside the user message content
@ -57,18 +73,23 @@ BUILTIN_TOOLS = frozenset({
COMMAND_NAME_RE = re.compile(r"<command-name>/([A-Za-z][\w:-]*)</command-name>") COMMAND_NAME_RE = re.compile(r"<command-name>/([A-Za-z][\w:-]*)</command-name>")
# Event types to skip entirely # Event types to skip entirely
_SKIP_TYPES = frozenset({ _SKIP_TYPES = frozenset(
"system", "summary", "file-history-snapshot", {
"queue-operation", "progress", "system",
}) "summary",
"file-history-snapshot",
"queue-operation",
"progress",
}
)
@dataclass(frozen=True) @dataclass(frozen=True)
class ParsedEvent: class ParsedEvent:
event_uuid: str | None event_uuid: str | None
parent_uuid: str | None parent_uuid: str | None
tool_id: str | None # tool_use 'id' (tu_xxx) from message.content item; None for slash_command tool_id: str | None # tool_use 'id' (tu_xxx) from message.content item; None for slash_command
event_type: str # 'tool_use' | 'slash_command' | 'subagent' | 'mcp_call' event_type: str # 'tool_use' | 'slash_command' | 'subagent' | 'mcp_call'
tool_name: str | None tool_name: str | None
skill_name: str | None skill_name: str | None
subagent_type: str | None subagent_type: str | None
@ -207,9 +228,7 @@ def iter_events(turns: list[dict]) -> Iterator[ParsedEvent]:
text_parts = [content] text_parts = [content]
elif isinstance(content, list): elif isinstance(content, list):
text_parts = [ text_parts = [
item.get("text", "") item.get("text", "") for item in content if isinstance(item, dict) and item.get("type") == "text"
for item in content
if isinstance(item, dict) and item.get("type") == "text"
] ]
else: else:
text_parts = [] text_parts = []
@ -243,7 +262,7 @@ class AttributionLookup:
""" """
def __init__(self, conn): def __init__(self, conn):
self._skills: dict[str, tuple[str, str]] = {} # name -> (source, ref_id) self._skills: dict[str, tuple[str, str]] = {} # name -> (source, ref_id)
self._agents: dict[str, tuple[str, str]] = {} self._agents: dict[str, tuple[str, str]] = {}
self._commands: dict[str, tuple[str, str]] = {} self._commands: dict[str, tuple[str, str]] = {}
@ -375,9 +394,7 @@ def compute_summary(turns: list[dict], events: list[dict]) -> dict:
started_at = min(timestamps) if timestamps else None started_at = min(timestamps) if timestamps else None
ended_at = max(timestamps) if timestamps else None ended_at = max(timestamps) if timestamps else None
wall_seconds = ( wall_seconds = int((ended_at - started_at).total_seconds()) if started_at and ended_at else 0
int((ended_at - started_at).total_seconds()) if started_at and ended_at else 0
)
active_seconds = compute_active_seconds(timestamps) active_seconds = compute_active_seconds(timestamps)
# Aggregate counts from events # Aggregate counts from events

View file

@ -42,6 +42,7 @@ class VerificationProcessor:
username: str, username: str,
session_key: str, session_key: str,
conn: duckdb.DuckDBPyConnection, conn: duckdb.DuckDBPyConnection,
**kwargs: object,
) -> ProcessorResult: ) -> ProcessorResult:
repo = KnowledgeRepository(conn) repo = KnowledgeRepository(conn)
session_id = f"session-{session_path.stem}-{username}" session_id = f"session-{session_path.stem}-{username}"
@ -126,7 +127,8 @@ class VerificationProcessor:
except Exception as e: except Exception as e:
logger.warning( logger.warning(
"Duplicate-candidate detection failed for %s: %s", "Duplicate-candidate detection failed for %s: %s",
item_id, e, item_id,
e,
) )
# Run contradiction detection inline. Failure of the LLM # Run contradiction detection inline. Failure of the LLM
@ -140,12 +142,15 @@ class VerificationProcessor:
except Exception as e: except Exception as e:
logger.warning( logger.warning(
"Unexpected error during contradiction check for %s: %s", "Unexpected error during contradiction check for %s: %s",
item_id, e, item_id,
e,
) )
logger.info( logger.info(
"Processed %s: %d verifications, %d items created", "Processed %s: %d verifications, %d items created",
session_key, len(verifications), items_created, session_key,
len(verifications),
items_created,
) )
return ProcessorResult(items_count=items_created) return ProcessorResult(items_count=items_created)

316
src/db.py
View file

@ -40,7 +40,7 @@ def _maybe_instrument(con, db_tag: str):
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$") _SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
SCHEMA_VERSION = 44 SCHEMA_VERSION = 45
_SYSTEM_SCHEMA = """ _SYSTEM_SCHEMA = """
CREATE TABLE IF NOT EXISTS schema_version ( CREATE TABLE IF NOT EXISTS schema_version (
@ -723,13 +723,18 @@ CREATE TABLE IF NOT EXISTS usage_events (
occurred_at TIMESTAMP NOT NULL, occurred_at TIMESTAMP NOT NULL,
processor_version INTEGER NOT NULL, processor_version INTEGER NOT NULL,
extracted_at TIMESTAMP DEFAULT current_timestamp, extracted_at TIMESTAMP DEFAULT current_timestamp,
friction_tags JSON friction_tags JSON,
user_id VARCHAR
); );
CREATE INDEX IF NOT EXISTS idx_usage_events_session ON usage_events(session_id); CREATE INDEX IF NOT EXISTS idx_usage_events_session ON usage_events(session_id);
CREATE INDEX IF NOT EXISTS idx_usage_events_user_time ON usage_events(username, occurred_at); CREATE INDEX IF NOT EXISTS idx_usage_events_user_time ON usage_events(username, occurred_at);
CREATE INDEX IF NOT EXISTS idx_usage_events_tool ON usage_events(tool_name); CREATE INDEX IF NOT EXISTS idx_usage_events_tool ON usage_events(tool_name);
CREATE INDEX IF NOT EXISTS idx_usage_events_skill ON usage_events(skill_name); CREATE INDEX IF NOT EXISTS idx_usage_events_skill ON usage_events(skill_name);
CREATE INDEX IF NOT EXISTS idx_usage_events_ref ON usage_events(source, ref_id); CREATE INDEX IF NOT EXISTS idx_usage_events_ref ON usage_events(source, ref_id);
-- idx_usage_events_user_id is created by _v44_to_v45, not here: _SYSTEM_SCHEMA
-- runs before the migration ladder, and CREATE TABLE IF NOT EXISTS won't add
-- user_id to a pre-v45 usage_events, so an index on it would fail to bind.
-- Same pattern as the v41 audit_log indices below.
CREATE TABLE IF NOT EXISTS usage_session_summary ( CREATE TABLE IF NOT EXISTS usage_session_summary (
session_file VARCHAR PRIMARY KEY, session_file VARCHAR PRIMARY KEY,
@ -760,10 +765,13 @@ CREATE TABLE IF NOT EXISTS usage_session_summary (
input_tokens BIGINT DEFAULT 0, input_tokens BIGINT DEFAULT 0,
output_tokens BIGINT DEFAULT 0, output_tokens BIGINT DEFAULT 0,
cache_read_tokens BIGINT DEFAULT 0, cache_read_tokens BIGINT DEFAULT 0,
cache_creation_tokens BIGINT DEFAULT 0 cache_creation_tokens BIGINT DEFAULT 0,
user_id VARCHAR
); );
CREATE INDEX IF NOT EXISTS idx_usage_session_user ON usage_session_summary(username); CREATE INDEX IF NOT EXISTS idx_usage_session_user ON usage_session_summary(username);
CREATE INDEX IF NOT EXISTS idx_usage_session_started ON usage_session_summary(started_at); CREATE INDEX IF NOT EXISTS idx_usage_session_started ON usage_session_summary(started_at);
-- idx_usage_session_user_id is created by _v44_to_v45, not here see the
-- note on idx_usage_events_user_id above.
CREATE TABLE IF NOT EXISTS usage_tool_daily ( CREATE TABLE IF NOT EXISTS usage_tool_daily (
day DATE NOT NULL, day DATE NOT NULL,
@ -881,15 +889,16 @@ def _try_open_system_db(db_path: str) -> duckdb.DuckDBPyConnection:
snapshot = Path(db_path).parent / "system.duckdb.pre-migrate" snapshot = Path(db_path).parent / "system.duckdb.pre-migrate"
if not snapshot.exists(): if not snapshot.exists():
logger.error( logger.error(
"WAL replay failed and no pre-migrate snapshot at %s" "WAL replay failed and no pre-migrate snapshot at %smanual recovery required.",
"manual recovery required.", snapshot, snapshot,
) )
raise raise
wal_path = Path(db_path + ".wal") wal_path = Path(db_path + ".wal")
logger.warning( logger.warning(
"WAL replay failed (%s) — auto-restoring from pre-migrate " "WAL replay failed (%s) — auto-restoring from pre-migrate "
"snapshot %s. The migration ladder will re-run on this start.", "snapshot %s. The migration ladder will re-run on this start.",
msg.split("\n", 1)[0][:200], snapshot, msg.split("\n", 1)[0][:200],
snapshot,
) )
# Move (not copy) the broken DB aside so an operator can post- # Move (not copy) the broken DB aside so an operator can post-
# mortem if needed. The pre-migrate snapshot becomes the new # mortem if needed. The pre-migrate snapshot becomes the new
@ -961,9 +970,7 @@ def get_analytics_db() -> duckdb.DuckDBPyConnection:
return _maybe_instrument(_analytics_db_conn.cursor(), "analytics") return _maybe_instrument(_analytics_db_conn.cursor(), "analytics")
def _reattach_remote_extensions( def _reattach_remote_extensions(conn: duckdb.DuckDBPyConnection, extracts_dir: Path) -> None:
conn: duckdb.DuckDBPyConnection, extracts_dir: Path
) -> None:
"""Re-LOAD DuckDB extensions listed in _remote_attach tables of each extract.duckdb. """Re-LOAD DuckDB extensions listed in _remote_attach tables of each extract.duckdb.
Called from get_analytics_db_readonly() after ATTACHing extract.duckdb files so Called from get_analytics_db_readonly() after ATTACHing extract.duckdb files so
@ -974,9 +981,7 @@ def _reattach_remote_extensions(
return return
try: try:
attached_dbs = { attached_dbs = {r[0] for r in conn.execute("SELECT database_name FROM duckdb_databases()").fetchall()}
r[0] for r in conn.execute("SELECT database_name FROM duckdb_databases()").fetchall()
}
except Exception: except Exception:
return return
@ -1013,9 +1018,7 @@ def _reattach_remote_extensions(
# Refresh attached list before processing each source's rows # Refresh attached list before processing each source's rows
try: try:
attached_dbs = { attached_dbs = {r[0] for r in conn.execute("SELECT database_name FROM duckdb_databases()").fetchall()}
r[0] for r in conn.execute("SELECT database_name FROM duckdb_databases()").fetchall()
}
except Exception: except Exception:
pass pass
@ -1042,7 +1045,8 @@ def _reattach_remote_extensions(
"query-path remote_attach: extension %r not in allowlist; " "query-path remote_attach: extension %r not in allowlist; "
"refusing to LOAD/ATTACH for source %s. Override via " "refusing to LOAD/ATTACH for source %s. Override via "
"AGNES_REMOTE_ATTACH_EXTENSIONS if intended.", "AGNES_REMOTE_ATTACH_EXTENSIONS if intended.",
extension, alias, extension,
alias,
) )
continue continue
if token_env and not is_token_env_allowed(token_env): if token_env and not is_token_env_allowed(token_env):
@ -1050,7 +1054,8 @@ def _reattach_remote_extensions(
"query-path remote_attach: token_env %r not in allowlist; " "query-path remote_attach: token_env %r not in allowlist; "
"refusing for source %s. Override via " "refusing for source %s. Override via "
"AGNES_REMOTE_ATTACH_TOKEN_ENVS if intended.", "AGNES_REMOTE_ATTACH_TOKEN_ENVS if intended.",
token_env, alias, token_env,
alias,
) )
continue continue
if alias in attached_dbs: if alias in attached_dbs:
@ -1081,25 +1086,20 @@ def _reattach_remote_extensions(
except BQMetadataAuthError as e: except BQMetadataAuthError as e:
logger.error( logger.error(
"Failed to fetch BQ metadata token for %s: %s — skipping ATTACH", "Failed to fetch BQ metadata token for %s: %s — skipping ATTACH",
alias, e, alias,
e,
) )
continue continue
escaped = escape_sql_string_literal(bq_token) escaped = escape_sql_string_literal(bq_token)
secret_name = f"bq_secret_{alias}" secret_name = f"bq_secret_{alias}"
conn.execute( conn.execute(f"CREATE OR REPLACE SECRET {secret_name} (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
f"CREATE OR REPLACE SECRET {secret_name} "
f"(TYPE bigquery, ACCESS_TOKEN '{escaped}')"
)
from connectors.bigquery.access import apply_bq_session_settings from connectors.bigquery.access import apply_bq_session_settings
apply_bq_session_settings(conn) apply_bq_session_settings(conn)
conn.execute( conn.execute(f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)")
f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)"
)
elif token: elif token:
escaped_token = escape_sql_string_literal(token) escaped_token = escape_sql_string_literal(token)
conn.execute( conn.execute(f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')")
f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')"
)
# Apply BQ session settings on every BQ-extension attach, # Apply BQ session settings on every BQ-extension attach,
# not only the metadata-token branch above. Previously the # not only the metadata-token branch above. Previously the
# token-based branch fell through without setting # token-based branch fell through without setting
@ -1107,13 +1107,13 @@ def _reattach_remote_extensions(
# in place and causing "remote query timeout" surprises. # in place and causing "remote query timeout" surprises.
if extension == "bigquery": if extension == "bigquery":
from connectors.bigquery.access import apply_bq_session_settings from connectors.bigquery.access import apply_bq_session_settings
apply_bq_session_settings(conn) apply_bq_session_settings(conn)
else: else:
conn.execute( conn.execute(f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)")
f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)"
)
if extension == "bigquery": if extension == "bigquery":
from connectors.bigquery.access import apply_bq_session_settings from connectors.bigquery.access import apply_bq_session_settings
apply_bq_session_settings(conn) apply_bq_session_settings(conn)
attached_dbs.add(alias) attached_dbs.add(alias)
logger.debug("Re-attached remote source %s via %s extension", alias, extension) logger.debug("Re-attached remote source %s via %s extension", alias, extension)
@ -1434,8 +1434,7 @@ _V16_TO_V17_MIGRATIONS = [
PRIMARY KEY (item_a_id, item_b_id, relation_type) PRIMARY KEY (item_a_id, item_b_id, relation_type)
) )
""", """,
"CREATE INDEX IF NOT EXISTS idx_knowledge_item_relations_resolved " "CREATE INDEX IF NOT EXISTS idx_knowledge_item_relations_resolved ON knowledge_item_relations(resolved)",
"ON knowledge_item_relations(resolved)",
] ]
@ -1458,14 +1457,15 @@ _V19_TO_V20_MIGRATIONS = [
# first so implies references resolve cleanly when expand_implies does BFS. # first so implies references resolve cleanly when expand_implies does BFS.
_CORE_ROLES_SEED = [ _CORE_ROLES_SEED = [
# (key, display_name, description, implies) # (key, display_name, description, implies)
("core.viewer", "Viewer", ("core.viewer", "Viewer", "Read-only access to permitted datasets.", []),
"Read-only access to permitted datasets.", []), ("core.analyst", "Analyst", "Default user role; query data, run analyses.", ["core.viewer"]),
("core.analyst", "Analyst", (
"Default user role; query data, run analyses.", ["core.viewer"]), "core.km_admin",
("core.km_admin", "Knowledge-management admin", "Knowledge-management admin",
"Manages metric definitions and column metadata.", ["core.analyst"]), "Manages metric definitions and column metadata.",
("core.admin", "Administrator", ["core.analyst"],
"Full system access; bypasses dataset_permissions.", ["core.km_admin"]), ),
("core.admin", "Administrator", "Full system access; bypasses dataset_permissions.", ["core.km_admin"]),
] ]
# Maps the legacy users.role string values onto core.* keys for the v8→v9 # Maps the legacy users.role string values onto core.* keys for the v8→v9
@ -1486,10 +1486,8 @@ SYSTEM_EVERYONE_GROUP = "Everyone"
# app.auth.access (admin short-circuit) and the OAuth callback (default # app.auth.access (admin short-circuit) and the OAuth callback (default
# Everyone membership for new users); changing them is a breaking change. # Everyone membership for new users); changing them is a breaking change.
_SYSTEM_GROUPS_SEED = [ _SYSTEM_GROUPS_SEED = [
(SYSTEM_ADMIN_GROUP, (SYSTEM_ADMIN_GROUP, "System: full access to all data and admin actions"),
"System: full access to all data and admin actions"), (SYSTEM_EVERYONE_GROUP, "System: default group every user is implicitly a member of"),
(SYSTEM_EVERYONE_GROUP,
"System: default group every user is implicitly a member of"),
] ]
@ -1505,9 +1503,7 @@ def _seed_system_groups(conn: duckdb.DuckDBPyConnection) -> None:
import uuid as _uuid import uuid as _uuid
for name, description in _SYSTEM_GROUPS_SEED: for name, description in _SYSTEM_GROUPS_SEED:
existing = conn.execute( existing = conn.execute("SELECT id, is_system FROM user_groups WHERE name = ?", [name]).fetchone()
"SELECT id, is_system FROM user_groups WHERE name = ?", [name]
).fetchone()
if existing is None: if existing is None:
conn.execute( conn.execute(
"""INSERT INTO user_groups (id, name, description, is_system, created_by) """INSERT INTO user_groups (id, name, description, is_system, created_by)
@ -1549,9 +1545,7 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
try: try:
_seed_system_groups(conn) _seed_system_groups(conn)
admin_group_id = conn.execute( admin_group_id = conn.execute("SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP]).fetchone()[0]
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_ADMIN_GROUP]
).fetchone()[0]
everyone_group_id = conn.execute( everyone_group_id = conn.execute(
"SELECT id FROM user_groups WHERE name = ?", [SYSTEM_EVERYONE_GROUP] "SELECT id FROM user_groups WHERE name = ?", [SYSTEM_EVERYONE_GROUP]
).fetchone()[0] ).fetchone()[0]
@ -1560,16 +1554,14 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
# column having been physically dropped already (re-run safety) and of # column having been physically dropped already (re-run safety) and of
# malformed JSON (caught row-by-row, skipped silently). # malformed JSON (caught row-by-row, skipped silently).
has_groups_col = conn.execute( has_groups_col = conn.execute(
"SELECT 1 FROM information_schema.columns " "SELECT 1 FROM information_schema.columns WHERE table_name = 'users' AND column_name = 'groups'"
"WHERE table_name = 'users' AND column_name = 'groups'"
).fetchone() ).fetchone()
if has_groups_col: if has_groups_col:
rows = conn.execute( rows = conn.execute("SELECT id, groups FROM users WHERE groups IS NOT NULL").fetchall()
"SELECT id, groups FROM users WHERE groups IS NOT NULL"
).fetchall()
for user_id, groups_json in rows: for user_id, groups_json in rows:
try: try:
import json as _json import json as _json
names = _json.loads(groups_json) if isinstance(groups_json, str) else (groups_json or []) names = _json.loads(groups_json) if isinstance(groups_json, str) else (groups_json or [])
except (ValueError, TypeError): except (ValueError, TypeError):
names = [] names = []
@ -1579,7 +1571,8 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
if not isinstance(name, str) or not name.strip(): if not isinstance(name, str) or not name.strip():
continue continue
group_row = conn.execute( group_row = conn.execute(
"SELECT id FROM user_groups WHERE name = ?", [name], "SELECT id FROM user_groups WHERE name = ?",
[name],
).fetchone() ).fetchone()
if not group_row: if not group_row:
continue continue
@ -1592,9 +1585,9 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
) )
except duckdb.ConstraintException: except duckdb.ConstraintException:
logger.debug( logger.debug(
"v13 backfill step 2 (google_sync): skipped " "v13 backfill step 2 (google_sync): skipped insert for user=%s group=%s — already present",
"insert for user=%s group=%s — already present", user_id,
user_id, name, name,
) )
# 3. core.admin grants → Admin membership. Tolerant of either table being # 3. core.admin grants → Admin membership. Tolerant of either table being
@ -1670,7 +1663,8 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
logger.debug( logger.debug(
"v13 backfill step 5 (resource_grants): skipped " "v13 backfill step 5 (resource_grants): skipped "
"insert for group=%s resource=%s — already migrated", "insert for group=%s resource=%s — already migrated",
group_id, resource_id, group_id,
resource_id,
) )
# Audit: log any non-core capability grants before dropping the # Audit: log any non-core capability grants before dropping the
@ -1695,7 +1689,8 @@ def _v12_to_v13_finalize(conn: duckdb.DuckDBPyConnection) -> None:
"If this role was registered via register_internal_role(), " "If this role was registered via register_internal_role(), "
"the affected users need to be re-added to an " "the affected users need to be re-added to an "
"appropriate user_group post-upgrade.", "appropriate user_group post-upgrade.",
cnt, role_key, cnt,
role_key,
) )
# 6. Drop legacy tables in FK-correct order: dependent tables first. # 6. Drop legacy tables in FK-correct order: dependent tables first.
@ -1754,8 +1749,7 @@ def _v13_to_v14_finalize(conn: duckdb.DuckDBPyConnection) -> None:
).fetchone()[0] ).fetchone()[0]
if orphan_members: if orphan_members:
logger.warning( logger.warning(
"v14 migration: dropping %d orphan user_group_members rows " "v14 migration: dropping %d orphan user_group_members rows (group_id pointed at a deleted user_groups.id)",
"(group_id pointed at a deleted user_groups.id)",
orphan_members, orphan_members,
) )
if orphan_grants: if orphan_grants:
@ -1778,9 +1772,7 @@ def _v13_to_v14_finalize(conn: duckdb.DuckDBPyConnection) -> None:
) )
# user_group_members rebuild # user_group_members rebuild
conn.execute( conn.execute("ALTER TABLE user_group_members RENAME TO user_group_members_v13_pre")
"ALTER TABLE user_group_members RENAME TO user_group_members_v13_pre"
)
conn.execute( conn.execute(
"""CREATE TABLE user_group_members ( """CREATE TABLE user_group_members (
user_id VARCHAR NOT NULL, user_id VARCHAR NOT NULL,
@ -1800,9 +1792,7 @@ def _v13_to_v14_finalize(conn: duckdb.DuckDBPyConnection) -> None:
conn.execute("DROP TABLE user_group_members_v13_pre") conn.execute("DROP TABLE user_group_members_v13_pre")
# resource_grants rebuild # resource_grants rebuild
conn.execute( conn.execute("ALTER TABLE resource_grants RENAME TO resource_grants_v13_pre")
"ALTER TABLE resource_grants RENAME TO resource_grants_v13_pre"
)
conn.execute( conn.execute(
"""CREATE TABLE resource_grants ( """CREATE TABLE resource_grants (
id VARCHAR PRIMARY KEY, id VARCHAR PRIMARY KEY,
@ -1913,11 +1903,13 @@ def _v18_to_v19_finalize(conn: duckdb.DuckDBPyConnection) -> None:
`primary_key`) still migrate cleanly. Wrapped in BEGIN/COMMIT; `primary_key`) still migrate cleanly. Wrapped in BEGIN/COMMIT;
on error ROLLBACK and the outer caller skips the schema_version bump. on error ROLLBACK and the outer caller skips the schema_version bump.
""" """
def _existing_cols(table: str) -> set[str]: def _existing_cols(table: str) -> set[str]:
return { return {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = ?", [table], "SELECT column_name FROM information_schema.columns WHERE table_name = ?",
[table],
).fetchall() ).fetchall()
} }
@ -1951,26 +1943,29 @@ def _v18_to_v19_finalize(conn: duckdb.DuckDBPyConnection) -> None:
)""" )"""
) )
users_target_cols = [ users_target_cols = [
"id", "email", "name", "password_hash", "id",
"setup_token", "setup_token_created", "email",
"reset_token", "reset_token_created", "name",
"active", "deactivated_at", "deactivated_by", "password_hash",
"created_at", "updated_at", "setup_token",
"setup_token_created",
"reset_token",
"reset_token_created",
"active",
"deactivated_at",
"deactivated_by",
"created_at",
"updated_at",
] ]
old_users_cols = _existing_cols("users_v18_pre") old_users_cols = _existing_cols("users_v18_pre")
common = [c for c in users_target_cols if c in old_users_cols] common = [c for c in users_target_cols if c in old_users_cols]
col_list = ", ".join(common) col_list = ", ".join(common)
conn.execute( conn.execute(f"INSERT INTO users ({col_list}) SELECT {col_list} FROM users_v18_pre")
f"INSERT INTO users ({col_list}) "
f"SELECT {col_list} FROM users_v18_pre"
)
conn.execute("DROP TABLE users_v18_pre") conn.execute("DROP TABLE users_v18_pre")
# 4: rebuild table_registry without `is_public` column. # 4: rebuild table_registry without `is_public` column.
if "is_public" in _existing_cols("table_registry"): if "is_public" in _existing_cols("table_registry"):
conn.execute( conn.execute("ALTER TABLE table_registry RENAME TO table_registry_v18_pre")
"ALTER TABLE table_registry RENAME TO table_registry_v18_pre"
)
conn.execute( conn.execute(
"""CREATE TABLE table_registry ( """CREATE TABLE table_registry (
id VARCHAR PRIMARY KEY, id VARCHAR PRIMARY KEY,
@ -1990,18 +1985,25 @@ def _v18_to_v19_finalize(conn: duckdb.DuckDBPyConnection) -> None:
)""" )"""
) )
registry_target_cols = [ registry_target_cols = [
"id", "name", "source_type", "bucket", "source_table", "id",
"sync_strategy", "query_mode", "sync_schedule", "name",
"profile_after_sync", "primary_key", "folder", "source_type",
"description", "registered_by", "registered_at", "bucket",
"source_table",
"sync_strategy",
"query_mode",
"sync_schedule",
"profile_after_sync",
"primary_key",
"folder",
"description",
"registered_by",
"registered_at",
] ]
old_registry_cols = _existing_cols("table_registry_v18_pre") old_registry_cols = _existing_cols("table_registry_v18_pre")
common = [c for c in registry_target_cols if c in old_registry_cols] common = [c for c in registry_target_cols if c in old_registry_cols]
col_list = ", ".join(common) col_list = ", ".join(common)
conn.execute( conn.execute(f"INSERT INTO table_registry ({col_list}) SELECT {col_list} FROM table_registry_v18_pre")
f"INSERT INTO table_registry ({col_list}) "
f"SELECT {col_list} FROM table_registry_v18_pre"
)
conn.execute("DROP TABLE table_registry_v18_pre") conn.execute("DROP TABLE table_registry_v18_pre")
conn.execute("COMMIT") conn.execute("COMMIT")
@ -2025,9 +2027,7 @@ def _seed_core_roles(conn: duckdb.DuckDBPyConnection) -> None:
import uuid as _uuid import uuid as _uuid
for key, display_name, description, implies in _CORE_ROLES_SEED: for key, display_name, description, implies in _CORE_ROLES_SEED:
existing = conn.execute( existing = conn.execute("SELECT id FROM internal_roles WHERE key = ?", [key]).fetchone()
"SELECT id FROM internal_roles WHERE key = ?", [key]
).fetchone()
implies_json = _json.dumps(implies) implies_json = _json.dumps(implies)
if existing: if existing:
conn.execute( conn.execute(
@ -2059,25 +2059,21 @@ def _backfill_users_role_to_grants(conn: duckdb.DuckDBPyConnection) -> None:
# Verify users.role column still exists (we may be re-running after a # Verify users.role column still exists (we may be re-running after a
# half-applied migration); skip silently if it's already gone. # half-applied migration); skip silently if it's already gone.
has_role_col = conn.execute( has_role_col = conn.execute(
"SELECT 1 FROM information_schema.columns " "SELECT 1 FROM information_schema.columns WHERE table_name = 'users' AND column_name = 'role'"
"WHERE table_name = 'users' AND column_name = 'role'"
).fetchone() ).fetchone()
if not has_role_col: if not has_role_col:
return return
rows = conn.execute( rows = conn.execute("SELECT id, role FROM users WHERE role IS NOT NULL").fetchall()
"SELECT id, role FROM users WHERE role IS NOT NULL"
).fetchall()
backfilled = 0 backfilled = 0
for user_id, role_str in rows: for user_id, role_str in rows:
role_key = _LEGACY_ROLE_TO_CORE_KEY.get(role_str, "core.viewer") role_key = _LEGACY_ROLE_TO_CORE_KEY.get(role_str, "core.viewer")
role_row = conn.execute( role_row = conn.execute("SELECT id FROM internal_roles WHERE key = ?", [role_key]).fetchone()
"SELECT id FROM internal_roles WHERE key = ?", [role_key]
).fetchone()
if not role_row: if not role_row:
logger.warning( logger.warning(
"v9 backfill: core role %s missing — skipping user %s", "v9 backfill: core role %s missing — skipping user %s",
role_key, user_id, role_key,
user_id,
) )
continue continue
try: try:
@ -2363,8 +2359,7 @@ def _v28_to_v29_finalize(conn) -> None:
are gone after the first run and the seed INSERTs use ON CONFLICT. are gone after the first run and the seed INSERTs use ON CONFLICT.
""" """
has_welcome = conn.execute( has_welcome = conn.execute(
"SELECT 1 FROM information_schema.tables " "SELECT 1 FROM information_schema.tables WHERE table_schema = 'main' AND table_name = 'welcome_template'"
"WHERE table_schema = 'main' AND table_name = 'welcome_template'"
).fetchone() ).fetchone()
if has_welcome: if has_welcome:
conn.execute( conn.execute(
@ -2375,8 +2370,7 @@ def _v28_to_v29_finalize(conn) -> None:
conn.execute("DROP TABLE welcome_template") conn.execute("DROP TABLE welcome_template")
has_claude_md = conn.execute( has_claude_md = conn.execute(
"SELECT 1 FROM information_schema.tables " "SELECT 1 FROM information_schema.tables WHERE table_schema = 'main' AND table_name = 'claude_md_template'"
"WHERE table_schema = 'main' AND table_name = 'claude_md_template'"
).fetchone() ).fetchone()
if has_claude_md: if has_claude_md:
conn.execute( conn.execute(
@ -2391,8 +2385,7 @@ def _v28_to_v29_finalize(conn) -> None:
# operators) or via a prior migration run (idempotent re-execution). # operators) or via a prior migration run (idempotent re-execution).
for key in ("welcome", "claude_md", "home"): for key in ("welcome", "claude_md", "home"):
conn.execute( conn.execute(
"INSERT INTO instance_templates (key, content) VALUES (?, NULL) " "INSERT INTO instance_templates (key, content) VALUES (?, NULL) ON CONFLICT (key) DO NOTHING",
"ON CONFLICT (key) DO NOTHING",
[key], [key],
) )
@ -2531,15 +2524,12 @@ def _v34_to_v35_migrate(conn: duckdb.DuckDBPyConnection) -> None:
The audit columns (``archived_at``, ``archived_by``) ship first The audit columns (``archived_at``, ``archived_by``) ship first
behind ``IF NOT EXISTS`` so they're safe in all three states. behind ``IF NOT EXISTS`` so they're safe in all three states.
""" """
conn.execute( conn.execute("ALTER TABLE store_entities ADD COLUMN IF NOT EXISTS archived_at TIMESTAMP")
"ALTER TABLE store_entities ADD COLUMN IF NOT EXISTS archived_at TIMESTAMP" conn.execute("ALTER TABLE store_entities ADD COLUMN IF NOT EXISTS archived_by VARCHAR")
)
conn.execute(
"ALTER TABLE store_entities ADD COLUMN IF NOT EXISTS archived_by VARCHAR"
)
cols = { cols = {
r[0] for r in conn.execute( r[0]
for r in conn.execute(
"SELECT column_name FROM information_schema.columns " "SELECT column_name FROM information_schema.columns "
"WHERE table_name = 'store_entities' " "WHERE table_name = 'store_entities' "
" AND column_name IN ('visibility_status', '_vis_v35')" " AND column_name IN ('visibility_status', '_vis_v35')"
@ -2553,18 +2543,10 @@ def _v34_to_v35_migrate(conn: duckdb.DuckDBPyConnection) -> None:
# in v36 (DuckDB ALTER COLUMN supports SET NOT NULL / SET DEFAULT # in v36 (DuckDB ALTER COLUMN supports SET NOT NULL / SET DEFAULT
# but not ADD CHECK on an existing column). Value-list enforcement # but not ADD CHECK on an existing column). Value-list enforcement
# is application-side via VALID_VISIBILITY in StoreEntitiesRepository. # is application-side via VALID_VISIBILITY in StoreEntitiesRepository.
conn.execute( conn.execute("ALTER TABLE store_entities ADD COLUMN _vis_v35 VARCHAR")
"ALTER TABLE store_entities ADD COLUMN _vis_v35 VARCHAR" conn.execute("UPDATE store_entities SET _vis_v35 = visibility_status")
) conn.execute("ALTER TABLE store_entities DROP COLUMN visibility_status")
conn.execute( conn.execute("ALTER TABLE store_entities RENAME COLUMN _vis_v35 TO visibility_status")
"UPDATE store_entities SET _vis_v35 = visibility_status"
)
conn.execute(
"ALTER TABLE store_entities DROP COLUMN visibility_status"
)
conn.execute(
"ALTER TABLE store_entities RENAME COLUMN _vis_v35 TO visibility_status"
)
elif has_temp and not has_vis: elif has_temp and not has_vis:
# Partial-rebuild recovery — prior attempt dropped visibility_status # Partial-rebuild recovery — prior attempt dropped visibility_status
# but the RENAME never landed. Data is already in _vis_v35 from # but the RENAME never landed. Data is already in _vis_v35 from
@ -2573,19 +2555,14 @@ def _v34_to_v35_migrate(conn: duckdb.DuckDBPyConnection) -> None:
"v34→v35 detected partial-rebuild state (visibility_status " "v34→v35 detected partial-rebuild state (visibility_status "
"missing, _vis_v35 present); recovering via RENAME" "missing, _vis_v35 present); recovering via RENAME"
) )
conn.execute( conn.execute("ALTER TABLE store_entities RENAME COLUMN _vis_v35 TO visibility_status")
"ALTER TABLE store_entities RENAME COLUMN _vis_v35 TO visibility_status"
)
elif has_vis and has_temp: elif has_vis and has_temp:
# Both present — earlier rebuild aborted before the DROP. # Both present — earlier rebuild aborted before the DROP.
# visibility_status holds the canonical values; drop the temp. # visibility_status holds the canonical values; drop the temp.
logger.warning( logger.warning(
"v34→v35 detected partial-rebuild state (both visibility_status " "v34→v35 detected partial-rebuild state (both visibility_status and _vis_v35 present); dropping the temp"
"and _vis_v35 present); dropping the temp"
)
conn.execute(
"ALTER TABLE store_entities DROP COLUMN _vis_v35"
) )
conn.execute("ALTER TABLE store_entities DROP COLUMN _vis_v35")
# else: neither column is present, which means store_entities itself # else: neither column is present, which means store_entities itself
# is at a shape ahead of v34. _SYSTEM_SCHEMA above already created # is at a shape ahead of v34. _SYSTEM_SCHEMA above already created
# the post-v35 shape; nothing to do here. # the post-v35 shape; nothing to do here.
@ -2894,10 +2871,7 @@ def _v42_to_v43(conn: duckdb.DuckDBPyConnection) -> None:
UNIQUE (user_id, name) UNIQUE (user_id, name)
) )
""") """)
conn.execute( conn.execute("CREATE INDEX IF NOT EXISTS idx_obs_views_user ON user_observability_views(user_id, created_at)")
"CREATE INDEX IF NOT EXISTS idx_obs_views_user "
"ON user_observability_views(user_id, created_at)"
)
def _v43_to_v44(conn: duckdb.DuckDBPyConnection) -> None: def _v43_to_v44(conn: duckdb.DuckDBPyConnection) -> None:
@ -2915,19 +2889,35 @@ def _v43_to_v44(conn: duckdb.DuckDBPyConnection) -> None:
from 1 2 in the same release, which the session-pipeline from 1 2 in the same release, which the session-pipeline
reprocess loop uses to invalidate stale summaries. reprocess loop uses to invalidate stale summaries.
""" """
conn.execute( conn.execute("ALTER TABLE users ADD COLUMN IF NOT EXISTS last_pull_at TIMESTAMP")
"ALTER TABLE users ADD COLUMN IF NOT EXISTS last_pull_at TIMESTAMP"
)
for col in ( for col in (
"input_tokens", "input_tokens",
"output_tokens", "output_tokens",
"cache_read_tokens", "cache_read_tokens",
"cache_creation_tokens", "cache_creation_tokens",
): ):
conn.execute( conn.execute(f"ALTER TABLE usage_session_summary ADD COLUMN IF NOT EXISTS {col} BIGINT DEFAULT 0")
f"ALTER TABLE usage_session_summary "
f"ADD COLUMN IF NOT EXISTS {col} BIGINT DEFAULT 0"
) def _v44_to_v45(conn: duckdb.DuckDBPyConnection) -> None:
"""v45: add user_id column to usage tables for stable RBAC filtering.
The ``username`` column in ``usage_session_summary`` / ``usage_events``
stores the directory name from the session-data path, which is either
an email local-part (session collector) or a UUID (upload API). Email
local-parts are unstable they change when users rename. ``user_id``
is the stable identity and becomes the authoritative RBAC filter
column for the ``agnes_sessions`` / ``agnes_telemetry`` aliases.
Backfill: the UsageProcessor populates ``user_id`` on every
(re)process run. Existing rows get backfilled when
``USAGE_PROCESSOR_VERSION`` bumps, which triggers the session-pipeline
reprocess loop.
"""
conn.execute("ALTER TABLE usage_session_summary ADD COLUMN IF NOT EXISTS user_id VARCHAR")
conn.execute("ALTER TABLE usage_events ADD COLUMN IF NOT EXISTS user_id VARCHAR")
conn.execute("CREATE INDEX IF NOT EXISTS idx_usage_session_user_id ON usage_session_summary(user_id)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_usage_events_user_id ON usage_events(user_id)")
_V33_TO_V34_MIGRATIONS = [ _V33_TO_V34_MIGRATIONS = [
@ -3049,8 +3039,10 @@ def _replace_for_v24(project_id: str):
function-form replacement is the defensive idiom it makes the function-form replacement is the defensive idiom it makes the
intent explicit and removes the dependency on re.sub's replacement- intent explicit and removes the dependency on re.sub's replacement-
string escaping rules.""" string escaping rules."""
def _repl(m): def _repl(m):
return f"`{project_id}.{m.group(1)}.{m.group(2)}`" return f"`{project_id}.{m.group(1)}.{m.group(2)}`"
return _repl return _repl
@ -3059,6 +3051,7 @@ def _v23_to_v24_finalize(conn: duckdb.DuckDBPyConnection) -> None:
try: try:
from app.instance_config import get_value from app.instance_config import get_value
project_id = get_value("data_source", "bigquery", "project", default="") or "" project_id = get_value("data_source", "bigquery", "project", default="") or ""
except Exception: except Exception:
project_id = "" project_id = ""
@ -3092,7 +3085,7 @@ def _v23_to_v24_finalize(conn: duckdb.DuckDBPyConnection) -> None:
raise RuntimeError( raise RuntimeError(
f"v24 migration cannot complete: {len(rows)} materialized " f"v24 migration cannot complete: {len(rows)} materialized "
f"BigQuery row(s) need their source_query rewritten from " f"BigQuery row(s) need their source_query rewritten from "
f"DuckDB-flavor `bq.\"ds\".\"tbl\"` to BQ-native " f'DuckDB-flavor `bq."ds"."tbl"` to BQ-native '
f"`<project>.ds.tbl`, but `data_source.bigquery.project` is " f"`<project>.ds.tbl`, but `data_source.bigquery.project` is "
f"not configured. Set it via /admin/server-config (or " f"not configured. Set it via /admin/server-config (or "
f"`instance.yaml: data_source.bigquery.project`) and restart " f"`instance.yaml: data_source.bigquery.project`) and restart "
@ -3113,7 +3106,8 @@ def _v23_to_v24_finalize(conn: duckdb.DuckDBPyConnection) -> None:
[new_sq, row_id], [new_sq, row_id],
) )
logger.info( logger.info(
"v24 migration: rewrote source_query for row %r", row_id, "v24 migration: rewrote source_query for row %r",
row_id,
) )
conn.execute("COMMIT") conn.execute("COMMIT")
except Exception: except Exception:
@ -3179,16 +3173,12 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
[SCHEMA_VERSION], [SCHEMA_VERSION],
) )
# v22 setup_banner row (kept as compat per CLAUDE.md schema notes). # v22 setup_banner row (kept as compat per CLAUDE.md schema notes).
conn.execute( conn.execute("INSERT INTO setup_banner (id, content) VALUES (1, NULL) ON CONFLICT (id) DO NOTHING")
"INSERT INTO setup_banner (id, content) VALUES (1, NULL) "
"ON CONFLICT (id) DO NOTHING"
)
# v26 instance_templates seed — three canonical keys with NULL # v26 instance_templates seed — three canonical keys with NULL
# content (operator override absent → render OSS default). # content (operator override absent → render OSS default).
for key in ("welcome", "claude_md", "home"): for key in ("welcome", "claude_md", "home"):
conn.execute( conn.execute(
"INSERT INTO instance_templates (key, content) VALUES (?, NULL) " "INSERT INTO instance_templates (key, content) VALUES (?, NULL) ON CONFLICT (key) DO NOTHING",
"ON CONFLICT (key) DO NOTHING",
[key], [key],
) )
# v41 audit_log indices: _SYSTEM_SCHEMA omits CREATE INDEX to # v41 audit_log indices: _SYSTEM_SCHEMA omits CREATE INDEX to
@ -3207,6 +3197,10 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
# them on fresh installs (no-op ALTERs); kept here for the # them on fresh installs (no-op ALTERs); kept here for the
# ladder's chronological readability. # ladder's chronological readability.
_v43_to_v44(conn) _v43_to_v44(conn)
# v45 user_id column on usage tables. _SYSTEM_SCHEMA declares
# the columns for fresh installs; migration adds them for
# existing DBs. No-op on fresh.
_v44_to_v45(conn)
# Fresh-install seed is handled by the unconditional # Fresh-install seed is handled by the unconditional
# _seed_core_roles call at the bottom of _ensure_schema — # _seed_core_roles call at the bottom of _ensure_schema —
# left as a no-op branch here so the migration ladder still # left as a no-op branch here so the migration ladder still
@ -3248,8 +3242,7 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
# Skip UPDATE if the column never existed (e.g. test fixtures # Skip UPDATE if the column never existed (e.g. test fixtures
# starting from v2/v3 with a hand-crafted minimal users table). # starting from v2/v3 with a hand-crafted minimal users table).
has_role_col = conn.execute( has_role_col = conn.execute(
"SELECT 1 FROM information_schema.columns " "SELECT 1 FROM information_schema.columns WHERE table_name = 'users' AND column_name = 'role'"
"WHERE table_name = 'users' AND column_name = 'role'"
).fetchone() ).fetchone()
if has_role_col: if has_role_col:
conn.execute("UPDATE users SET role = NULL") conn.execute("UPDATE users SET role = NULL")
@ -3350,6 +3343,8 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
_v42_to_v43(conn) _v42_to_v43(conn)
if current < 44: if current < 44:
_v43_to_v44(conn) _v43_to_v44(conn)
if current < 45:
_v44_to_v45(conn)
conn.execute( conn.execute(
"UPDATE schema_version SET version = ?, applied_at = current_timestamp", "UPDATE schema_version SET version = ?, applied_at = current_timestamp",
[SCHEMA_VERSION], [SCHEMA_VERSION],
@ -3383,7 +3378,8 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
"this process before any container restart is the " "this process before any container restart is the "
"safe path; otherwise, the next start may need to " "safe path; otherwise, the next start may need to "
"restore from %s.", "restore from %s.",
e, _get_state_dir() / "system.duckdb.pre-migrate", e,
_get_state_dir() / "system.duckdb.pre-migrate",
) )
# Always run the system-groups seed when the DB is on a version this binary # Always run the system-groups seed when the DB is on a version this binary

View file

@ -16,15 +16,31 @@ class UsageRepository:
if not rows: if not rows:
return 0 return 0
cols = [ cols = [
"id", "session_id", "session_file", "username", "event_uuid", "parent_uuid", "id",
"event_type", "tool_name", "skill_name", "subagent_type", "command_name", "session_id",
"is_error", "source", "ref_id", "model", "cwd", "occurred_at", "processor_version", "session_file",
"username",
"event_uuid",
"parent_uuid",
"event_type",
"tool_name",
"skill_name",
"subagent_type",
"command_name",
"is_error",
"source",
"ref_id",
"model",
"cwd",
"occurred_at",
"processor_version",
"user_id",
] ]
placeholders = ",".join("?" for _ in cols) placeholders = ",".join("?" for _ in cols)
sql = f"INSERT OR IGNORE INTO usage_events ({','.join(cols)}) VALUES ({placeholders})" sql = f"INSERT OR IGNORE INTO usage_events ({','.join(cols)}) VALUES ({placeholders})"
self.conn.executemany(sql, [ self.conn.executemany(
[r.get(c) if c != "processor_version" else processor_version for c in cols] for r in rows sql, [[r.get(c) if c != "processor_version" else processor_version for c in cols] for r in rows]
]) )
return len(rows) return len(rows)
def upsert_summary(self, summary: dict, *, processor_version: int) -> None: def upsert_summary(self, summary: dict, *, processor_version: int) -> None:
@ -37,8 +53,8 @@ class UsageRepository:
tool_calls, tool_errors, skill_invocations, subagent_dispatches, tool_calls, tool_errors, skill_invocations, subagent_dispatches,
mcp_calls, slash_commands, distinct_tools, distinct_skills, mcp_calls, slash_commands, distinct_tools, distinct_skills,
primary_model, input_tokens, output_tokens, cache_read_tokens, primary_model, input_tokens, output_tokens, cache_read_tokens,
cache_creation_tokens, processor_version) cache_creation_tokens, processor_version, user_id)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
""", """,
[ [
summary["session_file"], summary["session_file"],
@ -64,18 +80,15 @@ class UsageRepository:
summary.get("cache_read_tokens", 0), summary.get("cache_read_tokens", 0),
summary.get("cache_creation_tokens", 0), summary.get("cache_creation_tokens", 0),
processor_version, processor_version,
summary.get("user_id"),
], ],
) )
def purge_for_session(self, session_file: str) -> int: def purge_for_session(self, session_file: str) -> int:
"""DELETE events + summary for one session — used on reprocess.""" """DELETE events + summary for one session — used on reprocess."""
r = self.conn.execute( r = self.conn.execute("DELETE FROM usage_events WHERE session_file = ?", [session_file])
"DELETE FROM usage_events WHERE session_file = ?", [session_file]
)
events_deleted = r.rowcount if r.rowcount else 0 events_deleted = r.rowcount if r.rowcount else 0
self.conn.execute( self.conn.execute("DELETE FROM usage_session_summary WHERE session_file = ?", [session_file])
"DELETE FROM usage_session_summary WHERE session_file = ?", [session_file]
)
return events_deleted return events_deleted
def delete_older_than(self, days: int) -> int: def delete_older_than(self, days: int) -> int:

View file

@ -8,6 +8,7 @@ The v19 step (#150) drops dataset_permissions, access_requests tables and
users.role, table_registry.is_public columns; v20 then ALTERs the post-v19 users.role, table_registry.is_public columns; v20 then ALTERs the post-v19
table_registry to add the source_query column. table_registry to add the source_query column.
""" """
import duckdb import duckdb
from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version
@ -98,7 +99,7 @@ def test_schema_version_is_44():
# output_tokens, cache_read_tokens, cache_creation_tokens). # output_tokens, cache_read_tokens, cache_creation_tokens).
# USAGE_PROCESSOR_VERSION simultaneously bumps 1→2 so the # USAGE_PROCESSOR_VERSION simultaneously bumps 1→2 so the
# reprocess loop backfills tokens on next tick. # reprocess loop backfills tokens on next tick.
assert SCHEMA_VERSION == 44 assert SCHEMA_VERSION == 45
def test_v37_marketplace_curator_columns(tmp_path): def test_v37_marketplace_curator_columns(tmp_path):
@ -109,9 +110,9 @@ def test_v37_marketplace_curator_columns(tmp_path):
_ensure_schema(conn) _ensure_schema(conn)
registry_cols = { registry_cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'marketplace_registry'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'marketplace_registry'"
).fetchall() ).fetchall()
} }
assert {"curator_name", "curator_email"} <= registry_cols, ( assert {"curator_name", "curator_email"} <= registry_cols, (
@ -119,9 +120,9 @@ def test_v37_marketplace_curator_columns(tmp_path):
) )
plugin_cols = { plugin_cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'marketplace_plugins'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'marketplace_plugins'"
).fetchall() ).fetchall()
} }
assert {"cover_photo_url", "video_url", "doc_links"} <= plugin_cols, ( assert {"cover_photo_url", "video_url", "doc_links"} <= plugin_cols, (
@ -139,10 +140,7 @@ def test_v36_db_migrates_to_current(tmp_path):
# Stand up a minimal v36-shape registry + plugin row, plus the # Stand up a minimal v36-shape registry + plugin row, plus the
# schema_version row that pins us to 36. # schema_version row that pins us to 36.
conn.execute( conn.execute("CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp)")
"CREATE TABLE schema_version (version INTEGER, "
"applied_at TIMESTAMP DEFAULT current_timestamp)"
)
conn.execute("INSERT INTO schema_version (version) VALUES (36)") conn.execute("INSERT INTO schema_version (version) VALUES (36)")
conn.execute("""CREATE TABLE marketplace_registry ( conn.execute("""CREATE TABLE marketplace_registry (
id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL, id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL,
@ -161,22 +159,15 @@ def test_v36_db_migrates_to_current(tmp_path):
PRIMARY KEY (marketplace_id, name) PRIMARY KEY (marketplace_id, name)
)""") )""")
conn.execute( conn.execute(
"INSERT INTO marketplace_registry (id, name, url) " "INSERT INTO marketplace_registry (id, name, url) VALUES ('legacy', 'Legacy', 'https://example.com/repo.git')"
"VALUES ('legacy', 'Legacy', 'https://example.com/repo.git')"
)
conn.execute(
"INSERT INTO marketplace_plugins (marketplace_id, name) "
"VALUES ('legacy', 'foo')"
) )
conn.execute("INSERT INTO marketplace_plugins (marketplace_id, name) VALUES ('legacy', 'foo')")
_ensure_schema(conn) _ensure_schema(conn)
assert get_schema_version(conn) == SCHEMA_VERSION assert get_schema_version(conn) == SCHEMA_VERSION
# v37 enrichment columns exist; existing rows preserved with NULL. # v37 enrichment columns exist; existing rows preserved with NULL.
row = conn.execute( row = conn.execute("SELECT curator_name, curator_email FROM marketplace_registry WHERE id = 'legacy'").fetchone()
"SELECT curator_name, curator_email FROM marketplace_registry "
"WHERE id = 'legacy'"
).fetchone()
assert row == (None, None) assert row == (None, None)
row = conn.execute( row = conn.execute(
@ -196,27 +187,18 @@ def test_v39_adds_marketplace_plugins_is_system(tmp_path):
_ensure_schema(conn) _ensure_schema(conn)
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'marketplace_plugins'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'marketplace_plugins'"
).fetchall() ).fetchall()
} }
assert "is_system" in cols, f"is_system missing from {cols}" assert "is_system" in cols, f"is_system missing from {cols}"
# New rows default to FALSE — required so a freshly-synced plugin # New rows default to FALSE — required so a freshly-synced plugin
# doesn't accidentally land in everyone's stack. # doesn't accidentally land in everyone's stack.
conn.execute( conn.execute("INSERT INTO marketplace_registry (id, name, url) VALUES ('m', 'M', 'https://example.com/repo.git')")
"INSERT INTO marketplace_registry (id, name, url) " conn.execute("INSERT INTO marketplace_plugins (marketplace_id, name) VALUES ('m', 'p')")
"VALUES ('m', 'M', 'https://example.com/repo.git')" row = conn.execute("SELECT is_system FROM marketplace_plugins WHERE marketplace_id = 'm' AND name = 'p'").fetchone()
)
conn.execute(
"INSERT INTO marketplace_plugins (marketplace_id, name) "
"VALUES ('m', 'p')"
)
row = conn.execute(
"SELECT is_system FROM marketplace_plugins "
"WHERE marketplace_id = 'm' AND name = 'p'"
).fetchone()
assert row[0] is False, f"new plugin defaulted to {row[0]!r}, expected False" assert row[0] is False, f"new plugin defaulted to {row[0]!r}, expected False"
conn.close() conn.close()
@ -230,10 +212,7 @@ def test_v38_db_migrates_to_v39(tmp_path):
# Stand up the v38 minimal shape: schema_version row + the two # Stand up the v38 minimal shape: schema_version row + the two
# marketplace tables + a pre-existing plugin row that must survive # marketplace tables + a pre-existing plugin row that must survive
# the migration with is_system = FALSE. # the migration with is_system = FALSE.
conn.execute( conn.execute("CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp)")
"CREATE TABLE schema_version (version INTEGER, "
"applied_at TIMESTAMP DEFAULT current_timestamp)"
)
conn.execute("INSERT INTO schema_version (version) VALUES (38)") conn.execute("INSERT INTO schema_version (version) VALUES (38)")
conn.execute("""CREATE TABLE marketplace_registry ( conn.execute("""CREATE TABLE marketplace_registry (
id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL, id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL,
@ -254,21 +233,17 @@ def test_v38_db_migrates_to_v39(tmp_path):
PRIMARY KEY (marketplace_id, name) PRIMARY KEY (marketplace_id, name)
)""") )""")
conn.execute( conn.execute(
"INSERT INTO marketplace_registry (id, name, url) " "INSERT INTO marketplace_registry (id, name, url) VALUES ('legacy', 'Legacy', 'https://example.com/repo.git')"
"VALUES ('legacy', 'Legacy', 'https://example.com/repo.git')"
)
conn.execute(
"INSERT INTO marketplace_plugins (marketplace_id, name) "
"VALUES ('legacy', 'foo')"
) )
conn.execute("INSERT INTO marketplace_plugins (marketplace_id, name) VALUES ('legacy', 'foo')")
_ensure_schema(conn) _ensure_schema(conn)
assert get_schema_version(conn) == SCHEMA_VERSION assert get_schema_version(conn) == SCHEMA_VERSION
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'marketplace_plugins'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'marketplace_plugins'"
).fetchall() ).fetchall()
} }
assert "is_system" in cols assert "is_system" in cols
@ -276,8 +251,7 @@ def test_v38_db_migrates_to_v39(tmp_path):
# Existing pre-v39 row backfilled to FALSE — no plugin lands in # Existing pre-v39 row backfilled to FALSE — no plugin lands in
# everyone's stack just because we ran the migration. # everyone's stack just because we ran the migration.
row = conn.execute( row = conn.execute(
"SELECT is_system FROM marketplace_plugins " "SELECT is_system FROM marketplace_plugins WHERE marketplace_id = 'legacy' AND name = 'foo'"
"WHERE marketplace_id = 'legacy' AND name = 'foo'"
).fetchone() ).fetchone()
assert row[0] is False, f"pre-existing row backfilled to {row[0]!r}" assert row[0] is False, f"pre-existing row backfilled to {row[0]!r}"
conn.close() conn.close()
@ -289,9 +263,9 @@ def test_v20_adds_source_query(tmp_path):
_ensure_schema(conn) _ensure_schema(conn)
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'table_registry'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'table_registry'"
).fetchall() ).fetchall()
} }
assert "source_query" in cols, f"source_query missing from {cols}" assert "source_query" in cols, f"source_query missing from {cols}"
@ -312,19 +286,13 @@ def test_claude_md_template_seeded_in_instance_templates(tmp_path):
_ensure_schema(conn) _ensure_schema(conn)
tables = { tables = {
r[0] for r in conn.execute( r[0]
"SELECT table_name FROM information_schema.tables " for r in conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'main'").fetchall()
"WHERE table_schema = 'main'"
).fetchall()
} }
assert "instance_templates" in tables assert "instance_templates" in tables
assert "claude_md_template" not in tables, ( assert "claude_md_template" not in tables, "claude_md_template should be consolidated away post-v28"
"claude_md_template should be consolidated away post-v28"
)
row = conn.execute( row = conn.execute("SELECT key, content FROM instance_templates WHERE key = 'claude_md'").fetchone()
"SELECT key, content FROM instance_templates WHERE key = 'claude_md'"
).fetchone()
assert row is not None assert row is not None
assert row[0] == "claude_md" assert row[0] == "claude_md"
assert row[1] is None # default = no override assert row[1] is None # default = no override
@ -340,10 +308,7 @@ def test_v19_db_migrates_to_v20(tmp_path):
# Simulate a v19 DB at minimal but realistic shape: schema_version row + # Simulate a v19 DB at minimal but realistic shape: schema_version row +
# a table_registry row in the post-v19 column shape (no is_public column, # a table_registry row in the post-v19 column shape (no is_public column,
# since v19 finalize dropped it via the table-rebuild idiom). # since v19 finalize dropped it via the table-rebuild idiom).
conn.execute( conn.execute("CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp)")
"CREATE TABLE schema_version (version INTEGER, "
"applied_at TIMESTAMP DEFAULT current_timestamp)"
)
conn.execute("INSERT INTO schema_version (version) VALUES (19)") conn.execute("INSERT INTO schema_version (version) VALUES (19)")
conn.execute("""CREATE TABLE table_registry ( conn.execute("""CREATE TABLE table_registry (
id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL, id VARCHAR PRIMARY KEY, name VARCHAR NOT NULL,
@ -361,16 +326,14 @@ def test_v19_db_migrates_to_v20(tmp_path):
assert get_schema_version(conn) == SCHEMA_VERSION # bumped 19→28 forward assert get_schema_version(conn) == SCHEMA_VERSION # bumped 19→28 forward
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'table_registry'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'table_registry'"
).fetchall() ).fetchall()
} }
assert "source_query" in cols assert "source_query" in cols
# Existing row preserved, new column NULL # Existing row preserved, new column NULL
row = conn.execute( row = conn.execute("SELECT id, source_query FROM table_registry WHERE id='foo'").fetchone()
"SELECT id, source_query FROM table_registry WHERE id='foo'"
).fetchone()
assert row == ("foo", None) assert row == ("foo", None)
conn.close() conn.close()
@ -389,8 +352,7 @@ def _make_v34_store_entities(conn):
) )
""") """)
conn.execute( conn.execute(
"INSERT INTO store_entities (id, visibility_status) VALUES " "INSERT INTO store_entities (id, visibility_status) VALUES ('a', 'approved'), ('b', 'pending'), ('c', 'hidden')"
"('a', 'approved'), ('b', 'pending'), ('c', 'hidden')"
) )
@ -409,9 +371,9 @@ def test_v34_to_v35_clean_path_rebuilds_visibility_column(tmp_path):
_v34_to_v35_migrate(conn) _v34_to_v35_migrate(conn)
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'store_entities'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'store_entities'"
).fetchall() ).fetchall()
} }
assert "visibility_status" in cols assert "visibility_status" in cols
@ -419,12 +381,8 @@ def test_v34_to_v35_clean_path_rebuilds_visibility_column(tmp_path):
assert "archived_at" in cols assert "archived_at" in cols
assert "archived_by" in cols assert "archived_by" in cols
rows = dict(conn.execute( rows = dict(conn.execute("SELECT id, visibility_status FROM store_entities ORDER BY id").fetchall())
"SELECT id, visibility_status FROM store_entities ORDER BY id" assert rows == {"a": "approved", "b": "pending", "c": "hidden"}, f"row values must survive the rebuild: {rows}"
).fetchall())
assert rows == {"a": "approved", "b": "pending", "c": "hidden"}, (
f"row values must survive the rebuild: {rows}"
)
conn.close() conn.close()
@ -452,16 +410,15 @@ def test_v34_to_v35_recovers_from_partial_rebuild_missing_visibility(tmp_path):
) )
""") """)
conn.execute( conn.execute(
"INSERT INTO store_entities (id, _vis_v35) VALUES " "INSERT INTO store_entities (id, _vis_v35) VALUES ('a', 'approved'), ('b', 'pending'), ('c', 'hidden')"
"('a', 'approved'), ('b', 'pending'), ('c', 'hidden')"
) )
_v34_to_v35_migrate(conn) _v34_to_v35_migrate(conn)
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'store_entities'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'store_entities'"
).fetchall() ).fetchall()
} }
assert "visibility_status" in cols assert "visibility_status" in cols
@ -469,9 +426,7 @@ def test_v34_to_v35_recovers_from_partial_rebuild_missing_visibility(tmp_path):
assert "archived_at" in cols assert "archived_at" in cols
assert "archived_by" in cols assert "archived_by" in cols
rows = dict(conn.execute( rows = dict(conn.execute("SELECT id, visibility_status FROM store_entities ORDER BY id").fetchall())
"SELECT id, visibility_status FROM store_entities ORDER BY id"
).fetchall())
assert rows == {"a": "approved", "b": "pending", "c": "hidden"}, ( assert rows == {"a": "approved", "b": "pending", "c": "hidden"}, (
f"row values must come back via RENAME, not be lost: {rows}" f"row values must come back via RENAME, not be lost: {rows}"
) )
@ -495,25 +450,20 @@ def test_v34_to_v35_recovers_from_partial_rebuild_both_columns(tmp_path):
_vis_v35 VARCHAR _vis_v35 VARCHAR
) )
""") """)
conn.execute( conn.execute("INSERT INTO store_entities (id, visibility_status, _vis_v35) VALUES ('a', 'approved', 'approved')")
"INSERT INTO store_entities (id, visibility_status, _vis_v35) VALUES "
"('a', 'approved', 'approved')"
)
_v34_to_v35_migrate(conn) _v34_to_v35_migrate(conn)
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'store_entities'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'store_entities'"
).fetchall() ).fetchall()
} }
assert "visibility_status" in cols assert "visibility_status" in cols
assert "_vis_v35" not in cols, "temp column must be dropped" assert "_vis_v35" not in cols, "temp column must be dropped"
row = conn.execute( row = conn.execute("SELECT id, visibility_status FROM store_entities WHERE id = 'a'").fetchone()
"SELECT id, visibility_status FROM store_entities WHERE id = 'a'"
).fetchone()
assert row == ("a", "approved") assert row == ("a", "approved")
conn.close() conn.close()
@ -533,10 +483,7 @@ def test_v32_db_with_partial_v35_recovers_through_full_ladder(tmp_path):
# Stand up the broken state. We only need enough of the schema for the # Stand up the broken state. We only need enough of the schema for the
# migration ladder to run — ``_ensure_schema`` will create the rest # migration ladder to run — ``_ensure_schema`` will create the rest
# via ``_SYSTEM_SCHEMA``'s IF NOT EXISTS guards. # via ``_SYSTEM_SCHEMA``'s IF NOT EXISTS guards.
conn.execute( conn.execute("CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp)")
"CREATE TABLE schema_version (version INTEGER, "
"applied_at TIMESTAMP DEFAULT current_timestamp)"
)
conn.execute("INSERT INTO schema_version (version) VALUES (32)") conn.execute("INSERT INTO schema_version (version) VALUES (32)")
conn.execute(""" conn.execute("""
CREATE TABLE store_entities ( CREATE TABLE store_entities (
@ -550,26 +497,21 @@ def test_v32_db_with_partial_v35_recovers_through_full_ladder(tmp_path):
_vis_v35 VARCHAR _vis_v35 VARCHAR
) )
""") """)
conn.execute( conn.execute("INSERT INTO store_entities (id, type, name, _vis_v35) VALUES ('a', 'skill', 'alpha', 'approved')")
"INSERT INTO store_entities (id, type, name, _vis_v35) "
"VALUES ('a', 'skill', 'alpha', 'approved')"
)
_ensure_schema(conn) _ensure_schema(conn)
assert get_schema_version(conn) == SCHEMA_VERSION assert get_schema_version(conn) == SCHEMA_VERSION
cols = { cols = {
r[0] for r in conn.execute( r[0]
"SELECT column_name FROM information_schema.columns " for r in conn.execute(
"WHERE table_name = 'store_entities'" "SELECT column_name FROM information_schema.columns WHERE table_name = 'store_entities'"
).fetchall() ).fetchall()
} }
assert "visibility_status" in cols assert "visibility_status" in cols
assert "_vis_v35" not in cols assert "_vis_v35" not in cols
# Existing row preserved, value carried over from _vis_v35. # Existing row preserved, value carried over from _vis_v35.
row = conn.execute( row = conn.execute("SELECT id, visibility_status FROM store_entities WHERE id = 'a'").fetchone()
"SELECT id, visibility_status FROM store_entities WHERE id = 'a'"
).fetchone()
assert row == ("a", "approved") assert row == ("a", "approved")
conn.close() conn.close()
@ -593,13 +535,9 @@ def test_v35_to_v36_reapplies_visibility_constraints(tmp_path):
).fetchall() ).fetchall()
assert cols, "visibility_status column missing from store_entities" assert cols, "visibility_status column missing from store_entities"
name, is_nullable, default_expr = cols[0] name, is_nullable, default_expr = cols[0]
assert is_nullable == "NO", ( assert is_nullable == "NO", f"visibility_status must be NOT NULL after v36; got is_nullable={is_nullable!r}"
f"visibility_status must be NOT NULL after v36; got is_nullable={is_nullable!r}"
)
# DuckDB renders the default as a quoted literal — match either form. # DuckDB renders the default as a quoted literal — match either form.
assert default_expr is not None, "visibility_status DEFAULT must be set" assert default_expr is not None, "visibility_status DEFAULT must be set"
assert "pending" in str(default_expr).lower(), ( assert "pending" in str(default_expr).lower(), f"visibility_status DEFAULT must be 'pending'; got {default_expr!r}"
f"visibility_status DEFAULT must be 'pending'; got {default_expr!r}"
)
conn.close() conn.close()

View file

@ -10,6 +10,7 @@ Covers:
- get_home_status_frame_visibility honors the env var + yaml override - get_home_status_frame_visibility honors the env var + yaml override
and defaults true. and defaults true.
""" """
from __future__ import annotations from __future__ import annotations
import asyncio import asyncio
@ -42,10 +43,7 @@ def test_v44_fresh_install_has_token_columns_and_last_pull(tmp_path):
user_cols = {r[1] for r in conn.execute("PRAGMA table_info(users)").fetchall()} user_cols = {r[1] for r in conn.execute("PRAGMA table_info(users)").fetchall()}
assert "last_pull_at" in user_cols assert "last_pull_at" in user_cols
sess_cols = { sess_cols = {r[1] for r in conn.execute("PRAGMA table_info(usage_session_summary)").fetchall()}
r[1]
for r in conn.execute("PRAGMA table_info(usage_session_summary)").fetchall()
}
for col in ( for col in (
"input_tokens", "input_tokens",
"output_tokens", "output_tokens",
@ -61,10 +59,7 @@ def test_v43_to_v44_upgrade_is_idempotent(tmp_path):
db_path = tmp_path / "system.duckdb" db_path = tmp_path / "system.duckdb"
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
# Hand-roll v43-shaped tables (no last_pull_at, no token cols). # Hand-roll v43-shaped tables (no last_pull_at, no token cols).
conn.execute( conn.execute("CREATE TABLE users (id VARCHAR PRIMARY KEY, email VARCHAR, onboarded BOOLEAN DEFAULT FALSE)")
"CREATE TABLE users (id VARCHAR PRIMARY KEY, email VARCHAR, "
"onboarded BOOLEAN DEFAULT FALSE)"
)
conn.execute( conn.execute(
""" """
CREATE TABLE usage_session_summary ( CREATE TABLE usage_session_summary (
@ -82,14 +77,8 @@ def test_v43_to_v44_upgrade_is_idempotent(tmp_path):
_v43_to_v44(conn) _v43_to_v44(conn)
_v43_to_v44(conn) # idempotent _v43_to_v44(conn) # idempotent
assert "last_pull_at" in { assert "last_pull_at" in {r[1] for r in conn.execute("PRAGMA table_info(users)").fetchall()}
r[1] for r in conn.execute("PRAGMA table_info(users)").fetchall() tok_cols = {r[1] for r in conn.execute("PRAGMA table_info(usage_session_summary)").fetchall() if "token" in r[1]}
}
tok_cols = {
r[1]
for r in conn.execute("PRAGMA table_info(usage_session_summary)").fetchall()
if "token" in r[1]
}
assert tok_cols == { assert tok_cols == {
"input_tokens", "input_tokens",
"output_tokens", "output_tokens",
@ -100,7 +89,7 @@ def test_v43_to_v44_upgrade_is_idempotent(tmp_path):
def test_schema_version_constant_is_44(): def test_schema_version_constant_is_44():
"""Belt + suspenders against schema_version regressions.""" """Belt + suspenders against schema_version regressions."""
assert SCHEMA_VERSION == 44 assert SCHEMA_VERSION == 45
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
@ -118,15 +107,23 @@ def stats_conn(tmp_path):
def _seed_user(conn, *, uid="u1", email="alice@example.com"): def _seed_user(conn, *, uid="u1", email="alice@example.com"):
conn.execute( conn.execute(
"INSERT INTO users (id, email, active, onboarded, last_pull_at) " "INSERT INTO users (id, email, active, onboarded, last_pull_at) VALUES (?, ?, TRUE, TRUE, current_timestamp)",
"VALUES (?, ?, TRUE, TRUE, current_timestamp)",
[uid, email], [uid, email],
) )
def _seed_session(conn, *, session_file, username, started_sql, prompts=0, def _seed_session(
input_tokens=0, output_tokens=0, conn,
cache_read=0, cache_creation=0): *,
session_file,
username,
started_sql,
prompts=0,
input_tokens=0,
output_tokens=0,
cache_read=0,
cache_creation=0,
):
conn.execute( conn.execute(
f""" f"""
INSERT INTO usage_session_summary INSERT INTO usage_session_summary
@ -136,8 +133,7 @@ def _seed_session(conn, *, session_file, username, started_sql, prompts=0,
VALUES (?, ?, ?, {started_sql}, current_timestamp, VALUES (?, ?, ?, {started_sql}, current_timestamp,
?, ?, ?, ?, ?, 2) ?, ?, ?, ?, ?, 2)
""", """,
[session_file, session_file, username, prompts, [session_file, session_file, username, prompts, input_tokens, output_tokens, cache_read, cache_creation],
input_tokens, output_tokens, cache_read, cache_creation],
) )
@ -159,27 +155,60 @@ def test_compute_home_stats_24h_vs_7d_windowing(stats_conn):
from app.api.me import compute_home_stats from app.api.me import compute_home_stats
_seed_user(stats_conn) _seed_user(stats_conn)
_seed_session(stats_conn, session_file="a.jsonl", username="alice", _seed_session(
started_sql="current_timestamp - INTERVAL 1 HOUR", stats_conn,
prompts=5, input_tokens=100, output_tokens=50, session_file="a.jsonl",
cache_read=800, cache_creation=25) username="alice",
_seed_session(stats_conn, session_file="b.jsonl", username="alice", started_sql="current_timestamp - INTERVAL 1 HOUR",
started_sql="current_timestamp - INTERVAL 3 DAY", prompts=5,
prompts=5, input_tokens=100, output_tokens=50, input_tokens=100,
cache_read=800, cache_creation=25) output_tokens=50,
_seed_session(stats_conn, session_file="c.jsonl", username="alice", cache_read=800,
started_sql="current_timestamp - INTERVAL 30 DAY", cache_creation=25,
prompts=99) )
_seed_session(
stats_conn,
session_file="b.jsonl",
username="alice",
started_sql="current_timestamp - INTERVAL 3 DAY",
prompts=5,
input_tokens=100,
output_tokens=50,
cache_read=800,
cache_creation=25,
)
_seed_session(
stats_conn,
session_file="c.jsonl",
username="alice",
started_sql="current_timestamp - INTERVAL 30 DAY",
prompts=99,
)
_seed_event(stats_conn, ev_id="e1", session_file="a.jsonl", _seed_event(
username="alice", cwd="/proj/alpha", stats_conn,
occurred_sql="current_timestamp - INTERVAL 1 HOUR") ev_id="e1",
_seed_event(stats_conn, ev_id="e2", session_file="a.jsonl", session_file="a.jsonl",
username="alice", cwd="/proj/beta", username="alice",
occurred_sql="current_timestamp - INTERVAL 2 HOUR") cwd="/proj/alpha",
_seed_event(stats_conn, ev_id="e3", session_file="b.jsonl", occurred_sql="current_timestamp - INTERVAL 1 HOUR",
username="alice", cwd="/proj/gamma", )
occurred_sql="current_timestamp - INTERVAL 3 DAY") _seed_event(
stats_conn,
ev_id="e2",
session_file="a.jsonl",
username="alice",
cwd="/proj/beta",
occurred_sql="current_timestamp - INTERVAL 2 HOUR",
)
_seed_event(
stats_conn,
ev_id="e3",
session_file="b.jsonl",
username="alice",
cwd="/proj/gamma",
occurred_sql="current_timestamp - INTERVAL 3 DAY",
)
user = {"id": "u1", "email": "alice@example.com"} user = {"id": "u1", "email": "alice@example.com"}
@ -243,8 +272,10 @@ def test_compute_home_stats_missing_users_row_returns_zeros(stats_conn):
"sessions": 0, "sessions": 0,
"prompts": 0, "prompts": 0,
"tokens": { "tokens": {
"input": 0, "output": 0, "input": 0,
"cache_read": 0, "cache_creation": 0, "output": 0,
"cache_read": 0,
"cache_creation": 0,
"total": 0, "total": 0,
}, },
"projects": 0, "projects": 0,
@ -267,9 +298,7 @@ def test_sync_manifest_bumps_last_pull_at(stats_conn, monkeypatch, tmp_path):
_seed_user(stats_conn, uid="u_pull", email="puller@example.com") _seed_user(stats_conn, uid="u_pull", email="puller@example.com")
# Wipe seeded last_pull_at so we can detect the bump. # Wipe seeded last_pull_at so we can detect the bump.
stats_conn.execute( stats_conn.execute("UPDATE users SET last_pull_at = NULL WHERE id = ?", ["u_pull"])
"UPDATE users SET last_pull_at = NULL WHERE id = ?", ["u_pull"]
)
asyncio.run( asyncio.run(
sync_manifest( sync_manifest(
@ -277,9 +306,7 @@ def test_sync_manifest_bumps_last_pull_at(stats_conn, monkeypatch, tmp_path):
conn=stats_conn, conn=stats_conn,
) )
) )
row = stats_conn.execute( row = stats_conn.execute("SELECT last_pull_at FROM users WHERE id = ?", ["u_pull"]).fetchone()
"SELECT last_pull_at FROM users WHERE id = ?", ["u_pull"]
).fetchone()
# Don't compare against `datetime.now(utc)` — DuckDB's # Don't compare against `datetime.now(utc)` — DuckDB's
# ``current_timestamp`` returns the session's wall-clock time which # ``current_timestamp`` returns the session's wall-clock time which
# may be naive-local-or-utc depending on the environment, so a # may be naive-local-or-utc depending on the environment, so a
@ -298,6 +325,7 @@ def test_status_frame_default_is_visible(monkeypatch):
"""Absent both env var and yaml entry, the flag returns True.""" """Absent both env var and yaml entry, the flag returns True."""
monkeypatch.delenv("AGNES_HOME_SHOW_STATUS_FRAME", raising=False) monkeypatch.delenv("AGNES_HOME_SHOW_STATUS_FRAME", raising=False)
from app.instance_config import get_home_status_frame_visibility from app.instance_config import get_home_status_frame_visibility
assert get_home_status_frame_visibility() is True assert get_home_status_frame_visibility() is True
@ -305,12 +333,14 @@ def test_status_frame_env_var_off(monkeypatch):
"""AGNES_HOME_SHOW_STATUS_FRAME=0 hides the frame.""" """AGNES_HOME_SHOW_STATUS_FRAME=0 hides the frame."""
monkeypatch.setenv("AGNES_HOME_SHOW_STATUS_FRAME", "0") monkeypatch.setenv("AGNES_HOME_SHOW_STATUS_FRAME", "0")
from app.instance_config import get_home_status_frame_visibility from app.instance_config import get_home_status_frame_visibility
assert get_home_status_frame_visibility() is False assert get_home_status_frame_visibility() is False
def test_status_frame_env_var_falsey_values(monkeypatch): def test_status_frame_env_var_falsey_values(monkeypatch):
"""Each of {0, false, no, off, ''} hides the frame; anything else shows.""" """Each of {0, false, no, off, ''} hides the frame; anything else shows."""
from app.instance_config import get_home_status_frame_visibility from app.instance_config import get_home_status_frame_visibility
for val in ("0", "false", "False", "FALSE", "no", "off", ""): for val in ("0", "false", "False", "FALSE", "no", "off", ""):
monkeypatch.setenv("AGNES_HOME_SHOW_STATUS_FRAME", val) monkeypatch.setenv("AGNES_HOME_SHOW_STATUS_FRAME", val)
assert get_home_status_frame_visibility() is False, f"{val!r} should hide" assert get_home_status_frame_visibility() is False, f"{val!r} should hide"

View file

@ -12,9 +12,7 @@ Coverage:
from __future__ import annotations from __future__ import annotations
import os
import duckdb
import pytest import pytest
from connectors.internal.access import ( from connectors.internal.access import (
@ -46,22 +44,24 @@ def system_db(tmp_path, monkeypatch):
# Force close any pre-existing handle held by an earlier test. # Force close any pre-existing handle held by an earlier test.
from src.db import close_system_db from src.db import close_system_db
close_system_db() close_system_db()
from src.db import get_system_db from src.db import get_system_db
conn = get_system_db() conn = get_system_db()
_ensure_schema(conn) _ensure_schema(conn)
ensure_internal_tables_registered(conn) ensure_internal_tables_registered(conn)
# Seed a couple of canonical rows for the RBAC checks. # Seed a couple of canonical rows for the RBAC checks.
conn.execute( conn.execute(
"INSERT INTO usage_session_summary " "INSERT INTO usage_session_summary "
"(session_file, session_id, username, tool_calls, tool_errors, processor_version) VALUES " "(session_file, session_id, username, user_id, tool_calls, tool_errors, processor_version) VALUES "
"('alice/s1.jsonl', 's-a-1', 'alice', 10, 1, 1)" "('alice/s1.jsonl', 's-a-1', 'alice', 'alice-uuid', 10, 1, 1)"
) )
conn.execute( conn.execute(
"INSERT INTO usage_session_summary " "INSERT INTO usage_session_summary "
"(session_file, session_id, username, tool_calls, tool_errors, processor_version) VALUES " "(session_file, session_id, username, user_id, tool_calls, tool_errors, processor_version) VALUES "
"('bob/s2.jsonl', 's-b-1', 'bob', 20, 0, 1)" "('bob/s2.jsonl', 's-b-1', 'bob', 'bob-uuid', 20, 0, 1)"
) )
conn.execute( conn.execute(
"INSERT INTO audit_log (id, user_id, action, result) VALUES " "INSERT INTO audit_log (id, user_id, action, result) VALUES "
@ -79,6 +79,7 @@ def system_db(tmp_path, monkeypatch):
# Static helpers — no DB needed # Static helpers — no DB needed
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def test_is_internal_table_recognises_canonical_ids(): def test_is_internal_table_recognises_canonical_ids():
assert is_internal_table("agnes_sessions") assert is_internal_table("agnes_sessions")
assert is_internal_table("agnes_telemetry") assert is_internal_table("agnes_telemetry")
@ -108,25 +109,31 @@ def test_filter_clause_admin_is_empty():
assert build_filter_clause(table, {"email": "admin@x", "id": "admin-uuid"}, True) == "" assert build_filter_clause(table, {"email": "admin@x", "id": "admin-uuid"}, True) == ""
def test_filter_clause_non_admin_scopes_to_username(): def test_filter_clause_non_admin_scopes_to_user_id():
"""agnes_sessions filters on user_id with OR username fallback for
pre-backfill rows."""
table = INTERNAL_TABLES_BY_ID["agnes_sessions"] table = INTERNAL_TABLES_BY_ID["agnes_sessions"]
clause = build_filter_clause(table, {"email": "alice@example.com", "id": "alice-uuid"}, False) clause = build_filter_clause(table, {"email": "alice@example.com", "id": "alice-uuid"}, False)
assert clause == "WHERE username = 'alice'" assert clause == "WHERE (user_id = 'alice-uuid' OR username = 'alice')"
def test_filter_clause_non_admin_scopes_audit_to_user_id(): def test_filter_clause_non_admin_scopes_audit_to_user_id():
table = INTERNAL_TABLES_BY_ID["agnes_audit"] table = INTERNAL_TABLES_BY_ID["agnes_audit"]
clause = build_filter_clause( clause = build_filter_clause(
table, {"email": "alice@example.com", "id": "alice-uuid"}, False, table,
{"email": "alice@example.com", "id": "alice-uuid"},
False,
) )
assert clause == "WHERE user_id = 'alice-uuid'" assert clause == "WHERE user_id = 'alice-uuid'"
def test_filter_clause_rejects_unsafe_username(): def test_filter_clause_rejects_unsafe_user_id():
table = INTERNAL_TABLES_BY_ID["agnes_sessions"] table = INTERNAL_TABLES_BY_ID["agnes_sessions"]
with pytest.raises(InternalAccessError): with pytest.raises(InternalAccessError):
build_filter_clause( build_filter_clause(
table, {"email": "alice'; DROP TABLE--@example.com", "id": "x"}, False, table,
{"email": "alice@example.com", "id": "'; DROP TABLE--"},
False,
) )
@ -134,6 +141,7 @@ def test_filter_clause_rejects_unsafe_username():
# End-to-end RBAC — runs against a real (fresh) system.duckdb # End-to-end RBAC — runs against a real (fresh) system.duckdb
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def test_admin_sees_every_user_session(system_db, tmp_path): def test_admin_sees_every_user_session(system_db, tmp_path):
db_path = str(_get_state_dir() / "system.duckdb") db_path = str(_get_state_dir() / "system.duckdb")
_, rows, _ = execute_internal_query( _, rows, _ = execute_internal_query(
@ -156,6 +164,49 @@ def test_non_admin_sees_only_own_sessions(system_db):
assert rows == [("alice", "s-a-1")] assert rows == [("alice", "s-a-1")]
def test_non_admin_sees_sessions_uploaded_via_api(system_db):
"""Sessions uploaded via POST /api/upload/sessions are stored under
user_id (UUID) by the pipeline. The user_id filter covers both
ingestion paths (collector and upload API) because the pipeline
resolves the stable user_id for every session at processing time."""
from src.db import get_system_db
conn = get_system_db()
conn.execute(
"INSERT INTO usage_session_summary "
"(session_file, session_id, username, user_id, tool_calls, tool_errors, processor_version) VALUES "
"('550e8400-uuid/api.jsonl', 's-api-1', '550e8400-uuid', '550e8400-uuid', 5, 0, 1)"
)
db_path = str(_get_state_dir() / "system.duckdb")
_, rows, _ = execute_internal_query(
db_path,
{"email": "alice@example.com", "id": "550e8400-uuid"},
is_admin=False,
sql="SELECT username, session_id FROM agnes_sessions ORDER BY session_id",
)
# alice should see both her collector row (user_id='alice-uuid' from
# fixture — won't match '550e8400-uuid') and her upload-API row
# (user_id='550e8400-uuid').
assert ("550e8400-uuid", "s-api-1") in rows
# bob's rows must NOT appear
assert all(r[0] != "bob" for r in rows)
def test_email_change_does_not_affect_visibility(system_db):
"""If alice changes her email from alice@example.com to
alice.new@example.com, her sessions should still be visible
because the filter uses the stable user_id, not the email."""
db_path = str(_get_state_dir() / "system.duckdb")
# Query with a different email but the same user_id.
_, rows, _ = execute_internal_query(
db_path,
{"email": "alice.new@example.com", "id": "alice-uuid"},
is_admin=False,
sql="SELECT username, session_id FROM agnes_sessions",
)
assert rows == [("alice", "s-a-1")]
def test_non_admin_sees_only_own_audit_rows(system_db): def test_non_admin_sees_only_own_audit_rows(system_db):
db_path = str(_get_state_dir() / "system.duckdb") db_path = str(_get_state_dir() / "system.duckdb")
_, rows, _ = execute_internal_query( _, rows, _ = execute_internal_query(
@ -225,6 +276,7 @@ def test_execute_rejects_sql_without_internal_refs():
# request handler layer where `is_user_admin(user)` was mis-called. # request handler layer where `is_user_admin(user)` was mis-called.
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def _seed_internal_via_api(): def _seed_internal_via_api():
"""``seeded_app`` bypasses the FastAPI lifespan, so the registry seed """``seeded_app`` bypasses the FastAPI lifespan, so the registry seed
that puts agnes_sessions / agnes_telemetry / agnes_audit into that puts agnes_sessions / agnes_telemetry / agnes_audit into
@ -232,6 +284,7 @@ def _seed_internal_via_api():
routes (`/api/query`, `/api/v2/sample`) can find them.""" routes (`/api/query`, `/api/v2/sample`) can find them."""
from src.db import get_system_db from src.db import get_system_db
from connectors.internal.registry import ensure_internal_tables_registered from connectors.internal.registry import ensure_internal_tables_registered
ensure_internal_tables_registered(get_system_db()) ensure_internal_tables_registered(get_system_db())
@ -243,6 +296,7 @@ def test_query_internal_via_api_admin_path(seeded_app, admin_user):
""" """
_seed_internal_via_api() _seed_internal_via_api()
from src.db import get_system_db from src.db import get_system_db
conn = get_system_db() conn = get_system_db()
conn.execute( conn.execute(
"INSERT INTO usage_session_summary " "INSERT INTO usage_session_summary "
@ -287,6 +341,7 @@ def test_sample_internal_via_api_admin_path(seeded_app, admin_user):
# round-2 review (PR #278 R2 finding #1). # round-2 review (PR #278 R2 finding #1).
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
def test_string_literal_alone_does_not_route_internal(system_db): def test_string_literal_alone_does_not_route_internal(system_db):
"""A SQL with `agnes_sessions` only inside a string literal should not """A SQL with `agnes_sessions` only inside a string literal should not
be treated as referencing the internal table find_internal_refs be treated as referencing the internal table find_internal_refs

View file

@ -1,4 +1,5 @@
"""v41 → v42 migration: 7 new usage_* tables for telemetry.""" """v41 → v42 migration: 7 new usage_* tables for telemetry."""
import duckdb import duckdb
import pytest import pytest
from src.db import _ensure_schema as init_database, SCHEMA_VERSION from src.db import _ensure_schema as init_database, SCHEMA_VERSION
@ -10,18 +11,25 @@ def test_schema_version_is_42():
# constant. Test name preserved for git-blame continuity; the # constant. Test name preserved for git-blame continuity; the
# version-pinned tests in test_db_schema_version.py and # version-pinned tests in test_db_schema_version.py and
# test_home_stats.py carry the v44 commentary. # test_home_stats.py carry the v44 commentary.
assert SCHEMA_VERSION == 44 assert SCHEMA_VERSION == 45
def test_v42_tables_exist_after_init(tmp_path): def test_v42_tables_exist_after_init(tmp_path):
db_path = tmp_path / "test.duckdb" db_path = tmp_path / "test.duckdb"
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
init_database(conn) init_database(conn)
tables = {row[0] for row in conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='main'").fetchall()} tables = {
row[0]
for row in conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='main'").fetchall()
}
for tbl in [ for tbl in [
"usage_events", "usage_session_summary", "usage_events",
"usage_tool_daily", "usage_plugin_daily", "usage_session_summary",
"usage_attribution_skills", "usage_attribution_agents", "usage_attribution_commands", "usage_tool_daily",
"usage_plugin_daily",
"usage_attribution_skills",
"usage_attribution_agents",
"usage_attribution_commands",
]: ]:
assert tbl in tables, f"missing table {tbl}" assert tbl in tables, f"missing table {tbl}"
conn.close() conn.close()
@ -31,12 +39,21 @@ def test_v42_indices_exist(tmp_path):
db_path = tmp_path / "test.duckdb" db_path = tmp_path / "test.duckdb"
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
init_database(conn) init_database(conn)
idx_names = {row[0] for row in conn.execute("SELECT index_name FROM duckdb_indexes WHERE table_name LIKE 'usage_%'").fetchall()} idx_names = {
row[0]
for row in conn.execute("SELECT index_name FROM duckdb_indexes WHERE table_name LIKE 'usage_%'").fetchall()
}
for idx in [ for idx in [
"idx_usage_events_session", "idx_usage_events_user_time", "idx_usage_events_tool", "idx_usage_events_session",
"idx_usage_events_skill", "idx_usage_events_ref", "idx_usage_events_user_time",
"idx_usage_session_user", "idx_usage_session_started", "idx_usage_events_tool",
"idx_usage_attr_skill_lookup", "idx_usage_attr_agent_lookup", "idx_usage_attr_command_lookup", "idx_usage_events_skill",
"idx_usage_events_ref",
"idx_usage_session_user",
"idx_usage_session_started",
"idx_usage_attr_skill_lookup",
"idx_usage_attr_agent_lookup",
"idx_usage_attr_command_lookup",
]: ]:
assert idx in idx_names, f"missing index {idx}" assert idx in idx_names, f"missing index {idx}"
conn.close() conn.close()
@ -51,7 +68,7 @@ def test_v41_to_v42_is_idempotent(tmp_path):
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
init_database(conn) init_database(conn)
v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0]
assert v == 44 assert v == 45
conn.close() conn.close()
@ -72,15 +89,20 @@ def test_v41_db_upgrades_cleanly(tmp_path):
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
init_database(conn) init_database(conn)
v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0]
assert v == 44 assert v == 45
# All 7 new v41 tables exist after the v40→v41 upgrade # All 7 new v41 tables exist after the v40→v41 upgrade
tables = {row[0] for row in conn.execute( tables = {
"SELECT table_name FROM information_schema.tables WHERE table_schema='main'" row[0]
).fetchall()} for row in conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema='main'").fetchall()
}
for tbl in [ for tbl in [
"usage_events", "usage_session_summary", "usage_events",
"usage_tool_daily", "usage_plugin_daily", "usage_session_summary",
"usage_attribution_skills", "usage_attribution_agents", "usage_attribution_commands", "usage_tool_daily",
"usage_plugin_daily",
"usage_attribution_skills",
"usage_attribution_agents",
"usage_attribution_commands",
]: ]:
assert tbl in tables, f"missing table {tbl} after v40→v41 upgrade" assert tbl in tables, f"missing table {tbl} after v40→v41 upgrade"
conn.close() conn.close()
@ -99,7 +121,7 @@ def test_v30_db_ladders_all_the_way_up(tmp_path):
conn = duckdb.connect(str(db_path)) conn = duckdb.connect(str(db_path))
init_database(conn) init_database(conn)
v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] v = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0]
assert v == 44 assert v == 45
cnt = conn.execute("SELECT COUNT(*) FROM audit_log WHERE id='vintage'").fetchone()[0] cnt = conn.execute("SELECT COUNT(*) FROM audit_log WHERE id='vintage'").fetchone()[0]
assert cnt == 1 assert cnt == 1
# New v41 table exists # New v41 table exists

View file

@ -17,11 +17,10 @@ import json
from pathlib import Path from pathlib import Path
import duckdb import duckdb
import pytest
from services.session_pipeline.contract import ProcessorResult from services.session_pipeline.contract import ProcessorResult
from services.session_pipeline.lib import compute_file_hash, parse_jsonl from services.session_pipeline.lib import compute_file_hash, parse_jsonl
from services.session_pipeline.runner import run_processor from services.session_pipeline.runner import resolve_user_id, run_processor
from src.repositories.session_processor_state import SessionProcessorStateRepository from src.repositories.session_processor_state import SessionProcessorStateRepository
@ -29,6 +28,7 @@ def _fresh_db(tmp_path, monkeypatch) -> duckdb.DuckDBPyConnection:
"""Same idiom as tests/test_corporate_memory_v1.py — fresh schema in tmp_path.""" """Same idiom as tests/test_corporate_memory_v1.py — fresh schema in tmp_path."""
monkeypatch.setenv("DATA_DIR", str(tmp_path)) monkeypatch.setenv("DATA_DIR", str(tmp_path))
import src.db as db_module import src.db as db_module
db_module._system_db_conn = None db_module._system_db_conn = None
db_module._system_db_path = None db_module._system_db_path = None
return db_module.get_system_db() return db_module.get_system_db()
@ -38,12 +38,15 @@ def _fresh_db(tmp_path, monkeypatch) -> duckdb.DuckDBPyConnection:
# parse_jsonl # parse_jsonl
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestParseJsonl: class TestParseJsonl:
def test_parses_well_formed_lines(self, tmp_path): def test_parses_well_formed_lines(self, tmp_path):
f = tmp_path / "session.jsonl" f = tmp_path / "session.jsonl"
f.write_text( f.write_text(
json.dumps({"role": "user", "content": "hi"}) + "\n" json.dumps({"role": "user", "content": "hi"})
+ json.dumps({"role": "assistant", "content": "hello"}) + "\n" + "\n"
+ json.dumps({"role": "assistant", "content": "hello"})
+ "\n"
) )
turns = parse_jsonl(f) turns = parse_jsonl(f)
assert len(turns) == 2 assert len(turns) == 2
@ -55,9 +58,11 @@ class TestParseJsonl:
a single corrupt row mustn't abort processing of the rest.""" a single corrupt row mustn't abort processing of the rest."""
f = tmp_path / "session.jsonl" f = tmp_path / "session.jsonl"
f.write_text( f.write_text(
json.dumps({"role": "user", "content": "ok"}) + "\n" json.dumps({"role": "user", "content": "ok"})
+ "\n"
+ "this is not json\n" + "this is not json\n"
+ json.dumps({"role": "assistant", "content": "still ok"}) + "\n" + json.dumps({"role": "assistant", "content": "still ok"})
+ "\n"
) )
turns = parse_jsonl(f) turns = parse_jsonl(f)
assert len(turns) == 2 assert len(turns) == 2
@ -66,11 +71,7 @@ class TestParseJsonl:
def test_skips_blank_lines(self, tmp_path): def test_skips_blank_lines(self, tmp_path):
f = tmp_path / "session.jsonl" f = tmp_path / "session.jsonl"
f.write_text( f.write_text("\n" + json.dumps({"role": "user", "content": "x"}) + "\n" + " \n")
"\n"
+ json.dumps({"role": "user", "content": "x"}) + "\n"
+ " \n"
)
turns = parse_jsonl(f) turns = parse_jsonl(f)
assert len(turns) == 1 assert len(turns) == 1
@ -79,6 +80,7 @@ class TestParseJsonl:
# compute_file_hash # compute_file_hash
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestComputeFileHash: class TestComputeFileHash:
def test_deterministic(self, tmp_path): def test_deterministic(self, tmp_path):
f = tmp_path / "x.jsonl" f = tmp_path / "x.jsonl"
@ -94,10 +96,77 @@ class TestComputeFileHash:
assert h1 != h2 assert h1 != h2
# ---------------------------------------------------------------------------
# resolve_user_id
# ---------------------------------------------------------------------------
class TestResolveUserId:
"""Unit tests for the linchpin identity-resolution function."""
@staticmethod
def _seed_users(conn: duckdb.DuckDBPyConnection, rows: list[tuple[str, str, str | None]]) -> None:
for uid, email, updated_at in rows:
conn.execute(
"INSERT INTO users (id, email, updated_at) VALUES (?, ?, ?)",
[uid, email, updated_at],
)
def test_exact_uuid_match(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(conn, [("uuid-aaa", "alice@example.com", "2026-01-01")])
assert resolve_user_id(conn, "uuid-aaa") == "uuid-aaa"
conn.close()
def test_email_local_part_match(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(conn, [("uuid-bbb", "bob@example.com", "2026-01-01")])
assert resolve_user_id(conn, "bob") == "uuid-bbb"
conn.close()
def test_null_fallback_for_unknown(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(conn, [("uuid-aaa", "alice@example.com", "2026-01-01")])
assert resolve_user_id(conn, "nobody") is None
conn.close()
def test_tiebreak_picks_most_recently_updated(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(
conn,
[
("uuid-old", "zara@old.com", "2025-01-01"),
("uuid-new", "zara@new.com", "2026-06-01"),
],
)
assert resolve_user_id(conn, "zara") == "uuid-new"
conn.close()
def test_underscore_not_treated_as_wildcard(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(
conn,
[
("uuid-alice", "alicexsmith@example.com", "2026-01-01"),
("uuid-real", "alice_smith@example.com", "2025-01-01"),
],
)
# "alice_smith" must match only the literal underscore email
assert resolve_user_id(conn, "alice_smith") == "uuid-real"
conn.close()
def test_uuid_branch_takes_priority_over_email(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch)
self._seed_users(conn, [("uuid-aaa", "uuid-aaa@example.com", "2026-01-01")])
assert resolve_user_id(conn, "uuid-aaa") == "uuid-aaa"
conn.close()
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
# SessionProcessorStateRepository # SessionProcessorStateRepository
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestSessionProcessorStateRepository: class TestSessionProcessorStateRepository:
def test_unprocessed_when_empty(self, tmp_path, monkeypatch): def test_unprocessed_when_empty(self, tmp_path, monkeypatch):
conn = _fresh_db(tmp_path, monkeypatch) conn = _fresh_db(tmp_path, monkeypatch)
@ -175,7 +244,6 @@ class TestSessionProcessorStateRepository:
every scheduler tick.""" every scheduler tick."""
import os import os
import time import time
from datetime import datetime, timezone
conn = _fresh_db(tmp_path, monkeypatch) conn = _fresh_db(tmp_path, monkeypatch)
sessions = tmp_path / "sessions" sessions = tmp_path / "sessions"
@ -233,6 +301,7 @@ class TestSessionProcessorStateRepository:
# run_processor # run_processor
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class _FakeProcessor: class _FakeProcessor:
"""Test double that records its calls and is configurable per behavior.""" """Test double that records its calls and is configurable per behavior."""
@ -249,7 +318,7 @@ class _FakeProcessor:
self.raise_on_session = raise_on_session self.raise_on_session = raise_on_session
self.calls: list[str] = [] self.calls: list[str] = []
def process_session(self, session_path: Path, username: str, session_key: str, conn): def process_session(self, session_path: Path, username: str, session_key: str, conn, **kwargs: object):
self.calls.append(session_key) self.calls.append(session_key)
if self.raise_on_session is not None and session_key == self.raise_on_session: if self.raise_on_session is not None and session_key == self.raise_on_session:
raise RuntimeError("simulated processor failure") raise RuntimeError("simulated processor failure")
@ -385,6 +454,7 @@ class TestRunProcessor:
class _BadReturn: class _BadReturn:
name = "bad" name = "bad"
cadence_minutes = 1 cadence_minutes = 1
def process_session(self, *a, **kw): def process_session(self, *a, **kw):
return None # type: ignore[return-value] return None # type: ignore[return-value]
@ -398,6 +468,7 @@ class TestRunProcessor:
# v29 migration — verification rows preserved, old table dropped # v29 migration — verification rows preserved, old table dropped
# --------------------------------------------------------------------------- # ---------------------------------------------------------------------------
class TestV29Migration: class TestV29Migration:
"""Exercise the v28 → v29 migration directly. Builds a v28 schema (using """Exercise the v28 → v29 migration directly. Builds a v28 schema (using
the pre-v29 idiom inline so the test doesn't depend on _SYSTEM_SCHEMA's the pre-v29 idiom inline so the test doesn't depend on _SYSTEM_SCHEMA's
@ -426,6 +497,7 @@ class TestV29Migration:
# Run v29 migration steps via the helper (which conditionally copies # Run v29 migration steps via the helper (which conditionally copies
# from the legacy table when present). # from the legacy table when present).
from src.db import _v30_to_v31_migrate from src.db import _v30_to_v31_migrate
_v30_to_v31_migrate(conn) _v30_to_v31_migrate(conn)
# New table has the row tagged with processor_name='verification'. # New table has the row tagged with processor_name='verification'.
@ -437,7 +509,8 @@ class TestV29Migration:
# Old table is gone. # Old table is gone.
existing = { existing = {
r[0] for r in conn.execute( r[0]
for r in conn.execute(
"SELECT table_name FROM information_schema.tables WHERE table_schema='main'" "SELECT table_name FROM information_schema.tables WHERE table_schema='main'"
).fetchall() ).fetchall()
} }
@ -477,10 +550,12 @@ class TestV29Migration:
) )
from src.db import _v30_to_v31_migrate from src.db import _v30_to_v31_migrate
_v30_to_v31_migrate(conn) _v30_to_v31_migrate(conn)
existing = { existing = {
r[0] for r in conn.execute( r[0]
for r in conn.execute(
"SELECT table_name FROM information_schema.tables WHERE table_schema='main'" "SELECT table_name FROM information_schema.tables WHERE table_schema='main'"
).fetchall() ).fetchall()
} }