* fix(security): RBAC filter for agnes_sessions matches both email local-part and user_id
The upload API (POST /api/upload/sessions) stores session files under
user_sessions/{user_id}/ (UUID), while the session collector uses the
OS username (email local-part). The session pipeline writes the directory
name verbatim into usage_session_summary.username, so the column can
contain either value depending on the ingestion path.
The RBAC filter in build_filter_clause previously only matched the email
local-part, missing sessions uploaded via the API. The fix adds an OR
condition so non-admin users see rows where username matches either their
email local-part or their user_id.
Closes #293
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(security): RBAC filter uses stable user_id instead of mutable email local-part
Closes #293
Previous fix used OR condition matching both email local-part and user_id
in the username column. This was fragile: email changes would break
filtering. This commit introduces a dedicated user_id column populated
by the session pipeline via resolve_user_id(), and switches the RBAC
filter to use it exclusively.
Changes:
- Schema v45: add user_id column to usage_session_summary and usage_events
- UsageProcessor: accept and store user_id in both tables
- runner.py: resolve_user_id() maps directory name to users.id UUID
(exact match for UUID dirs, email LIKE for local-part dirs)
- INTERNAL_TABLES: agnes_sessions/agnes_telemetry filter on user_id column
- build_filter_clause: simplified to WHERE user_id = '<uuid>' (no OR)
- me.py/admin_user_sessions.py: query by user_id OR username for
backward compatibility during transition
- USAGE_PROCESSOR_VERSION bumped 2→3 to trigger reprocessing/backfill
- Tests updated: 27 pass including new email-change resilience test
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(tests): bump schema version assertions 44→45
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(docs): correct resolve_user_id docstring, add TypeError comment
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(security): address review — backward-compat OR, LIKE escaping, narrower TypeError
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(security): address code review — eliminate TypeError hack, add resolve_user_id tests
Co-Authored-By: zdenek.srotyr <zdenek.srotyr@keboola.com>
* fix(db): create user_id indexes in _v44_to_v45, not _SYSTEM_SCHEMA
_SYSTEM_SCHEMA runs before the migration ladder. On an upgrade from
v42/v43/v44, usage_events / usage_session_summary already exist without
the user_id column (CREATE TABLE IF NOT EXISTS is a no-op), so the
CREATE INDEX ... (user_id) lines in _SYSTEM_SCHEMA failed to bind and
aborted _ensure_schema — the app would not start post-upgrade. Move the
index creation to _v44_to_v45, which ADDs the column first. Same pattern
as the v41 audit_log indices.
* fix(usage): bump USAGE_PROCESSOR_VERSION 3→4 for user_id backfill
#303 shipped USAGE_PROCESSOR_VERSION=3 (release 0.54.12) for its
<command-name> slash extraction. This PR's 2→3 bump collided with it
on rebase, so the reprocess loop would not re-trigger to backfill the
new user_id column on deployments already running v3. Bump to 4.
* release: 0.54.13 — RBAC filter uses stable user_id (#293)
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
365 lines
15 KiB
Python
365 lines
15 KiB
Python
"""Per-request scoped query execution for the ``internal`` data source.
|
|
|
|
Architecture: open ``system.duckdb`` read-only, build a temporary view per
|
|
referenced internal table with RBAC applied as a WHERE clause, execute the
|
|
user's SQL inside that scope, return rows.
|
|
|
|
RBAC model:
|
|
- Admin (per ``is_user_admin``) sees the table unscoped — no WHERE clause.
|
|
- Everyone else gets a single-row-filter projection. The filter column is
|
|
hard-coded per table (``INTERNAL_TABLES``) and the filter value comes
|
|
from the auth-resolved user object — never from user-supplied SQL.
|
|
|
|
SQL-injection considerations:
|
|
- The temp-view DDL interpolates a literal string for the filter value.
|
|
- ``username`` for ``usage_*`` is the local-part of an email; we enforce
|
|
the same regex used by the session-file path (alnums + ``._-``) before
|
|
interpolation.
|
|
- ``user_id`` for ``audit_log`` is a UUID; we validate the regex before
|
|
interpolation.
|
|
- The user's SELECT itself is gated by the same SELECT-only validator the
|
|
``/api/query`` endpoint uses (denylist of write/DDL keywords + file
|
|
functions). That validator runs in the API layer.
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import logging
|
|
import re
|
|
from dataclasses import dataclass
|
|
from typing import Any
|
|
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Internal-table registry — single source of truth
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
@dataclass(frozen=True)
|
|
class InternalTable:
|
|
"""One internal table mapping.
|
|
|
|
Fields:
|
|
registry_id: the value used in ``table_registry.id`` and what
|
|
analysts type in SQL (``SELECT * FROM <registry_id>``)
|
|
source_table: the underlying physical table in ``system.duckdb``
|
|
filter_column: the column on ``source_table`` carrying the per-row
|
|
owner. Used for the non-admin scoping clause.
|
|
filter_kind: how to resolve the filter value from the auth user
|
|
dict — ``'username'`` (email local-part) or
|
|
``'user_id'`` (UUID).
|
|
display_name: human-readable name (also goes into ``table_registry.name``)
|
|
description: short blurb (catalog UI + ``agnes catalog`` output)
|
|
"""
|
|
|
|
registry_id: str
|
|
source_table: str
|
|
filter_column: str
|
|
filter_kind: str # 'username' | 'user_id'
|
|
display_name: str
|
|
description: str
|
|
legacy_username_column: str | None = None # backward-compat OR fallback
|
|
|
|
|
|
INTERNAL_TABLES: tuple[InternalTable, ...] = (
|
|
InternalTable(
|
|
registry_id="agnes_sessions",
|
|
source_table="usage_session_summary",
|
|
filter_column="user_id",
|
|
filter_kind="user_id",
|
|
display_name="Agnes sessions",
|
|
description="Claude Code sessions. Also available locally for analysis.",
|
|
legacy_username_column="username",
|
|
),
|
|
InternalTable(
|
|
registry_id="agnes_telemetry",
|
|
source_table="usage_events",
|
|
filter_column="user_id",
|
|
filter_kind="user_id",
|
|
display_name="Agnes telemetry events",
|
|
description="Tool and skill invocations from Claude Code. Also available locally for analysis.",
|
|
legacy_username_column="username",
|
|
),
|
|
InternalTable(
|
|
registry_id="agnes_audit",
|
|
source_table="audit_log",
|
|
filter_column="user_id",
|
|
filter_kind="user_id",
|
|
display_name="Agnes audit log",
|
|
description="Server-side actions performed against Agnes. Also available locally for analysis.",
|
|
),
|
|
)
|
|
|
|
INTERNAL_TABLES_BY_ID: dict[str, InternalTable] = {t.registry_id: t for t in INTERNAL_TABLES}
|
|
|
|
|
|
def is_internal_table(table_id: str) -> bool:
|
|
return table_id in INTERNAL_TABLES_BY_ID
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# RBAC filter resolution
|
|
# ---------------------------------------------------------------------------
|
|
|
|
# `+` allowed in both regexes — RFC 5321 local-parts (e.g. alice+test@x)
|
|
# resolve to filesystem usernames with a `+`, and the session-data-dir
|
|
# layout already supports the same character class.
|
|
_USERNAME_RE = re.compile(r"^[A-Za-z0-9._+-]{1,200}$")
|
|
_USER_ID_RE = re.compile(r"^[A-Za-z0-9._@:+-]{1,200}$")
|
|
|
|
|
|
class InternalAccessError(Exception):
|
|
"""Raised when the caller cannot be safely resolved to a filter value
|
|
or when an internal table is misconfigured."""
|
|
|
|
|
|
def _filter_value(user: dict[str, Any], kind: str) -> str:
|
|
"""Derive the per-row filter value from the authenticated user.
|
|
|
|
For ``username`` we mirror the session-data-dir convention: the
|
|
local-part of the email. This is the same value the UsageProcessor
|
|
writes into ``usage_events.username`` / ``usage_session_summary``.
|
|
|
|
For ``user_id`` we use the ``users.id`` UUID directly — same value
|
|
audit_log writes carry.
|
|
"""
|
|
if kind == "username":
|
|
email = (user or {}).get("email", "") or ""
|
|
username = email.split("@")[0] if "@" in email else email
|
|
if not _USERNAME_RE.match(username):
|
|
raise InternalAccessError(f"user email {email!r} does not yield a safe username for scoping")
|
|
return username
|
|
if kind == "user_id":
|
|
uid = (user or {}).get("id", "") or ""
|
|
if not _USER_ID_RE.match(uid):
|
|
raise InternalAccessError(f"user_id {uid!r} fails the safe-identifier check")
|
|
return uid
|
|
raise InternalAccessError(f"unknown filter_kind: {kind!r}")
|
|
|
|
|
|
def build_filter_clause(table: InternalTable, user: dict[str, Any], is_admin: bool) -> str:
|
|
"""Return the WHERE clause for one internal table.
|
|
|
|
Admins get an empty string (unscoped view). Everyone else get
|
|
``WHERE <col> = '<value>'`` where value has been regex-validated.
|
|
|
|
``agnes_sessions`` and ``agnes_telemetry`` filter primarily on
|
|
``user_id`` (stable UUID) but include an OR fallback on
|
|
``username`` (email local-part) for rows that pre-date the v45
|
|
backfill. Once all rows carry a non-NULL ``user_id`` the
|
|
``legacy_username_column`` field can be removed.
|
|
"""
|
|
if is_admin:
|
|
return ""
|
|
value = _filter_value(user, table.filter_kind)
|
|
safe = value.replace("'", "''")
|
|
|
|
if table.legacy_username_column:
|
|
legacy = _filter_value(user, "username").replace("'", "''")
|
|
return f"WHERE ({table.filter_column} = '{safe}' OR {table.legacy_username_column} = '{legacy}')"
|
|
|
|
return f"WHERE {table.filter_column} = '{safe}'"
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Query execution
|
|
# ---------------------------------------------------------------------------
|
|
|
|
_TABLE_REF_RE = re.compile(
|
|
r"\b(" + "|".join(re.escape(t.registry_id) for t in INTERNAL_TABLES) + r")\b",
|
|
re.IGNORECASE,
|
|
)
|
|
|
|
# Single-quoted SQL string literals (with `''` escape handling). Stripped
|
|
# before reference detection so a non-admin can't trigger the internal
|
|
# privileged code path by smuggling the alias inside a literal.
|
|
_SQL_STRING_LITERAL_RE = re.compile(r"'(?:''|[^'])*'")
|
|
|
|
# SQL comments — block `/* … */` and `--` line forms. Stripped so a
|
|
# comment-wrapped table name (`/**/users/**/`) can't slip past the
|
|
# identifier scan downstream.
|
|
_SQL_BLOCK_COMMENT_RE = re.compile(r"/\*[\s\S]*?\*/")
|
|
_SQL_LINE_COMMENT_RE = re.compile(r"--[^\n]*")
|
|
|
|
|
|
def _strip_sql_noise(sql: str) -> str:
|
|
"""Strip string literals + block + line comments so the identifier
|
|
scanners that follow see only structural SQL. Order matters: strip
|
|
literals first (they can contain comment-looking text), then
|
|
comments (which can contain literal-looking text). String content
|
|
is replaced with empty `''` to keep token spacing intact."""
|
|
s = _SQL_STRING_LITERAL_RE.sub("''", sql)
|
|
s = _SQL_BLOCK_COMMENT_RE.sub(" ", s)
|
|
s = _SQL_LINE_COMMENT_RE.sub(" ", s)
|
|
return s
|
|
|
|
|
|
_INTERNAL_ALIAS_NAMES: frozenset[str] = frozenset(t.registry_id.lower() for t in INTERNAL_TABLES)
|
|
|
|
|
|
def _sensitive_table_reference(stripped_sql: str, conn) -> str | None:
|
|
"""Return the first non-allowlisted system.duckdb table name that
|
|
appears in ``stripped_sql``, or None if clean.
|
|
|
|
Allowlist = the registered ``agnes_*`` internal-table IDs. The
|
|
denylist is derived dynamically from ``information_schema.tables``
|
|
in the system.duckdb main schema, so adding a new sensitive table
|
|
in a future migration is automatically covered without re-editing
|
|
this module.
|
|
|
|
``stripped_sql`` MUST already have string literals and comments
|
|
stripped (see ``_strip_sql_noise``). Identifier scan is
|
|
case-insensitive word-boundary; schema-prefixed (`main.users`) and
|
|
double-quoted (`"users"`) forms both match because the bare name
|
|
still sits between word boundaries.
|
|
"""
|
|
rows = conn.execute("SELECT table_name FROM information_schema.tables WHERE table_schema = 'main'").fetchall()
|
|
for (name,) in rows:
|
|
if name is None:
|
|
continue
|
|
if name.lower() in _INTERNAL_ALIAS_NAMES:
|
|
continue
|
|
if re.search(rf"\b{re.escape(name)}\b", stripped_sql, re.IGNORECASE):
|
|
return name
|
|
return None
|
|
|
|
|
|
def find_internal_refs(sql: str) -> list[str]:
|
|
"""Word-boundary scan of `sql` for the registered internal table IDs.
|
|
|
|
Returns the matched IDs (lowercase, deduped) in declaration order.
|
|
String literals AND comments are stripped first so a literal /
|
|
commented mention of `agnes_sessions` doesn't route the request
|
|
into the privileged internal-query path (review #278 R2 / R3).
|
|
"""
|
|
stripped = _strip_sql_noise(sql)
|
|
found = {m.group(1).lower() for m in _TABLE_REF_RE.finditer(stripped)}
|
|
# Preserve declaration order so reasoning about the resulting set is stable.
|
|
return [t.registry_id for t in INTERNAL_TABLES if t.registry_id.lower() in found]
|
|
|
|
|
|
def execute_internal_query(
|
|
system_db_path: str,
|
|
user: dict[str, Any],
|
|
is_admin: bool,
|
|
sql: str,
|
|
limit: int = 1000,
|
|
) -> tuple[list[str], list[tuple], bool]:
|
|
"""Run a SELECT against per-request-scoped internal views.
|
|
|
|
Approach: wrap the user SQL in a CTE prefix that defines one
|
|
``agnes_*`` alias per referenced internal table, each scoped to the
|
|
caller's rows (admin → unscoped). We then ``SELECT * FROM (<user_sql>)``
|
|
inside the wrapper. DuckDB resolves the CTE aliases when it parses
|
|
the user_sql; the caller never sees the real ``usage_*`` tables.
|
|
|
|
Why a CTE wrapper instead of TEMP VIEW or ATTACH:
|
|
|
|
- ``duckdb.connect(path, read_only=True)`` from a fresh handle is
|
|
rejected when the app's main connection already holds
|
|
``system.duckdb`` open RW: "Can't open a connection to same
|
|
database file with a different configuration than existing
|
|
connections".
|
|
- ``ATTACH '<path>' AS sys (READ_ONLY)`` from a :memory: handle is
|
|
rejected with "Binder Error: Unique file handle conflict — the
|
|
database file is already attached by …": DuckDB serialises file
|
|
handles process-wide, so two connections can't reach the same
|
|
``.duckdb`` file even at different attach points.
|
|
- ``TEMP VIEW`` on the shared singleton connection bleeds across
|
|
concurrent requests (TEMP VIEWS are connection-scoped, and the
|
|
request-handler pool reuses the same handle).
|
|
|
|
CTE wrap leaves no residual state on the connection and isolates
|
|
naturally per request. The SQL stays in SELECT space; existing
|
|
keyword-denylist + sanitised-username defenses still apply.
|
|
"""
|
|
refs = find_internal_refs(sql)
|
|
if not refs:
|
|
raise InternalAccessError("no internal-table references in SQL")
|
|
|
|
# Lazy import to avoid a hard cycle (src.db imports go via repositories
|
|
# which then end up importing access in some test paths).
|
|
from src.db import get_system_db
|
|
|
|
# Non-admins are NOT allowed to reference any system.duckdb table
|
|
# outside the registered agnes_* aliases. The CTE wrapper only
|
|
# scopes those aliases; a direct FROM on the base table
|
|
# (`usage_session_summary`, `audit_log`, `users`,
|
|
# `personal_access_tokens`, etc.) would bypass row-level RBAC and
|
|
# leak other users' data. Denylist is derived dynamically from
|
|
# `information_schema.tables` — every table in system.duckdb that
|
|
# is NOT one of the agnes_* aliases is sensitive. This is
|
|
# future-proof: new tables added by later migrations are
|
|
# automatically covered without re-editing this module.
|
|
#
|
|
# Admin path is unaffected — admins have legitimate need to read
|
|
# raw rows, and the filter clause is empty for them anyway.
|
|
if not is_admin:
|
|
stripped = _strip_sql_noise(sql)
|
|
sensitive = _sensitive_table_reference(stripped, get_system_db())
|
|
if sensitive is not None:
|
|
raise InternalAccessError(
|
|
f"non-admin SQL cannot reference table {sensitive!r}; query one of the agnes_* aliases instead"
|
|
)
|
|
cte_parts = []
|
|
for table_id in refs:
|
|
table = INTERNAL_TABLES_BY_ID[table_id]
|
|
where_clause = build_filter_clause(table, user, is_admin)
|
|
cte_parts.append(f"{table.registry_id} AS (SELECT * FROM {table.source_table} {where_clause})")
|
|
cte_prefix = "WITH " + ", ".join(cte_parts)
|
|
wrapped = f"{cte_prefix} SELECT * FROM ({sql}) AS _agnes_user_query"
|
|
|
|
conn = get_system_db()
|
|
cursor = conn.cursor()
|
|
try:
|
|
rows = cursor.execute(wrapped).fetchmany(limit + 1)
|
|
cols = [d[0] for d in cursor.description] if cursor.description else []
|
|
truncated = len(rows) > limit
|
|
return cols, rows[:limit], truncated
|
|
finally:
|
|
try:
|
|
cursor.close()
|
|
except Exception:
|
|
logger.exception("close() failed on internal-query cursor")
|
|
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Schema introspection — feeds /api/v2/schema/{id} for internal tables
|
|
# ---------------------------------------------------------------------------
|
|
|
|
|
|
def get_schema(system_db_path: str, table_id: str) -> list[dict]:
|
|
"""Return the underlying physical schema for an internal table.
|
|
|
|
Used by ``/api/v2/schema/<id>`` so ``agnes schema <table>`` works
|
|
against internal sources. Reuses the shared ``system.duckdb``
|
|
connection — same rationale as ``execute_internal_query``: opening
|
|
a parallel handle to the same file is process-wide blocked. The
|
|
information_schema query is read-only and small.
|
|
|
|
``system_db_path`` is kept in the signature for API symmetry with the
|
|
earlier draft, but is unused — the singleton handle already knows the
|
|
path.
|
|
"""
|
|
if table_id not in INTERNAL_TABLES_BY_ID:
|
|
return []
|
|
table = INTERNAL_TABLES_BY_ID[table_id]
|
|
from src.db import get_system_db
|
|
|
|
cursor = get_system_db().cursor()
|
|
try:
|
|
rows = cursor.execute(
|
|
"SELECT column_name, data_type, is_nullable "
|
|
"FROM information_schema.columns "
|
|
"WHERE table_name = ? ORDER BY ordinal_position",
|
|
[table.source_table],
|
|
).fetchall()
|
|
return [{"name": r[0], "type": r[1], "nullable": r[2] == "YES"} for r in rows]
|
|
finally:
|
|
try:
|
|
cursor.close()
|
|
except Exception:
|
|
pass
|