agnes-the-ai-analyst/app/api/data.py
ZdenekSrotyr 62336bfd32
fix(rbac): stack-gated analyst access + first-demo polish (#333 follow-up) (#356)
* fix(rbac): stack-gate analyst table access via data_packages exclusively

Previously analysts could see a table in ``agnes catalog`` /
``/api/sync/manifest`` either by:
  1. being in a group with ``resource_grants(group, 'table', id)``, or
  2. being in a group with ``resource_grants(group, 'data_package', …)``
     for a package containing the table.

Path 1 leaked: admins who minted a per-table grant without ever
wrapping the table in a data_package still shipped the table to
analysts — directly contradicting the unified-stack mental model
("the stack is the unit of access"). User report:
"i když to admin nedal do data package tak to by default uživatelé
dostali to by se nemělo stát".

New policy: analyst visibility is strictly stack-gated. A table is
visible iff at least one data_package containing it is in the
analyst's stack (required ∪ subscribed). Admin god-mode and the three
internal data-source tables (agnes_sessions / _telemetry / _audit
with row-level RBAC) keep their existing carve-outs.

Touched surfaces:
* ``src/rbac.can_access_table`` + ``get_accessible_tables`` —
  routed through ``StackResolver.stack(user, DATA_PACKAGE)`` +
  ``data_package_tables`` join instead of ``resource_grants(table)``.
* ``app/api/sync._build_direct_tables_section`` — always returns
  ``[]`` (key kept for older CLI destructuring); per-table grants
  no longer manifest.
* Standardised 403 detail across ``/api/data/*``, ``/api/query``,
  ``/api/v2/sample``, ``/api/v2/scan``, ``/api/v2/schema``:
  ``Table 'X' is not in your stack. Ask an admin to add it to a
  Data Package you have access to (Required or in your stack),
  then run `agnes pull` to refresh.`` Single source of truth lives
  in ``src.rbac.table_not_in_stack_message`` so the wording stays
  consistent across CLI surfaces.

UX side: ``/catalog/t/<id>`` (table detail page) dropped the four
editorial sections (Sample questions, What's inside, Things to know,
Pairs well with) per user feedback — the page's job is now
"what is this table, where do I find it" (hero + parent packages).

Tests:
* ``tests/conftest.grant_table_via_package`` / ``revoke_table_via_package``
  — shared helpers that wrap a table in an auto-named data_package +
  grant the package required to a custom group. Replaces the legacy
  per-test ``_grant_table_to_analyst`` table-grant pattern.
* All 17 previously-failing legacy tests (test_access_control,
  test_journey_rbac, test_audit_gap_*, test_rbac, …) migrated to use
  the new helper; logic stays the same.
* ``tests/fixtures/analyst_bootstrap._grant_table_access`` updated
  to wrap via data_package so the ``test_pat`` fixture's "two table
  grants" semantics still ship parquets through ``agnes init``.
* New ``tests/test_table_not_in_stack_message.py`` locks in the
  standardised 403 detail across the data + check-access endpoints.

5204 tests passing (added 1).

* fix(catalog): first-demo UX feedback — required-first grouping + longer card description

Two minor polish items from the 2026-05-19 stakeholder demo:

1. Required packages cluster at the top of the Browse grid instead of
   being interleaved by ``created_at``. Sort key
   ``(requirement != 'required', name)`` runs before the adapter
   call in both /catalog (data_packages) and /corporate-memory
   (memory_domains) so the required block is visible without
   scrolling. Regression test pins the order via
   ``data-id="…"`` position in rendered HTML.

2. ``.stack-card__desc`` line clamp bumped 2 → 4 lines. Two-line clamp
   trailed almost every admin-authored description off in "…" before
   the second clause, forcing a click-through to read it. The detail
   page (/catalog/p/<slug>) keeps the unclamped body for longer
   content.

* release: 0.55.3 — stack-gated analyst RBAC (BREAKING) + first-demo UX polish + #345 A/B/C/D + #347 UI consistency
2026-05-19 17:01:14 +02:00

157 lines
6.1 KiB
Python

"""Data download endpoint — streaming parquet files."""
import logging
import time
from fastapi import APIRouter, Depends, HTTPException, Request, Response
from fastapi.responses import FileResponse
import duckdb
from app.auth.dependencies import get_current_user, _get_db
from app.utils import get_data_dir as _get_data_dir
from src.audit_helpers import client_kind_from_user
from src.identifier_validation import _SAFE_QUOTED_IDENTIFIER
from src.rbac import can_access_table
from src.repositories.audit import AuditRepository
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/data", tags=["data"])
@router.get("/{table_id}/check-access")
async def check_access(
table_id: str,
user: dict = Depends(get_current_user),
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
"""Lightweight RBAC probe used by Caddy's ``forward_auth`` directive
to gate file_server-served parquet downloads without involving the
app's request workers in the bulk byte transfer.
Returns HTTP 204 No Content when the caller has read access to
``table_id``; HTTP 403 (via ``can_access_table`` returning False)
otherwise. Caddy treats 2xx as authorized and forwards the request
to its own ``file_server`` block; non-2xx is returned to the client
verbatim.
Why a separate endpoint and not just ``HEAD /download``: ``HEAD`` on
the FileResponse-based ``download`` handler still opens the file and
runs stat() to populate Content-Length / ETag. ``forward_auth`` calls
this endpoint on every request, so the per-call cost matters; a pure
RBAC check is ~1 ms while a HEAD path involves filesystem walks
(``rglob`` for the parquet across source subdirs).
"""
t0 = time.monotonic()
resource = f"table:{table_id}"[:256]
if not _SAFE_QUOTED_IDENTIFIER.match(table_id):
try:
AuditRepository(conn).log(
user_id=user.get("id"),
action="data.access_check",
resource=resource,
params={"granted": False,
"duration_ms": int((time.monotonic() - t0) * 1000),
"error": "invalid_table_id"},
result="error.404",
client_kind=client_kind_from_user(user),
)
except Exception:
logger.exception("audit_log write failed for data.access_check (invalid id); continuing")
raise HTTPException(status_code=404, detail="Table not found")
granted = can_access_table(user, table_id, conn)
try:
AuditRepository(conn).log(
user_id=user.get("id"),
action="data.access_check",
resource=resource,
params={
"granted": granted,
"duration_ms": int((time.monotonic() - t0) * 1000),
},
result="success" if granted else "error.403",
client_kind=client_kind_from_user(user),
)
except Exception:
logger.exception("audit_log write failed for data.access_check; continuing")
if not granted:
from src.rbac import table_not_in_stack_message
raise HTTPException(
status_code=403, detail=table_not_in_stack_message(table_id),
)
return Response(status_code=204)
@router.get("/{table_id}/download")
async def download_table(
table_id: str,
request: Request,
user: dict = Depends(get_current_user),
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
"""Stream a parquet file for download. Supports ETag for caching.
On Caddy-fronted deployments the matching Caddyfile rule intercepts
``GET /api/data/{table_id}/download``, calls ``check-access`` via
``forward_auth``, and serves the parquet directly via ``file_server``
— bypassing this handler entirely. This handler stays as the
canonical fallback for non-Caddy deployments (dev `docker compose
up`, alternative reverse proxies, direct :8000 access) where the
bulk transfer goes through uvicorn.
"""
# Reject unsafe table_id before any filesystem or DB operations.
# Use the relaxed quoted-identifier check that allows dots and hyphens
# (Keboola table IDs like "in.c-crm.orders") while still blocking
# path-traversal characters (/, .., \) and quote/control chars.
if not _SAFE_QUOTED_IDENTIFIER.match(table_id):
raise HTTPException(status_code=404, detail="Table not found")
# Check access FIRST
if not can_access_table(user, table_id, conn):
from src.rbac import table_not_in_stack_message
raise HTTPException(
status_code=403, detail=table_not_in_stack_message(table_id),
)
data_dir = _get_data_dir()
# Search in extracts directory (v2 extract.duckdb architecture)
extracts_dir = data_dir / "extracts"
candidates = list(extracts_dir.rglob(f"data/{table_id}.parquet")) if extracts_dir.exists() else []
# Fallback to legacy path for backward compatibility
if not candidates:
parquet_dir = data_dir / "src_data" / "parquet"
candidates = list(parquet_dir.rglob(f"{table_id}.parquet"))
if not candidates:
candidates = list(parquet_dir.rglob(f"*/{table_id}.parquet"))
if not candidates:
raise HTTPException(status_code=404, detail=f"Table '{table_id}' not found")
file_path = candidates[0]
# ETag support
stat = file_path.stat()
etag = f'"{stat.st_mtime_ns}"'
if_none_match = request.headers.get("if-none-match")
if if_none_match == etag:
return Response(status_code=304)
try:
AuditRepository(conn).log(
user_id=user.get("id"),
action="data.download",
resource=f"table:{table_id}"[:256],
params={"bytes": stat.st_size, "format": "parquet"},
result="success",
client_kind=client_kind_from_user(user),
)
except Exception:
logger.exception("audit_log write failed for data.download; continuing")
return FileResponse(
path=file_path,
filename=f"{table_id}.parquet",
media_type="application/octet-stream",
headers={"ETag": etag},
)