* fix(rbac): stack-gate analyst table access via data_packages exclusively
Previously analysts could see a table in ``agnes catalog`` /
``/api/sync/manifest`` either by:
1. being in a group with ``resource_grants(group, 'table', id)``, or
2. being in a group with ``resource_grants(group, 'data_package', …)``
for a package containing the table.
Path 1 leaked: admins who minted a per-table grant without ever
wrapping the table in a data_package still shipped the table to
analysts — directly contradicting the unified-stack mental model
("the stack is the unit of access"). User report:
"i když to admin nedal do data package tak to by default uživatelé
dostali to by se nemělo stát".
New policy: analyst visibility is strictly stack-gated. A table is
visible iff at least one data_package containing it is in the
analyst's stack (required ∪ subscribed). Admin god-mode and the three
internal data-source tables (agnes_sessions / _telemetry / _audit
with row-level RBAC) keep their existing carve-outs.
Touched surfaces:
* ``src/rbac.can_access_table`` + ``get_accessible_tables`` —
routed through ``StackResolver.stack(user, DATA_PACKAGE)`` +
``data_package_tables`` join instead of ``resource_grants(table)``.
* ``app/api/sync._build_direct_tables_section`` — always returns
``[]`` (key kept for older CLI destructuring); per-table grants
no longer manifest.
* Standardised 403 detail across ``/api/data/*``, ``/api/query``,
``/api/v2/sample``, ``/api/v2/scan``, ``/api/v2/schema``:
``Table 'X' is not in your stack. Ask an admin to add it to a
Data Package you have access to (Required or in your stack),
then run `agnes pull` to refresh.`` Single source of truth lives
in ``src.rbac.table_not_in_stack_message`` so the wording stays
consistent across CLI surfaces.
UX side: ``/catalog/t/<id>`` (table detail page) dropped the four
editorial sections (Sample questions, What's inside, Things to know,
Pairs well with) per user feedback — the page's job is now
"what is this table, where do I find it" (hero + parent packages).
Tests:
* ``tests/conftest.grant_table_via_package`` / ``revoke_table_via_package``
— shared helpers that wrap a table in an auto-named data_package +
grant the package required to a custom group. Replaces the legacy
per-test ``_grant_table_to_analyst`` table-grant pattern.
* All 17 previously-failing legacy tests (test_access_control,
test_journey_rbac, test_audit_gap_*, test_rbac, …) migrated to use
the new helper; logic stays the same.
* ``tests/fixtures/analyst_bootstrap._grant_table_access`` updated
to wrap via data_package so the ``test_pat`` fixture's "two table
grants" semantics still ship parquets through ``agnes init``.
* New ``tests/test_table_not_in_stack_message.py`` locks in the
standardised 403 detail across the data + check-access endpoints.
5204 tests passing (added 1).
* fix(catalog): first-demo UX feedback — required-first grouping + longer card description
Two minor polish items from the 2026-05-19 stakeholder demo:
1. Required packages cluster at the top of the Browse grid instead of
being interleaved by ``created_at``. Sort key
``(requirement != 'required', name)`` runs before the adapter
call in both /catalog (data_packages) and /corporate-memory
(memory_domains) so the required block is visible without
scrolling. Regression test pins the order via
``data-id="…"`` position in rendered HTML.
2. ``.stack-card__desc`` line clamp bumped 2 → 4 lines. Two-line clamp
trailed almost every admin-authored description off in "…" before
the second clause, forcing a click-through to read it. The detail
page (/catalog/p/<slug>) keeps the unclamped body for longer
content.
* release: 0.55.3 — stack-gated analyst RBAC (BREAKING) + first-demo UX polish + #345 A/B/C/D + #347 UI consistency
157 lines
6.1 KiB
Python
157 lines
6.1 KiB
Python
"""Data download endpoint — streaming parquet files."""
|
|
|
|
import logging
|
|
import time
|
|
|
|
from fastapi import APIRouter, Depends, HTTPException, Request, Response
|
|
from fastapi.responses import FileResponse
|
|
import duckdb
|
|
|
|
from app.auth.dependencies import get_current_user, _get_db
|
|
from app.utils import get_data_dir as _get_data_dir
|
|
from src.audit_helpers import client_kind_from_user
|
|
from src.identifier_validation import _SAFE_QUOTED_IDENTIFIER
|
|
from src.rbac import can_access_table
|
|
from src.repositories.audit import AuditRepository
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
router = APIRouter(prefix="/api/data", tags=["data"])
|
|
|
|
|
|
@router.get("/{table_id}/check-access")
|
|
async def check_access(
|
|
table_id: str,
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Lightweight RBAC probe used by Caddy's ``forward_auth`` directive
|
|
to gate file_server-served parquet downloads without involving the
|
|
app's request workers in the bulk byte transfer.
|
|
|
|
Returns HTTP 204 No Content when the caller has read access to
|
|
``table_id``; HTTP 403 (via ``can_access_table`` returning False)
|
|
otherwise. Caddy treats 2xx as authorized and forwards the request
|
|
to its own ``file_server`` block; non-2xx is returned to the client
|
|
verbatim.
|
|
|
|
Why a separate endpoint and not just ``HEAD /download``: ``HEAD`` on
|
|
the FileResponse-based ``download`` handler still opens the file and
|
|
runs stat() to populate Content-Length / ETag. ``forward_auth`` calls
|
|
this endpoint on every request, so the per-call cost matters; a pure
|
|
RBAC check is ~1 ms while a HEAD path involves filesystem walks
|
|
(``rglob`` for the parquet across source subdirs).
|
|
"""
|
|
t0 = time.monotonic()
|
|
resource = f"table:{table_id}"[:256]
|
|
if not _SAFE_QUOTED_IDENTIFIER.match(table_id):
|
|
try:
|
|
AuditRepository(conn).log(
|
|
user_id=user.get("id"),
|
|
action="data.access_check",
|
|
resource=resource,
|
|
params={"granted": False,
|
|
"duration_ms": int((time.monotonic() - t0) * 1000),
|
|
"error": "invalid_table_id"},
|
|
result="error.404",
|
|
client_kind=client_kind_from_user(user),
|
|
)
|
|
except Exception:
|
|
logger.exception("audit_log write failed for data.access_check (invalid id); continuing")
|
|
raise HTTPException(status_code=404, detail="Table not found")
|
|
granted = can_access_table(user, table_id, conn)
|
|
try:
|
|
AuditRepository(conn).log(
|
|
user_id=user.get("id"),
|
|
action="data.access_check",
|
|
resource=resource,
|
|
params={
|
|
"granted": granted,
|
|
"duration_ms": int((time.monotonic() - t0) * 1000),
|
|
},
|
|
result="success" if granted else "error.403",
|
|
client_kind=client_kind_from_user(user),
|
|
)
|
|
except Exception:
|
|
logger.exception("audit_log write failed for data.access_check; continuing")
|
|
if not granted:
|
|
from src.rbac import table_not_in_stack_message
|
|
raise HTTPException(
|
|
status_code=403, detail=table_not_in_stack_message(table_id),
|
|
)
|
|
return Response(status_code=204)
|
|
|
|
|
|
@router.get("/{table_id}/download")
|
|
async def download_table(
|
|
table_id: str,
|
|
request: Request,
|
|
user: dict = Depends(get_current_user),
|
|
conn: duckdb.DuckDBPyConnection = Depends(_get_db),
|
|
):
|
|
"""Stream a parquet file for download. Supports ETag for caching.
|
|
|
|
On Caddy-fronted deployments the matching Caddyfile rule intercepts
|
|
``GET /api/data/{table_id}/download``, calls ``check-access`` via
|
|
``forward_auth``, and serves the parquet directly via ``file_server``
|
|
— bypassing this handler entirely. This handler stays as the
|
|
canonical fallback for non-Caddy deployments (dev `docker compose
|
|
up`, alternative reverse proxies, direct :8000 access) where the
|
|
bulk transfer goes through uvicorn.
|
|
"""
|
|
# Reject unsafe table_id before any filesystem or DB operations.
|
|
# Use the relaxed quoted-identifier check that allows dots and hyphens
|
|
# (Keboola table IDs like "in.c-crm.orders") while still blocking
|
|
# path-traversal characters (/, .., \) and quote/control chars.
|
|
if not _SAFE_QUOTED_IDENTIFIER.match(table_id):
|
|
raise HTTPException(status_code=404, detail="Table not found")
|
|
# Check access FIRST
|
|
if not can_access_table(user, table_id, conn):
|
|
from src.rbac import table_not_in_stack_message
|
|
raise HTTPException(
|
|
status_code=403, detail=table_not_in_stack_message(table_id),
|
|
)
|
|
|
|
data_dir = _get_data_dir()
|
|
|
|
# Search in extracts directory (v2 extract.duckdb architecture)
|
|
extracts_dir = data_dir / "extracts"
|
|
candidates = list(extracts_dir.rglob(f"data/{table_id}.parquet")) if extracts_dir.exists() else []
|
|
|
|
# Fallback to legacy path for backward compatibility
|
|
if not candidates:
|
|
parquet_dir = data_dir / "src_data" / "parquet"
|
|
candidates = list(parquet_dir.rglob(f"{table_id}.parquet"))
|
|
if not candidates:
|
|
candidates = list(parquet_dir.rglob(f"*/{table_id}.parquet"))
|
|
|
|
if not candidates:
|
|
raise HTTPException(status_code=404, detail=f"Table '{table_id}' not found")
|
|
|
|
file_path = candidates[0]
|
|
|
|
# ETag support
|
|
stat = file_path.stat()
|
|
etag = f'"{stat.st_mtime_ns}"'
|
|
if_none_match = request.headers.get("if-none-match")
|
|
if if_none_match == etag:
|
|
return Response(status_code=304)
|
|
|
|
try:
|
|
AuditRepository(conn).log(
|
|
user_id=user.get("id"),
|
|
action="data.download",
|
|
resource=f"table:{table_id}"[:256],
|
|
params={"bytes": stat.st_size, "format": "parquet"},
|
|
result="success",
|
|
client_kind=client_kind_from_user(user),
|
|
)
|
|
except Exception:
|
|
logger.exception("audit_log write failed for data.download; continuing")
|
|
|
|
return FileResponse(
|
|
path=file_path,
|
|
filename=f"{table_id}.parquet",
|
|
media_type="application/octet-stream",
|
|
headers={"ETag": etag},
|
|
)
|