agnes-the-ai-analyst/docs/superpowers/plans/2026-04-27-claude-fetch-primitives.md
ZdenekSrotyr 2e1dfb7553
feat(v2): claude-driven fetch primitives + 0.14.0 (#102)
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
2026-04-29 01:07:19 +02:00

126 KiB

Claude-Driven Fetch Primitives Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Replace the broken BQ-view-wrapping approach (issue #101) with primitive operations the Claude agent composes — da catalog, da schema, da fetch, da snapshot *, da query — backed by /api/v2/{catalog,schema,sample,scan,scan/estimate} server endpoints. Secrets stay server-side; agent does the planning.

Architecture: Two-tier query model (laptop DuckDB ↔ server DuckDB ↔ BQ) preserved. New v2 endpoints expose discovery + scoped scan. CLI commands materialize filtered subsets locally as parquet snapshots registered as DuckDB views. Server-side WHERE validator (sqlglot, allow-list-driven) is the security perimeter.

Tech Stack: Python 3.13, FastAPI, DuckDB, pyarrow (Arrow IPC over HTTP), sqlglot (server-side WHERE validation), bigquery_query() DuckDB BQ extension function, GCE metadata-token auth (#98 cache reused), pytest. No new dependencies beyond sqlglot (already optional in repo).

Spec: docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md


File structure

New files (server):

  • app/api/where_validator.py — sqlglot-backed WHERE clause validator (§3.7 of spec)
  • app/api/v2_quota.py — process-local concurrent + daily-byte quota tracker
  • app/api/v2_cache.py — LRU+TTL cache helper for catalog/schema/sample
  • app/api/v2_arrow.py — Arrow IPC streaming helper (response builder)
  • app/api/v2_catalog.pyGET /api/v2/catalog
  • app/api/v2_schema.pyGET /api/v2/schema/{table_id}
  • app/api/v2_sample.pyGET /api/v2/sample/{table_id}
  • app/api/v2_scan.pyPOST /api/v2/scan + POST /api/v2/scan/estimate

New files (client):

  • cli/v2_client.py — Arrow over HTTP client + JSON request helpers
  • cli/snapshot_meta.py — sidecar JSON I/O + flock helper
  • cli/commands/fetch.pyda fetch
  • cli/commands/snapshot.pyda snapshot list/refresh/drop/prune
  • cli/commands/catalog.pyda catalog
  • cli/commands/schema.pyda schema
  • cli/commands/describe.pyda describe
  • cli/commands/disk_info.pyda disk-info

New files (tests):

  • tests/test_where_validator.py — adversarial corpus (50+ cases)
  • tests/test_v2_quota.py, tests/test_v2_cache.py, tests/test_v2_arrow.py
  • tests/test_v2_catalog.py, tests/test_v2_schema.py, tests/test_v2_sample.py, tests/test_v2_scan.py, tests/test_v2_scan_estimate.py
  • tests/test_cli_fetch.py, tests/test_cli_snapshot.py, tests/test_cli_catalog.py, tests/test_cli_schema.py, tests/test_cli_describe.py, tests/test_cli_disk_info.py
  • tests/test_snapshot_meta.py, tests/test_v2_client.py

New files (docs/skill):

  • cli/skills/agnes-data-querying.md — agent rails skill (§5.2)

Modified files:

  • app/main.py — register v2 routers
  • cli/main.py — register new command groups
  • connectors/bigquery/extractor.py — drop wrap-view code path for VIEW entities + legacy_wrap_views toggle
  • tests/test_bigquery_extractor.py — update tests for legacy toggle
  • CLAUDE.md — agent rails addendum (§5.1)
  • CHANGELOG.md**BREAKING** entry under [Unreleased]
  • config/instance.yaml.example — new api.scan.* knobs

Task 1: WHERE validator — parser + structural rejects

Foundation for /api/v2/scan security perimeter. Spec §3.7 part 1.

Files:

  • Create: app/api/where_validator.py

  • Test: tests/test_where_validator.py

  • Step 1.1: Write failing tests for parse + structural rejects

# tests/test_where_validator.py
"""Adversarial test corpus for the WHERE clause validator (spec §3.7)."""

import pytest
from app.api.where_validator import (
    validate_where,
    WhereValidationError,
    REJECT_NESTED_SELECT,
    REJECT_MULTI_STATEMENT,
    REJECT_DDL_DML,
    REJECT_PARSE,
    REJECT_CROSS_TABLE,
)


# A schema-like dict the validator uses to verify column references.
SCHEMA = {
    "event_date": "DATE",
    "country_code": "STRING",
    "session_id": "STRING",
    "amount": "INT64",
}
TABLE_ID = "web_sessions_example"


class TestParse:
    def test_empty_string_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_PARSE

    def test_unparseable_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("SELECT * FROM", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_PARSE


class TestStructural:
    def test_nested_select_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                "country_code IN (SELECT country FROM other_table)",
                TABLE_ID, SCHEMA,
            )
        assert e.value.kind == REJECT_NESTED_SELECT

    def test_multi_statement_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("amount = 1; DROP TABLE x", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_MULTI_STATEMENT

    def test_drop_table_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("amount = (DROP TABLE x)", TABLE_ID, SCHEMA)
        assert e.value.kind in (REJECT_DDL_DML, REJECT_PARSE)

    def test_cross_table_reference_rejected(self):
        """Predicates may only reference the target table."""
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                "other_table.id = 1",
                TABLE_ID, SCHEMA,
            )
        assert e.value.kind == REJECT_CROSS_TABLE
  • Step 1.2: Run tests to verify failure

Run: pytest tests/test_where_validator.py::TestParse tests/test_where_validator.py::TestStructural -v Expected: FAIL with ModuleNotFoundError: No module named 'app.api.where_validator'

  • Step 1.3: Implement parser + structural validator

Create app/api/where_validator.py:

"""WHERE clause validator for /api/v2/scan.

Single security perimeter — every analyst-supplied predicate flows through here
before reaching BigQuery. Allow-list-driven; explicit rejection codes per spec §3.7.
"""

from __future__ import annotations
import logging
from dataclasses import dataclass
from typing import Mapping

import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

logger = logging.getLogger(__name__)

# Rejection kind codes (stable; used by callers + tests + audit log)
REJECT_PARSE = "parse_error"
REJECT_NESTED_SELECT = "nested_select"
REJECT_MULTI_STATEMENT = "multi_statement"
REJECT_DDL_DML = "ddl_or_dml"
REJECT_CROSS_TABLE = "cross_table_reference"
REJECT_UNKNOWN_FUNCTION = "unknown_function"
REJECT_UNKNOWN_COLUMN = "unknown_column"
REJECT_DISALLOWED_NODE = "disallowed_node"


@dataclass
class WhereValidationError(Exception):
    kind: str
    message: str
    detail: dict | None = None

    def __str__(self) -> str:
        return f"[{self.kind}] {self.message}"


# Nodes that imply DDL/DML (rejected outright).
_DDL_DML_NODES = (
    exp.Insert, exp.Update, exp.Delete, exp.Drop, exp.Truncate,
    exp.Alter, exp.Create, exp.Copy, exp.Merge,
)


def validate_where(
    predicate: str,
    table_id: str,
    schema: Mapping[str, str],
) -> exp.Expression:
    """Validate a WHERE-clause fragment.

    Args:
        predicate: SQL fragment (without leading 'WHERE').
        table_id: target table id; cross-table references rejected.
        schema: {column_name: type} for the target table.

    Returns:
        Parsed sqlglot expression tree (caller may re-stringify or inspect).

    Raises:
        WhereValidationError: with .kind set to one of the REJECT_* codes.
    """
    if not predicate or not predicate.strip():
        raise WhereValidationError(REJECT_PARSE, "empty predicate")

    # Multi-statement detection: BQ statements separated by ';' would parse
    # as multiple expressions in sqlglot.parse() (returns a list).
    try:
        statements = sqlglot.parse(f"SELECT 1 FROM t WHERE {predicate}", dialect="bigquery")
    except ParseError as e:
        raise WhereValidationError(REJECT_PARSE, f"parse failed: {e}")

    if statements is None or len(statements) != 1 or statements[0] is None:
        raise WhereValidationError(REJECT_MULTI_STATEMENT, "multi-statement input not allowed")

    select = statements[0]
    where = select.find(exp.Where)
    if where is None:
        raise WhereValidationError(REJECT_PARSE, "no WHERE expression found in parsed input")

    _walk_structural(where, table_id, schema)
    return where


def _walk_structural(node: exp.Expression, table_id: str, schema: Mapping[str, str]) -> None:
    """Walk the WHERE AST and reject disallowed structures."""
    for sub in node.walk():
        # `node.walk()` yields the node itself first; check structural rules.
        if isinstance(sub, exp.Subquery) or (isinstance(sub, exp.Select) and sub is not node):
            raise WhereValidationError(REJECT_NESTED_SELECT, "nested SELECT/subquery not allowed")
        if isinstance(sub, _DDL_DML_NODES):
            raise WhereValidationError(REJECT_DDL_DML, f"DDL/DML node {type(sub).__name__} not allowed")

    # Cross-table reference detection: any column with a qualifier other than
    # the target table_id (or unqualified) is rejected.
    for col in node.find_all(exp.Column):
        qualifier = col.table  # e.g. "other_table" in `other_table.id`
        if qualifier and qualifier.lower() != table_id.lower():
            raise WhereValidationError(
                REJECT_CROSS_TABLE,
                f"column {col.sql()} references table {qualifier!r}, expected {table_id!r}",
            )
  • Step 1.4: Run tests to verify pass

Run: pytest tests/test_where_validator.py::TestParse tests/test_where_validator.py::TestStructural -v Expected: 5 passed

  • Step 1.5: Commit
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): WHERE clause parser + structural rejects"

Task 2: WHERE validator — function allow-list

Spec §3.7 enumerated function set. Reject unknown functions with explicit name in error.

Files:

  • Modify: app/api/where_validator.py

  • Modify: tests/test_where_validator.py

  • Step 2.1: Append failing tests

# tests/test_where_validator.py (append after TestStructural)

class TestFunctionAllowList:
    @pytest.mark.parametrize(
        "predicate",
        [
            # Comparison
            "amount = 1", "amount != 1", "amount IS NULL", "amount IS NOT NULL",
            "country_code IN ('CZ', 'SK')", "amount BETWEEN 1 AND 100",
            "country_code LIKE 'C%'", "country_code NOT LIKE 'X%'",
            # Boolean
            "amount = 1 AND country_code = 'CZ'",
            "amount = 1 OR amount = 2",
            "NOT (amount = 1)",
            # Date/Time
            "event_date > DATE '2026-01-01'",
            "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
            "EXTRACT(YEAR FROM event_date) = 2026",
            # String
            "STARTS_WITH(country_code, 'C')",
            "REGEXP_CONTAINS(country_code, r'C[ZS]')",
            "LENGTH(country_code) = 2",
            # Math
            "amount > ABS(-5)",
            "amount BETWEEN GREATEST(0, 10) AND LEAST(100, 200)",
            # Cast
            "CAST(country_code AS STRING) = 'CZ'",
            # Conditional
            "IFNULL(country_code, 'XX') = 'CZ'",
            "COALESCE(amount, 0) > 0",
        ],
    )
    def test_allowed_predicate(self, predicate):
        validate_where(predicate, TABLE_ID, SCHEMA)  # must not raise

    @pytest.mark.parametrize(
        "predicate,expected_func",
        [
            ("amount = EXTERNAL_QUERY('connection', 'SELECT 1')", "EXTERNAL_QUERY"),
            ("country_code = SESSION_USER()", "SESSION_USER"),
            ("amount = ML.PREDICT(MODEL m, TABLE t)", "ML.PREDICT"),
            ("amount = OBSCURE_BUILTIN(country_code)", "OBSCURE_BUILTIN"),
            ("amount = ARRAY_AGG(amount)", "ARRAY_AGG"),
            ("amount = ROW_NUMBER() OVER (PARTITION BY country_code)", "ROW_NUMBER"),
        ],
    )
    def test_disallowed_function(self, predicate, expected_func):
        with pytest.raises(WhereValidationError) as e:
            validate_where(predicate, TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_UNKNOWN_FUNCTION
        # The rejected function name must appear in detail or message
        assert expected_func.upper() in str(e.value).upper() or (
            e.value.detail and expected_func.upper() in str(e.value.detail).upper()
        )
  • Step 2.2: Run tests to verify failure

Run: pytest tests/test_where_validator.py::TestFunctionAllowList -v Expected: FAIL — current validator allows any function (no allow-list yet).

  • Step 2.3: Add allow-list + function check

In app/api/where_validator.py, after the imports and _DDL_DML_NODES:

# v1 BigQuery function allow-list (spec §3.7). Stored as upper-case names.
# Categorized for documentation; merged into one set for membership check.
_ALLOW_FUNCTIONS_COMPARISON = {
    # Operators are AST nodes, not functions; not listed here.
}
_ALLOW_FUNCTIONS_DATETIME = {
    "CURRENT_DATE", "CURRENT_TIMESTAMP", "CURRENT_TIME",
    "DATE", "DATETIME", "TIMESTAMP", "TIME",
    "DATE_ADD", "DATE_SUB", "DATE_DIFF", "DATE_TRUNC", "EXTRACT",
    "FORMAT_DATE", "FORMAT_TIMESTAMP", "PARSE_DATE", "PARSE_TIMESTAMP",
    "UNIX_SECONDS", "UNIX_MILLIS",
}
_ALLOW_FUNCTIONS_STRING = {
    "CONCAT", "LENGTH", "LOWER", "UPPER", "SUBSTR", "SUBSTRING",
    "TRIM", "LTRIM", "RTRIM", "REPLACE",
    "STARTS_WITH", "ENDS_WITH", "CONTAINS_SUBSTR",
    "REGEXP_CONTAINS", "REGEXP_EXTRACT", "SAFE_CAST",
}
_ALLOW_FUNCTIONS_MATH = {
    "ABS", "CEIL", "FLOOR", "ROUND", "MOD", "POWER", "SQRT",
    "LOG", "LN", "EXP", "SIGN", "GREATEST", "LEAST",
}
_ALLOW_FUNCTIONS_CAST = {"CAST"}
_ALLOW_FUNCTIONS_CONDITIONAL = {"IF", "IFNULL", "COALESCE", "NULLIF", "CASE"}

ALLOWED_FUNCTIONS: frozenset[str] = frozenset(
    _ALLOW_FUNCTIONS_DATETIME
    | _ALLOW_FUNCTIONS_STRING
    | _ALLOW_FUNCTIONS_MATH
    | _ALLOW_FUNCTIONS_CAST
    | _ALLOW_FUNCTIONS_CONDITIONAL
)

# CAST target types allowed
_ALLOW_CAST_TYPES = {
    "INT64", "FLOAT64", "NUMERIC", "STRING", "BYTES", "BOOL",
    "DATE", "DATETIME", "TIMESTAMP", "TIME", "DECIMAL", "BIGNUMERIC",
}

Then add _walk_functions() and call it from _walk_structural. Add at the end of _walk_structural:

    _walk_functions(node)

And new helper:

def _walk_functions(node: exp.Expression) -> None:
    for func in node.find_all(exp.Func):
        # Window functions, aggregates, anonymous funcs — sqlglot uses subclasses.
        if isinstance(func, exp.Window):
            raise WhereValidationError(
                REJECT_UNKNOWN_FUNCTION,
                f"window function not allowed: {func.sql()}",
                detail={"function": "WINDOW"},
            )
        if isinstance(func, exp.AggFunc):
            raise WhereValidationError(
                REJECT_UNKNOWN_FUNCTION,
                f"aggregate function not allowed in WHERE: {type(func).__name__}",
                detail={"function": type(func).__name__.upper()},
            )

        # `func.name` is the SQL function name; might be empty for built-in operators.
        name = (func.name or "").upper()
        # Anonymous function nodes carry their identifier in a different slot
        if not name and hasattr(func, "this") and hasattr(func.this, "name"):
            name = (func.this.name or "").upper()

        # Skip operators-as-nodes (Add, Sub, Mul, Div, Eq, Neq, Lt, Gt, Like, In, Between, etc.)
        # — these are exp.Binary subclasses, not exp.Func subclasses, so usually not seen here.
        # But be defensive: if name is empty AFTER all heuristics, skip rather than flag.
        if not name:
            continue

        if name not in ALLOWED_FUNCTIONS:
            raise WhereValidationError(
                REJECT_UNKNOWN_FUNCTION,
                f"function not in v1 allow-list: {name}",
                detail={"function": name},
            )
  • Step 2.4: Run tests to verify pass

Run: pytest tests/test_where_validator.py -v Expected: all previous + new TestFunctionAllowList pass.

  • Step 2.5: Commit
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): function allow-list with explicit reject codes"

Task 3: WHERE validator — column existence + identifier-path validation

Reject WHERE referring to columns not in the target schema. Spec §3.7 identifier-path section.

Files:

  • Modify: app/api/where_validator.py

  • Modify: tests/test_where_validator.py

  • Step 3.1: Append failing tests

# tests/test_where_validator.py (append)

class TestColumnExistence:
    def test_known_column_accepted(self):
        validate_where("country_code = 'CZ'", TABLE_ID, SCHEMA)

    def test_unknown_column_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("nonexistent_field = 'X'", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_UNKNOWN_COLUMN
        assert "nonexistent_field" in str(e.value).lower()

    def test_qualified_known_column_accepted(self):
        # Same-table qualifier is allowed
        validate_where(
            f"{TABLE_ID}.country_code = 'CZ'",
            TABLE_ID, SCHEMA,
        )

    def test_qualified_unknown_column_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                f"{TABLE_ID}.bogus_field = 'X'",
                TABLE_ID, SCHEMA,
            )
        assert e.value.kind == REJECT_UNKNOWN_COLUMN
  • Step 3.2: Run tests to verify failure

Run: pytest tests/test_where_validator.py::TestColumnExistence -v Expected: 3 passes (qualified known + unqualified known will pass already), 1 fail on unknown_column expectation OR all 4 fail because the validator never checks columns.

  • Step 3.3: Add column-existence check

Add new helper after _walk_functions:

def _walk_columns(node: exp.Expression, schema: Mapping[str, str]) -> None:
    """Reject column references not present in the target table's schema."""
    known = {c.lower() for c in schema}
    for col in node.find_all(exp.Column):
        # `col.name` is the leaf column name (e.g. "country_code" in
        # "tbl.country_code"). For dotted struct fields like "rec.sub.leaf",
        # sqlglot models as nested exp.Dot; v1 only checks top-level names.
        leaf = (col.name or "").lower()
        if leaf and leaf not in known:
            raise WhereValidationError(
                REJECT_UNKNOWN_COLUMN,
                f"column {col.name!r} not in schema for {col.table!r}",
                detail={"column": col.name},
            )

Call from _walk_structural after _walk_functions(node):

    _walk_functions(node)
    _walk_columns(node, schema)
  • Step 3.4: Run tests to verify pass

Run: pytest tests/test_where_validator.py -v Expected: all pass (including the new TestColumnExistence).

  • Step 3.5: Commit
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): column-existence check via target-table schema"

Task 4: Process-local quota tracker

Spec §3.8. Per-user concurrent count + daily byte cap, in-memory. Multi-replica caveat documented in spec §9.4 — out of scope.

Files:

  • Create: app/api/v2_quota.py

  • Test: tests/test_v2_quota.py

  • Step 4.1: Write failing tests

# tests/test_v2_quota.py
"""Tests for the process-local v2 scan quota tracker (spec §3.8)."""

from datetime import datetime, timedelta, timezone
import pytest

from app.api.v2_quota import (
    QuotaTracker,
    QuotaExceededError,
    KIND_CONCURRENT,
    KIND_DAILY_BYTES,
)


def make_tracker(max_concurrent=5, max_daily_bytes=100):
    return QuotaTracker(
        max_concurrent_per_user=max_concurrent,
        max_daily_bytes_per_user=max_daily_bytes,
    )


class TestConcurrent:
    def test_acquire_within_cap_succeeds(self):
        q = make_tracker(max_concurrent=3)
        with q.acquire(user="alice"):
            with q.acquire(user="alice"):
                with q.acquire(user="alice"):
                    pass

    def test_acquire_above_cap_raises(self):
        q = make_tracker(max_concurrent=2)
        with q.acquire(user="alice"):
            with q.acquire(user="alice"):
                with pytest.raises(QuotaExceededError) as e:
                    with q.acquire(user="alice"):
                        pass
                assert e.value.kind == KIND_CONCURRENT
                assert e.value.current == 2
                assert e.value.limit == 2

    def test_release_on_context_exit(self):
        q = make_tracker(max_concurrent=1)
        with q.acquire(user="alice"):
            pass
        # Counter dropped on exit; new acquire works
        with q.acquire(user="alice"):
            pass

    def test_release_on_exception(self):
        q = make_tracker(max_concurrent=1)
        try:
            with q.acquire(user="alice"):
                raise RuntimeError("boom")
        except RuntimeError:
            pass
        with q.acquire(user="alice"):
            pass

    def test_per_user_isolation(self):
        q = make_tracker(max_concurrent=1)
        with q.acquire(user="alice"):
            with q.acquire(user="bob"):
                pass


class TestDailyBytes:
    def test_record_within_cap(self):
        q = make_tracker(max_daily_bytes=1000)
        q.record_bytes(user="alice", n=300)
        q.record_bytes(user="alice", n=400)
        assert q.bytes_used_today(user="alice") == 700

    def test_record_above_cap_raises(self):
        q = make_tracker(max_daily_bytes=1000)
        q.record_bytes(user="alice", n=600)
        with pytest.raises(QuotaExceededError) as e:
            q.record_bytes(user="alice", n=500)
        assert e.value.kind == KIND_DAILY_BYTES
        assert e.value.current == 1100  # would-be total
        assert e.value.limit == 1000

    def test_per_user_isolation(self):
        q = make_tracker(max_daily_bytes=100)
        q.record_bytes(user="alice", n=80)
        q.record_bytes(user="bob", n=80)  # bob's bucket independent
        with pytest.raises(QuotaExceededError):
            q.record_bytes(user="alice", n=30)

    def test_reset_on_utc_midnight(self, monkeypatch):
        q = make_tracker(max_daily_bytes=100)
        # Simulate the day boundary by injecting "now"
        d1 = datetime(2026, 4, 27, 23, 0, 0, tzinfo=timezone.utc)
        monkeypatch.setattr("app.api.v2_quota._utcnow", lambda: d1)
        q.record_bytes(user="alice", n=80)
        assert q.bytes_used_today(user="alice") == 80

        d2 = d1 + timedelta(hours=2)  # crosses UTC midnight
        monkeypatch.setattr("app.api.v2_quota._utcnow", lambda: d2)
        assert q.bytes_used_today(user="alice") == 0
        q.record_bytes(user="alice", n=80)  # ok, fresh bucket
  • Step 4.2: Run tests to verify failure

Run: pytest tests/test_v2_quota.py -v Expected: FAIL with ModuleNotFoundError: No module named 'app.api.v2_quota'

  • Step 4.3: Implement quota tracker

Create app/api/v2_quota.py:

"""Process-local quota tracker for /api/v2/scan (spec §3.8).

In-memory only. Multi-replica deployments effectively multiply caps by N
(documented caveat — see spec §9.4). Future v2 should move to durable
storage if horizontal scale is needed.
"""

from __future__ import annotations
import contextlib
import logging
import threading
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator

logger = logging.getLogger(__name__)

KIND_CONCURRENT = "concurrent_scans"
KIND_DAILY_BYTES = "daily_bytes"


@dataclass
class QuotaExceededError(Exception):
    kind: str
    current: int
    limit: int
    retry_after_seconds: int = 0

    def __str__(self) -> str:
        return f"{self.kind}: {self.current}/{self.limit}"


def _utcnow() -> datetime:  # patched in tests
    return datetime.now(timezone.utc)


def _utc_today() -> str:
    """ISO date string in UTC, used as the daily-bucket key."""
    return _utcnow().strftime("%Y-%m-%d")


class QuotaTracker:
    """Thread-safe quota state. Caller wraps each request in `with q.acquire(user)`,
    and after the BQ result lands records bytes via `record_bytes(user, n)`.
    """

    def __init__(self, *, max_concurrent_per_user: int, max_daily_bytes_per_user: int):
        self._max_concurrent = max_concurrent_per_user
        self._max_daily_bytes = max_daily_bytes_per_user
        self._lock = threading.Lock()
        # state: { user_id: { "concurrent": int, "bucket_day": "YYYY-MM-DD", "bytes": int } }
        self._state: dict[str, dict] = {}

    def _ensure_bucket(self, user: str) -> dict:
        today = _utc_today()
        s = self._state.setdefault(user, {"concurrent": 0, "bucket_day": today, "bytes": 0})
        if s["bucket_day"] != today:
            s["bucket_day"] = today
            s["bytes"] = 0
        return s

    @contextlib.contextmanager
    def acquire(self, user: str) -> Iterator[None]:
        with self._lock:
            s = self._ensure_bucket(user)
            if s["concurrent"] >= self._max_concurrent:
                raise QuotaExceededError(
                    kind=KIND_CONCURRENT,
                    current=s["concurrent"],
                    limit=self._max_concurrent,
                )
            s["concurrent"] += 1
        try:
            yield
        finally:
            with self._lock:
                s = self._ensure_bucket(user)
                s["concurrent"] = max(0, s["concurrent"] - 1)

    def record_bytes(self, user: str, n: int) -> None:
        if n <= 0:
            return
        with self._lock:
            s = self._ensure_bucket(user)
            new_total = s["bytes"] + n
            if new_total > self._max_daily_bytes:
                # Surface the would-be total so caller can include it in 429 body.
                raise QuotaExceededError(
                    kind=KIND_DAILY_BYTES,
                    current=new_total,
                    limit=self._max_daily_bytes,
                    retry_after_seconds=_seconds_until_utc_midnight(),
                )
            s["bytes"] = new_total

    def bytes_used_today(self, user: str) -> int:
        with self._lock:
            return self._ensure_bucket(user)["bytes"]


def _seconds_until_utc_midnight() -> int:
    now = _utcnow()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0).replace(day=now.day)
    # Next midnight = today's midnight + 1 day
    from datetime import timedelta
    next_midnight = midnight + timedelta(days=1)
    return int((next_midnight - now).total_seconds())
  • Step 4.4: Run tests to verify pass

Run: pytest tests/test_v2_quota.py -v Expected: 8 passed.

  • Step 4.5: Commit
git add app/api/v2_quota.py tests/test_v2_quota.py
git commit -m "feat(v2): process-local quota tracker (concurrent + daily bytes)"

Task 5: LRU+TTL cache helper

Used by catalog/schema/sample endpoints. Spec §3.6.

Files:

  • Create: app/api/v2_cache.py

  • Test: tests/test_v2_cache.py

  • Step 5.1: Write failing tests

# tests/test_v2_cache.py
import pytest
import time

from app.api.v2_cache import TTLCache


class TestTTLCache:
    def test_set_get(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("k", "v")
        assert c.get("k") == "v"

    def test_get_missing_returns_default(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        assert c.get("missing") is None
        assert c.get("missing", default="x") == "x"

    def test_expiry(self, monkeypatch):
        now = [1000.0]
        monkeypatch.setattr("app.api.v2_cache._now", lambda: now[0])
        c = TTLCache(maxsize=10, ttl_seconds=10)
        c.set("k", "v")
        assert c.get("k") == "v"
        now[0] += 11
        assert c.get("k") is None  # expired

    def test_lru_eviction(self):
        c = TTLCache(maxsize=2, ttl_seconds=60)
        c.set("a", 1)
        c.set("b", 2)
        c.set("c", 3)  # should evict 'a' (LRU)
        assert c.get("a") is None
        assert c.get("b") == 2
        assert c.get("c") == 3

    def test_invalidate(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("k", "v")
        c.invalidate("k")
        assert c.get("k") is None

    def test_clear(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("a", 1)
        c.set("b", 2)
        c.clear()
        assert c.get("a") is None
        assert c.get("b") is None
  • Step 5.2: Run tests to verify failure

Run: pytest tests/test_v2_cache.py -v Expected: FAIL — module doesn't exist.

  • Step 5.3: Implement TTLCache

Create app/api/v2_cache.py:

"""Simple thread-safe LRU + TTL cache for v2 endpoints."""

from __future__ import annotations
import threading
import time
from collections import OrderedDict
from typing import Any


def _now() -> float:  # patched in tests
    return time.monotonic()


class TTLCache:
    def __init__(self, *, maxsize: int, ttl_seconds: float):
        self._max = maxsize
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._data: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return default
            expiry, value = entry
            if _now() > expiry:
                del self._data[key]
                return default
            self._data.move_to_end(key)  # mark as recently used
            return value

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            expiry = _now() + self._ttl
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = (expiry, value)
            while len(self._data) > self._max:
                self._data.popitem(last=False)

    def invalidate(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)

    def clear(self) -> None:
        with self._lock:
            self._data.clear()
  • Step 5.4: Run tests to verify pass

Run: pytest tests/test_v2_cache.py -v Expected: 6 passed.

  • Step 5.5: Commit
git add app/api/v2_cache.py tests/test_v2_cache.py
git commit -m "feat(v2): TTLCache helper (LRU + TTL, thread-safe)"

Task 6: Arrow IPC streaming response helper

Spec §3.4 step 9. Used by /api/v2/scan. Wraps a pyarrow Table or RecordBatchReader as an HTTP streaming response.

Files:

  • Create: app/api/v2_arrow.py

  • Test: tests/test_v2_arrow.py

  • Step 6.1: Write failing tests

# tests/test_v2_arrow.py
import io
import pyarrow as pa
import pytest

from app.api.v2_arrow import arrow_table_to_ipc_bytes, parse_ipc_bytes


def test_round_trip_simple_table():
    src = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    body = arrow_table_to_ipc_bytes(src)
    assert isinstance(body, bytes) and len(body) > 0
    got = parse_ipc_bytes(body)
    assert got.equals(src)


def test_empty_table_round_trip():
    src = pa.table({"a": pa.array([], type=pa.int64())})
    body = arrow_table_to_ipc_bytes(src)
    got = parse_ipc_bytes(body)
    assert got.num_rows == 0
    assert got.schema.equals(src.schema)
  • Step 6.2: Run tests to verify failure

Run: pytest tests/test_v2_arrow.py -v Expected: FAIL — module doesn't exist.

  • Step 6.3: Implement Arrow helpers

Create app/api/v2_arrow.py:

"""Arrow IPC serialization helpers for /api/v2/scan responses.

Server side serializes a pyarrow.Table to IPC stream bytes; client side
deserializes back. Content-Type is `application/vnd.apache.arrow.stream`.
"""

from __future__ import annotations
import io
import pyarrow as pa


CONTENT_TYPE = "application/vnd.apache.arrow.stream"


def arrow_table_to_ipc_bytes(table: pa.Table) -> bytes:
    """Serialize a pyarrow.Table to Arrow IPC stream bytes."""
    sink = io.BytesIO()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        for batch in table.to_batches():
            writer.write_batch(batch)
    return sink.getvalue()


def parse_ipc_bytes(data: bytes) -> pa.Table:
    """Deserialize Arrow IPC stream bytes to a pyarrow.Table."""
    reader = pa.ipc.open_stream(io.BytesIO(data))
    return reader.read_all()
  • Step 6.4: Run tests to verify pass

Run: pytest tests/test_v2_arrow.py -v Expected: 2 passed.

  • Step 6.5: Commit
git add app/api/v2_arrow.py tests/test_v2_arrow.py
git commit -m "feat(v2): Arrow IPC serialization helpers"

Task 7: GET /api/v2/catalog

Spec §3.1. Lists tables visible to user (RBAC-filtered) with metadata.

Files:

  • Create: app/api/v2_catalog.py

  • Modify: app/main.py

  • Test: tests/test_v2_catalog.py

  • Step 7.1: Write failing tests

# tests/test_v2_catalog.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed_two_tables(conn):
    from src.repositories.table_registry import TableRegistryRepository
    repo = TableRegistryRepository(conn)
    repo.register(
        id="orders", name="orders", source_type="keboola",
        bucket="sales", source_table="orders", query_mode="local",
        is_public=True,
    )
    repo.register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestCatalogShape:
    def test_admin_sees_both_tables(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            ids = {t["id"] for t in data["tables"]}
            assert {"orders", "bq_view"} <= ids
        finally:
            conn.close()

    def test_local_table_has_duckdb_flavor(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            row = next(t for t in data["tables"] if t["id"] == "orders")
            assert row["sql_flavor"] == "duckdb"
            assert row["query_mode"] == "local"

    def test_bq_table_has_bigquery_flavor(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            row = next(t for t in data["tables"] if t["id"] == "bq_view")
            assert row["sql_flavor"] == "bigquery"
            assert row["query_mode"] == "remote"
            assert "where_examples" in row
            assert "fetch_via" in row
  • Step 7.2: Run tests to verify failure

Run: pytest tests/test_v2_catalog.py -v Expected: FAIL — module doesn't exist.

  • Step 7.3: Implement catalog endpoint

Create app/api/v2_catalog.py:

"""GET /api/v2/catalog — list tables visible to caller (spec §3.1)."""

from __future__ import annotations
from datetime import datetime, timezone
from fastapi import APIRouter, Depends
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

router = APIRouter(prefix="/api/v2", tags=["v2"])

_catalog_cache = TTLCache(maxsize=1024, ttl_seconds=300)  # per-user, 5 min


def _flavor_for(source_type: str) -> str:
    return "bigquery" if source_type == "bigquery" else "duckdb"


def _examples_for(source_type: str) -> list[str]:
    if source_type == "bigquery":
        return [
            "event_date > DATE '2026-01-01'",
            "country_code = 'CZ' AND platform = 'web'",
        ]
    return []


def _fetch_hint(table_id: str, source_type: str) -> str:
    if source_type == "bigquery":
        return f"da fetch {table_id} --select <cols> --where '<BQ predicate>' --limit <N>"
    return "already local — query directly via `da query`"


def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
    cache_key = f"{user.get('email', '?')}|catalog"
    cached = _catalog_cache.get(cache_key)
    if cached is not None:
        return cached

    repo = TableRegistryRepository(conn)
    rows = repo.list_all()

    visible = []
    for r in rows:
        if user.get("role") != "admin" and not can_access_table(user, r["id"], conn):
            continue
        visible.append({
            "id": r["id"],
            "name": r.get("name") or r["id"],
            "description": r.get("description") or "",
            "source_type": r.get("source_type") or "",
            "query_mode": r.get("query_mode") or "local",
            "sql_flavor": _flavor_for(r.get("source_type") or ""),
            "where_examples": _examples_for(r.get("source_type") or ""),
            "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
            "rough_size_hint": None,  # populated by Task 8 schema endpoint when called
        })

    payload = {
        "tables": visible,
        "server_time": datetime.now(timezone.utc).isoformat(),
    }
    _catalog_cache.set(cache_key, payload)
    return payload


@router.get("/catalog")
async def catalog(
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    return build_catalog(conn, user)
  • Step 7.4: Mount in app/main.py

Find the app.include_router(...) block (around line 279-287). Add:

    from app.api.v2_catalog import router as v2_catalog_router
    app.include_router(v2_catalog_router)
  • Step 7.5: Run tests to verify pass

Run: pytest tests/test_v2_catalog.py -v Expected: 3 passed.

  • Step 7.6: Commit
git add app/api/v2_catalog.py app/main.py tests/test_v2_catalog.py
git commit -m "feat(v2): GET /api/v2/catalog — RBAC-filtered table list"

Task 8: GET /api/v2/schema/{table_id}

Spec §3.2. Column metadata + BQ flavor hints.

Files:

  • Create: app/api/v2_schema.py

  • Modify: app/main.py

  • Test: tests/test_v2_schema.py

  • Step 8.1: Write failing tests

# tests/test_v2_schema.py
import importlib
from unittest.mock import patch, MagicMock
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed_bq_table(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestSchemaEndpoint:
    def test_bq_table_returns_columns_and_dialect_hints(self, reload_db, monkeypatch):
        from app.api import v2_schema
        # Stub the BQ schema fetch to avoid hitting real BQ
        monkeypatch.setattr(
            v2_schema, "_fetch_bq_schema",
            lambda project, dataset, table: [
                {"name": "event_date", "type": "DATE", "nullable": False, "description": ""},
                {"name": "country_code", "type": "STRING", "nullable": True, "description": ""},
            ],
        )
        monkeypatch.setattr(v2_schema, "_fetch_bq_table_options", lambda *a: {"partition_by": "event_date", "clustered_by": []})

        conn = reload_db.get_system_db()
        try:
            _seed_bq_table(conn)
            user = {"role": "admin", "email": "a@x.com"}
            data = v2_schema.build_schema(conn, user, "bq_view", project_id="my-proj")
        finally:
            conn.close()
        assert data["table_id"] == "bq_view"
        assert data["sql_flavor"] == "bigquery"
        assert {c["name"] for c in data["columns"]} == {"event_date", "country_code"}
        assert "where_dialect_hints" in data
        assert data["partition_by"] == "event_date"

    def test_unknown_table_raises_404(self, reload_db):
        from app.api.v2_schema import build_schema, NotFound
        conn = reload_db.get_system_db()
        try:
            user = {"role": "admin", "email": "a@x.com"}
            with pytest.raises(NotFound):
                build_schema(conn, user, "missing", project_id="my-proj")
        finally:
            conn.close()
  • Step 8.2: Run tests to verify failure

Run: pytest tests/test_v2_schema.py -v Expected: FAIL — module doesn't exist.

  • Step 8.3: Implement schema endpoint

Create app/api/v2_schema.py:

"""GET /api/v2/schema/{table_id} — table column metadata (spec §3.2)."""

from __future__ import annotations
import logging
from fastapi import APIRouter, Depends, HTTPException
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])

_schema_cache = TTLCache(maxsize=512, ttl_seconds=3600)


class NotFound(Exception):
    pass


_BQ_DIALECT_HINTS = {
    "date_literal": "DATE '2026-01-01'",
    "timestamp_literal": "TIMESTAMP '2026-01-01 00:00:00 UTC'",
    "interval_subtract": "DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
    "regex": "REGEXP_CONTAINS(field, r'pattern')",
    "cast": "CAST(x AS INT64)",
}


def _fetch_bq_schema(project: str, dataset: str, table: str) -> list[dict]:
    """Fetch column list via INFORMATION_SCHEMA.COLUMNS using DuckDB BQ extension."""
    import duckdb
    from connectors.bigquery.auth import get_metadata_token

    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        bq_sql = (
            f"SELECT column_name, data_type, is_nullable, description "
            f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
            f"WHERE table_name = ? ORDER BY ordinal_position"
        )
        rows = conn.execute(
            "SELECT * FROM bigquery_query(?, ?, ?)",
            [project, bq_sql, table],
        ).fetchall()
        return [
            {
                "name": r[0],
                "type": r[1],
                "nullable": r[2] == "YES",
                "description": r[3] or "",
            }
            for r in rows
        ]
    finally:
        conn.close()


def _fetch_bq_table_options(project: str, dataset: str, table: str) -> dict:
    """Best-effort fetch of partition/cluster info; returns empty dict on miss."""
    import duckdb
    from connectors.bigquery.auth import get_metadata_token

    try:
        token = get_metadata_token()
        conn = duckdb.connect(":memory:")
        try:
            conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
            escaped = token.replace("'", "''")
            conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
            bq_sql = (
                f"SELECT partition_column, cluster_columns "
                f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.TABLES` "
                f"WHERE table_name = ?"
            )
            row = conn.execute(
                "SELECT * FROM bigquery_query(?, ?, ?)",
                [project, bq_sql, table],
            ).fetchone()
            if not row:
                return {}
            return {
                "partition_by": row[0],
                "clustered_by": (row[1] or "").split(",") if row[1] else [],
            }
        finally:
            conn.close()
    except Exception as e:
        logger.warning("BQ table options fetch failed for %s.%s.%s: %s", project, dataset, table, e)
        return {}


def build_schema(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    table_id: str,
    *,
    project_id: str,
) -> dict:
    cache_key = f"{table_id}"
    cached = _schema_cache.get(cache_key)
    if cached is not None:
        return cached

    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise NotFound(table_id)

    if user.get("role") != "admin" and not can_access_table(user, table_id, conn):
        raise PermissionError(table_id)

    source_type = row.get("source_type") or ""
    if source_type == "bigquery":
        dataset = row.get("bucket") or ""
        source_table = row.get("source_table") or table_id
        columns = _fetch_bq_schema(project_id, dataset, source_table)
        opts = _fetch_bq_table_options(project_id, dataset, source_table)
        payload = {
            "table_id": table_id,
            "source_type": source_type,
            "sql_flavor": "bigquery",
            "columns": columns,
            "partition_by": opts.get("partition_by"),
            "clustered_by": opts.get("clustered_by", []),
            "where_dialect_hints": _BQ_DIALECT_HINTS,
        }
    else:
        # Local source — read schema from the parquet via DuckDB
        from pathlib import Path
        from app.utils import get_data_dir
        parquet = (
            get_data_dir() / "extracts" / source_type / "data" / f"{table_id}.parquet"
        )
        local_conn = duckdb.connect(":memory:")
        try:
            cols = local_conn.execute(
                "DESCRIBE SELECT * FROM read_parquet(?)", [str(parquet)]
            ).fetchall()
        finally:
            local_conn.close()
        payload = {
            "table_id": table_id,
            "source_type": source_type,
            "sql_flavor": "duckdb",
            "columns": [
                {"name": c[0], "type": c[1], "nullable": c[2] == "YES", "description": ""}
                for c in cols
            ],
            "partition_by": None,
            "clustered_by": [],
            "where_dialect_hints": {},
        }

    _schema_cache.set(cache_key, payload)
    return payload


@router.get("/schema/{table_id}")
async def schema(
    table_id: str,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return build_schema(conn, user, table_id, project_id=project_id)
    except NotFound:
        raise HTTPException(status_code=404, detail=f"table {table_id!r} not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized for this table")
  • Step 8.4: Mount in app/main.py

Add alongside catalog router:

    from app.api.v2_schema import router as v2_schema_router
    app.include_router(v2_schema_router)
  • Step 8.5: Run tests to verify pass

Run: pytest tests/test_v2_schema.py -v Expected: 2 passed.

  • Step 8.6: Commit
git add app/api/v2_schema.py app/main.py tests/test_v2_schema.py
git commit -m "feat(v2): GET /api/v2/schema/{table_id} — column metadata + BQ hints"

Task 9: GET /api/v2/sample/{table_id}

Spec §3.3. Returns N sample rows.

Files:

  • Create: app/api/v2_sample.py

  • Modify: app/main.py

  • Test: tests/test_v2_sample.py

  • Step 9.1: Write failing tests

# tests/test_v2_sample.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestSampleEndpoint:
    def test_returns_n_rows_for_bq_table(self, reload_db, monkeypatch):
        from app.api import v2_sample
        monkeypatch.setattr(
            v2_sample, "_fetch_bq_sample",
            lambda project, dataset, table, n: [
                {"event_date": "2026-04-27", "country_code": "CZ"},
                {"event_date": "2026-04-26", "country_code": "SK"},
            ],
        )
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            data = v2_sample.build_sample(conn, user, "bq_view", n=2, project_id="proj")
        finally:
            conn.close()
        assert data["table_id"] == "bq_view"
        assert len(data["rows"]) == 2

    def test_caps_n_at_100(self, reload_db, monkeypatch):
        from app.api import v2_sample
        captured = {}
        def fake_fetch(project, dataset, table, n):
            captured["n"] = n
            return []
        monkeypatch.setattr(v2_sample, "_fetch_bq_sample", fake_fetch)
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            v2_sample.build_sample(conn, user, "bq_view", n=999, project_id="proj")
        finally:
            conn.close()
        assert captured["n"] == 100
  • Step 9.2: Run tests to verify failure

Run: pytest tests/test_v2_sample.py -v Expected: FAIL — module doesn't exist.

  • Step 9.3: Implement sample endpoint

Create app/api/v2_sample.py:

"""GET /api/v2/sample/{table_id}?n=5 — sample rows (spec §3.3)."""

from __future__ import annotations
import logging
from fastapi import APIRouter, Depends, HTTPException, Query
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])

_sample_cache = TTLCache(maxsize=512, ttl_seconds=3600)
_MAX_N = 100


def _fetch_bq_sample(project: str, dataset: str, table: str, n: int) -> list[dict]:
    import duckdb
    from connectors.bigquery.auth import get_metadata_token

    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        bq_sql = f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(n)}"
        df = conn.execute(
            "SELECT * FROM bigquery_query(?, ?)",
            [project, bq_sql],
        ).fetchdf()
        return df.to_dict(orient="records")
    finally:
        conn.close()


def build_sample(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    table_id: str,
    *,
    n: int,
    project_id: str,
) -> dict:
    n = max(1, min(int(n), _MAX_N))
    cache_key = f"{table_id}|{n}"
    cached = _sample_cache.get(cache_key)
    if cached is not None:
        return cached

    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise FileNotFoundError(table_id)

    if user.get("role") != "admin" and not can_access_table(user, table_id, conn):
        raise PermissionError(table_id)

    source_type = row.get("source_type") or ""
    if source_type == "bigquery":
        rows = _fetch_bq_sample(project_id, row.get("bucket") or "", row.get("source_table") or table_id, n)
    else:
        from app.utils import get_data_dir
        parquet = get_data_dir() / "extracts" / source_type / "data" / f"{table_id}.parquet"
        c = duckdb.connect(":memory:")
        try:
            df = c.execute(
                f"SELECT * FROM read_parquet(?) LIMIT {n}",
                [str(parquet)],
            ).fetchdf()
            rows = df.to_dict(orient="records")
        finally:
            c.close()

    payload = {"table_id": table_id, "rows": rows, "source": source_type}
    _sample_cache.set(cache_key, payload)
    return payload


@router.get("/sample/{table_id}")
async def sample(
    table_id: str,
    n: int = Query(default=5, ge=1, le=_MAX_N),
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return build_sample(conn, user, table_id, n=n, project_id=project_id)
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail=f"table {table_id!r} not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized for this table")
  • Step 9.4: Mount + run tests

Add from app.api.v2_sample import router as v2_sample_router; app.include_router(v2_sample_router) in app/main.py. Run: pytest tests/test_v2_sample.py -v. Expected: 2 passed.

  • Step 9.5: Commit
git add app/api/v2_sample.py app/main.py tests/test_v2_sample.py
git commit -m "feat(v2): GET /api/v2/sample/{table_id} — N sample rows"

Task 10: POST /api/v2/scan/estimate

Spec §3.5. BQ dryRun for cost estimate.

Files:

  • Create: app/api/v2_scan.py (will be extended in Task 11 for /scan proper)

  • Modify: app/main.py

  • Test: tests/test_v2_scan_estimate.py

  • Step 10.1: Write failing test

# tests/test_v2_scan_estimate.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestScanEstimate:
    def test_returns_scan_bytes_for_bq(self, reload_db, monkeypatch):
        from app.api import v2_scan
        monkeypatch.setattr(
            v2_scan, "_bq_dry_run_bytes",
            lambda project, sql: 4_400_000_000,
        )
        # Stub the schema fetch the validator uses
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
        )

        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "select": ["event_date", "country_code"],
                "where": "event_date > DATE '2026-01-01'",
                "limit": 1000000,
            }
            data = v2_scan.estimate(conn, user, req, project_id="proj")
        finally:
            conn.close()
        assert data["estimated_scan_bytes"] == 4_400_000_000
        assert "estimated_result_rows" in data
        assert "bq_cost_estimate_usd" in data
  • Step 10.2: Run test to verify failure

Run: pytest tests/test_v2_scan_estimate.py -v Expected: FAIL — module doesn't exist.

  • Step 10.3: Implement estimate endpoint

Create app/api/v2_scan.py:

"""POST /api/v2/scan and POST /api/v2/scan/estimate (spec §3.4 + §3.5)."""

from __future__ import annotations
import logging
import time
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel, Field
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.where_validator import (
    validate_where, WhereValidationError,
)
from app.api.v2_schema import build_schema  # reused for column resolution

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])


class ScanRequest(BaseModel):
    table_id: str
    select: Optional[list[str]] = None
    where: Optional[str] = None
    limit: Optional[int] = Field(default=None, ge=1)
    order_by: Optional[list[str]] = None


def _resolve_schema(conn, user, table_id: str, project_id: str) -> dict:
    """Get {column: type} dict for the target table — used by validator + projection check."""
    s = build_schema(conn, user, table_id, project_id=project_id)
    return {c["name"]: c["type"] for c in s.get("columns", [])}


def _bq_dry_run_bytes(project: str, sql: str) -> int:
    """Run a BQ dry-run via the google-cloud-bigquery client and return totalBytesProcessed."""
    from google.cloud import bigquery
    from google.api_core.client_options import ClientOptions
    client = bigquery.Client(
        project=project,
        client_options=ClientOptions(quota_project_id=project),
    )
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    return int(job.total_bytes_processed or 0)


def _build_bq_sql(table_row: dict, project_id: str, req: ScanRequest) -> str:
    select_sql = ", ".join(req.select) if req.select else "*"
    table_ref = f"`{project_id}.{table_row.get('bucket') or ''}.{table_row.get('source_table') or req.table_id}`"
    sql = f"SELECT {select_sql} FROM {table_ref}"
    if req.where:
        sql += f" WHERE {req.where}"
    if req.order_by:
        sql += f" ORDER BY {', '.join(req.order_by)}"
    if req.limit:
        sql += f" LIMIT {int(req.limit)}"
    return sql


def estimate(conn, user, raw_request: dict, *, project_id: str) -> dict:
    req = ScanRequest(**raw_request)
    repo = TableRegistryRepository(conn)
    row = repo.get(req.table_id)
    if not row:
        raise FileNotFoundError(req.table_id)
    if user.get("role") != "admin" and not can_access_table(user, req.table_id, conn):
        raise PermissionError(req.table_id)

    schema = _resolve_schema(conn, user, req.table_id, project_id)

    # Validate WHERE first
    if req.where:
        validate_where(req.where, req.table_id, schema)
    # Validate select columns exist
    if req.select:
        unknown = [c for c in req.select if c not in schema]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")

    if (row.get("source_type") or "") != "bigquery":
        return {
            "table_id": req.table_id,
            "estimated_scan_bytes": 0,
            "estimated_result_rows": None,
            "estimated_result_bytes": None,
            "bq_cost_estimate_usd": 0.0,
        }

    bq_sql = _build_bq_sql(row, project_id, req)
    scan_bytes = _bq_dry_run_bytes(project_id, bq_sql)

    cost_per_tb = float(get_value("api", "scan", "bq_cost_per_tb_usd", default=5.0) or 5.0)
    cost = (scan_bytes / 1_099_511_627_776) * cost_per_tb  # 1 TiB = 2^40

    # Heuristic for result row/byte estimate
    avg_row_bytes = max(1, sum(_avg_bytes_for_type(t) for t in schema.values()) // max(1, len(schema)))
    rows_est = scan_bytes // max(avg_row_bytes, 1)
    if req.limit:
        rows_est = min(rows_est, req.limit)

    return {
        "table_id": req.table_id,
        "estimated_scan_bytes": int(scan_bytes),
        "estimated_result_rows": int(rows_est),
        "estimated_result_bytes": int(rows_est * avg_row_bytes),
        "bq_cost_estimate_usd": round(cost, 4),
    }


def _avg_bytes_for_type(t: str) -> int:
    t = (t or "").upper()
    if t in ("INT64", "FLOAT64", "DATE", "TIMESTAMP", "DATETIME", "TIME"):
        return 8
    if t == "STRING":
        return 32  # rough average
    if t == "BYTES":
        return 64
    if t == "BOOL":
        return 1
    return 16


@router.post("/scan/estimate")
async def scan_estimate_endpoint(
    raw: dict,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return estimate(conn, user, raw, project_id=project_id)
    except WhereValidationError as e:
        raise HTTPException(status_code=400, detail={"error": "validator_rejected", "kind": e.kind, "details": e.detail or {}})
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized")
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail="table not found")
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
  • Step 10.4: Mount + run test

Add from app.api.v2_scan import router as v2_scan_router; app.include_router(v2_scan_router) in app/main.py.

Run: pytest tests/test_v2_scan_estimate.py -v. Expected: 1 passed.

  • Step 10.5: Commit
git add app/api/v2_scan.py app/main.py tests/test_v2_scan_estimate.py
git commit -m "feat(v2): POST /api/v2/scan/estimate via BQ dryRun"

Task 11: POST /api/v2/scan — full pipeline

Spec §3.4 — combine validator + RBAC + quota + max_result_bytes + Arrow IPC streaming.

Files:

  • Modify: app/api/v2_scan.py (extend with /scan endpoint)

  • Test: tests/test_v2_scan.py

  • Step 11.1: Write failing tests

# tests/test_v2_scan.py
import importlib
from unittest.mock import MagicMock
import pyarrow as pa
import pytest

from app.api.v2_arrow import parse_ipc_bytes


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestScan:
    def test_returns_arrow_ipc_for_simple_request(self, reload_db, monkeypatch):
        from app.api import v2_scan
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
        )
        fake_table = pa.table(
            {"event_date": ["2026-04-27"], "country_code": ["CZ"]}
        )
        monkeypatch.setattr(
            v2_scan, "_run_bq_scan",
            lambda *a, **kw: fake_table,
        )
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "select": ["event_date", "country_code"],
                "where": "event_date > DATE '2026-01-01'",
                "limit": 100,
            }
            tracker = v2_scan._build_quota_tracker()
            ipc_bytes = v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
        finally:
            conn.close()
        got = parse_ipc_bytes(ipc_bytes)
        assert got.num_rows == 1
        assert got.column_names == ["event_date", "country_code"]

    def test_quota_concurrent_exceeded_raises_429(self, reload_db, monkeypatch):
        from app.api import v2_scan
        from app.api.v2_quota import QuotaTracker, QuotaExceededError, KIND_CONCURRENT
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE"},
        )
        fake_table = pa.table({"event_date": ["2026-04-27"]})
        monkeypatch.setattr(v2_scan, "_run_bq_scan", lambda *a, **kw: fake_table)

        tracker = QuotaTracker(max_concurrent_per_user=1, max_daily_bytes_per_user=10**12)
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {"table_id": "bq_view", "select": ["event_date"], "limit": 1}

            # Hold one concurrent slot
            with tracker.acquire(user="a@x.com"):
                with pytest.raises(QuotaExceededError) as e:
                    v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
                assert e.value.kind == KIND_CONCURRENT
        finally:
            conn.close()

    def test_validator_rejection_propagates(self, reload_db, monkeypatch):
        from app.api import v2_scan
        from app.api.where_validator import WhereValidationError, REJECT_UNKNOWN_FUNCTION
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE"},
        )

        tracker = v2_scan._build_quota_tracker()
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "where": "event_date = NUKE_FN()",
            }
            with pytest.raises(WhereValidationError) as e:
                v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
            assert e.value.kind == REJECT_UNKNOWN_FUNCTION
        finally:
            conn.close()
  • Step 11.2: Run tests to verify failure

Run: pytest tests/test_v2_scan.py -v Expected: FAIL — run_scan and _run_bq_scan don't exist yet.

  • Step 11.3: Extend app/api/v2_scan.py

Append to the file:

import io
import pyarrow as pa
from app.api.v2_arrow import arrow_table_to_ipc_bytes, CONTENT_TYPE
from app.api.v2_quota import QuotaTracker, QuotaExceededError
from fastapi.responses import Response

# Module-level singleton (process-local quota state per spec §3.8)
_quota_singleton: QuotaTracker | None = None


def _build_quota_tracker() -> QuotaTracker:
    """Returns or constructs the process-local quota tracker."""
    global _quota_singleton
    if _quota_singleton is None:
        _quota_singleton = QuotaTracker(
            max_concurrent_per_user=int(get_value("api", "scan", "max_concurrent_per_user", default=5) or 5),
            max_daily_bytes_per_user=int(get_value("api", "scan", "max_daily_bytes_per_user", default=53687091200) or 53687091200),
        )
    return _quota_singleton


def _max_result_bytes() -> int:
    return int(get_value("api", "scan", "max_result_bytes", default=2_147_483_648) or 2_147_483_648)


def _max_limit() -> int:
    return int(get_value("api", "scan", "max_limit", default=10_000_000) or 10_000_000)


def _run_bq_scan(project: str, sql: str) -> pa.Table:
    """Execute SQL via DuckDB BQ extension, return pyarrow Table."""
    import duckdb
    from connectors.bigquery.auth import get_metadata_token

    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        # Use bigquery_query() since the SQL is already authored against the BQ jobs API
        return conn.execute(
            "SELECT * FROM bigquery_query(?, ?)",
            [project, sql],
        ).arrow()
    finally:
        conn.close()


def run_scan(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    raw_request: dict,
    *,
    project_id: str,
    quota: QuotaTracker,
) -> bytes:
    """Validate → quota → execute → serialize. Returns Arrow IPC bytes.

    Raises:
        WhereValidationError, QuotaExceededError, FileNotFoundError, PermissionError, ValueError
    """
    req = ScanRequest(**raw_request)
    repo = TableRegistryRepository(conn)
    row = repo.get(req.table_id)
    if not row:
        raise FileNotFoundError(req.table_id)
    if user.get("role") != "admin" and not can_access_table(user, req.table_id, conn):
        raise PermissionError(req.table_id)

    if req.limit and req.limit > _max_limit():
        raise ValueError(f"limit {req.limit} exceeds max {_max_limit()}")

    schema = _resolve_schema(conn, user, req.table_id, project_id)
    if req.where:
        validate_where(req.where, req.table_id, schema)
    if req.select:
        unknown = [c for c in req.select if c not in schema]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")

    user_id = user.get("email") or "anon"

    with quota.acquire(user=user_id):
        if (row.get("source_type") or "") != "bigquery":
            # Local source: query parquet directly
            from app.utils import get_data_dir
            parquet = (
                get_data_dir() / "extracts" / row["source_type"] / "data" / f"{req.table_id}.parquet"
            )
            local = duckdb.connect(":memory:")
            try:
                projection = ", ".join(req.select) if req.select else "*"
                sql = f"SELECT {projection} FROM read_parquet(?)"
                if req.where:
                    sql += f" WHERE {req.where}"
                if req.order_by:
                    sql += f" ORDER BY {', '.join(req.order_by)}"
                if req.limit:
                    sql += f" LIMIT {int(req.limit)}"
                table = local.execute(sql, [str(parquet)]).arrow()
            finally:
                local.close()
        else:
            bq_sql = _build_bq_sql(row, project_id, req)
            table = _run_bq_scan(project_id, bq_sql)

        ipc = arrow_table_to_ipc_bytes(table)

        # Enforce max_result_bytes guard (spec §3.4 step 8)
        if len(ipc) > _max_result_bytes():
            # Truncate by taking only as many rows as fit roughly
            # Simple heuristic: cap rows to estimated avg per max_bytes
            row_count = table.num_rows
            avg = max(1, len(ipc) // max(row_count, 1))
            keep = min(row_count, _max_result_bytes() // max(avg, 1))
            table = table.slice(0, keep)
            ipc = arrow_table_to_ipc_bytes(table)

        # Record bytes for daily quota
        quota.record_bytes(user=user_id, n=len(ipc))
        return ipc


@router.post("/scan")
async def scan_endpoint(
    raw: dict,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    quota = _build_quota_tracker()
    try:
        ipc = run_scan(conn, user, raw, project_id=project_id, quota=quota)
        return Response(content=ipc, media_type=CONTENT_TYPE)
    except WhereValidationError as e:
        raise HTTPException(
            status_code=400,
            detail={"error": "validator_rejected", "kind": e.kind, "details": e.detail or {}},
        )
    except QuotaExceededError as e:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "quota_exceeded",
                "kind": e.kind,
                "current": e.current,
                "limit": e.limit,
                "retry_after_seconds": e.retry_after_seconds,
            },
        )
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail="table not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized")
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
  • Step 11.4: Run tests to verify pass

Run: pytest tests/test_v2_scan.py -v Expected: 3 passed.

  • Step 11.5: Commit
git add app/api/v2_scan.py tests/test_v2_scan.py
git commit -m "feat(v2): POST /api/v2/scan — validator + quota + Arrow IPC pipeline"

Task 12: Drop wrap-view code path with legacy_wrap_views toggle

Spec §6.1. The wrap view in connectors/bigquery/extractor.py for VIEW entities is the source of #101 problem.

Files:

  • Modify: connectors/bigquery/extractor.py

  • Modify: tests/test_bigquery_extractor.py

  • Step 12.1: Write failing test for the new behavior

Append to tests/test_bigquery_extractor.py:

class TestDropWrapViewForBQViews:
    def test_view_entity_does_not_create_master_view_by_default(self, tmp_path, monkeypatch):
        from connectors.bigquery.extractor import init_extract
        monkeypatch.setattr("connectors.bigquery.extractor.get_metadata_token", lambda: "tok")
        monkeypatch.setattr("connectors.bigquery.extractor._detect_table_type", lambda *a, **kw: "VIEW")

        # Stub BQ extension calls to avoid hitting real BQ
        real_connect = duckdb.connect

        def safe_connect(*a, **kw):
            return _CapturingProxy(real_connect(*a, **kw))
        monkeypatch.setattr("connectors.bigquery.extractor.duckdb.connect", safe_connect)

        # legacy toggle is OFF by default → expect no CREATE VIEW for the BQ view
        monkeypatch.setattr(
            "connectors.bigquery.extractor.get_value",
            lambda *args, default=None, **kw: False if "legacy_wrap_views" in args else default,
            raising=False,
        )

        result = init_extract(
            str(tmp_path),
            "my-project",
            [{"name": "myview", "bucket": "ds", "source_table": "myview", "description": ""}],
        )

        # Confirm extract.duckdb has _meta + _remote_attach but NO master view for myview
        c = duckdb.connect(str(tmp_path / "extract.duckdb"), read_only=True)
        try:
            views = c.execute(
                "SELECT view_name FROM duckdb_views() WHERE view_name='myview'"
            ).fetchall()
            assert views == [], f"expected no wrap view for VIEW entity by default; got {views}"
            meta = c.execute("SELECT table_name FROM _meta").fetchall()
            assert ("myview",) in meta, "_meta must still record the view"
        finally:
            c.close()

    def test_legacy_wrap_views_toggle_restores_old_behavior(self, tmp_path, monkeypatch):
        from connectors.bigquery.extractor import init_extract
        monkeypatch.setattr("connectors.bigquery.extractor.get_metadata_token", lambda: "tok")
        monkeypatch.setattr("connectors.bigquery.extractor._detect_table_type", lambda *a, **kw: "VIEW")

        real_connect = duckdb.connect
        def safe_connect(*a, **kw):
            return _CapturingProxy(real_connect(*a, **kw))
        monkeypatch.setattr("connectors.bigquery.extractor.duckdb.connect", safe_connect)

        # legacy toggle ON → should still create the wrap view
        monkeypatch.setattr(
            "connectors.bigquery.extractor.get_value",
            lambda *args, default=None, **kw: True if "legacy_wrap_views" in args else default,
            raising=False,
        )

        init_extract(
            str(tmp_path),
            "my-project",
            [{"name": "myview", "bucket": "ds", "source_table": "myview", "description": ""}],
        )

        c = duckdb.connect(str(tmp_path / "extract.duckdb"), read_only=True)
        try:
            views = c.execute(
                "SELECT view_name FROM duckdb_views() WHERE view_name='myview'"
            ).fetchall()
            assert views == [("myview",)]
        finally:
            c.close()
  • Step 12.2: Run tests to verify failure

Run: pytest tests/test_bigquery_extractor.py::TestDropWrapViewForBQViews -v Expected: FAIL — current code always emits the wrap view for VIEW entities.

  • Step 12.3: Modify connectors/bigquery/extractor.py

Find the section in init_extract that emits the wrap view for VIEW entities. Replace the dual-path branch:

# OLD:
if entity_type == "BASE TABLE":
    view_sql = (...)  # direct ref
else:
    if entity_type not in ("VIEW", "MATERIALIZED_VIEW"):
        logger.warning(...)
    bq_inner = ...
    view_sql = (...)  # bigquery_query() wrap

conn.execute(view_sql)

With:

# NEW: only emit wrap view for BASE TABLE; for VIEW types, just record in _meta.
from app.instance_config import get_value as _get_value
legacy_wrap_views = bool(_get_value("data_source", "bigquery", "legacy_wrap_views", default=False))

if entity_type == "BASE TABLE":
    view_sql = (
        f'CREATE OR REPLACE VIEW "{table_name}" AS '
        f'SELECT * FROM bq."{dataset}"."{source_table}"'
    )
    conn.execute(view_sql)
elif legacy_wrap_views:
    # Backwards compatibility — for one release cycle only.
    if entity_type not in ("VIEW", "MATERIALIZED_VIEW"):
        logger.warning(
            "Unknown BQ entity type %r for %s.%s.%s — using bigquery_query() path",
            entity_type, project_id, dataset, source_table,
        )
    bq_inner = f"SELECT * FROM `{project_id}.{dataset}.{source_table}`"
    bq_inner_escaped = bq_inner.replace("'", "''")
    view_sql = (
        f'CREATE OR REPLACE VIEW "{table_name}" AS '
        f"SELECT * FROM bigquery_query('{project_id}', '{bq_inner_escaped}')"
    )
    conn.execute(view_sql)
else:
    # Default: VIEW / MATERIALIZED_VIEW are recorded in _meta but no master view created.
    # Analyst must use `da fetch` (v2 primitives) to materialize a snapshot locally.
    logger.info(
        "Skipping wrap view for %s entity %s.%s.%s — use `da fetch`",
        entity_type, project_id, dataset, source_table,
    )

# _meta entry is recorded in ALL branches (existing code below stays as-is)
conn.execute(
    "INSERT INTO _meta VALUES (?, ?, 0, 0, ?, 'remote')",
    [table_name, tc.get("description", ""), now],
)
  • Step 12.4: Run tests to verify pass

Run: pytest tests/test_bigquery_extractor.py -v Expected: ALL pass — including the 2 new TestDropWrapViewForBQViews tests + existing tests (some may need updating; if so, update them to reflect that VIEW now skips the master view by default).

If existing TestViewVsTableTemplates::test_view_uses_bigquery_query_function breaks, update it to enable the legacy toggle in its monkeypatch (per pattern in Step 12.1's second test).

  • Step 12.5: Commit
git add connectors/bigquery/extractor.py tests/test_bigquery_extractor.py
git commit -m "feat(bq): drop wrap-view for VIEW entities by default; legacy toggle behind flag"

Task 13: Arrow over HTTP client + JSON helpers

Client-side counterpart to v2 endpoints. Used by all da commands that talk to v2 API.

Files:

  • Create: cli/v2_client.py

  • Test: tests/test_v2_client.py

  • Step 13.1: Write failing tests

# tests/test_v2_client.py
import json
import pyarrow as pa
import pytest
from unittest.mock import MagicMock, patch

from cli.v2_client import (
    api_get_json,
    api_post_arrow,
    api_post_json,
    V2ClientError,
)


def _fake_response(*, status=200, json_body=None, arrow_body=None, content_type=None):
    resp = MagicMock()
    resp.status_code = status
    if json_body is not None:
        resp.json.return_value = json_body
        resp.text = json.dumps(json_body)
        resp.content = resp.text.encode()
    if arrow_body is not None:
        resp.content = arrow_body
    if content_type:
        resp.headers = {"content-type": content_type}
    else:
        resp.headers = {}
    return resp


class TestApiGetJson:
    def test_200_returns_parsed_json(self):
        with patch("cli.v2_client.requests.get") as m:
            m.return_value = _fake_response(json_body={"hello": "world"})
            assert api_get_json("/api/v2/catalog") == {"hello": "world"}

    def test_4xx_raises_v2clienterror(self):
        with patch("cli.v2_client.requests.get") as m:
            m.return_value = _fake_response(status=403, json_body={"detail": "nope"})
            with pytest.raises(V2ClientError) as e:
                api_get_json("/api/v2/catalog")
            assert e.value.status_code == 403


class TestApiPostArrow:
    def test_returns_arrow_table(self):
        from app.api.v2_arrow import arrow_table_to_ipc_bytes
        ipc = arrow_table_to_ipc_bytes(pa.table({"x": [1, 2, 3]}))
        with patch("cli.v2_client.requests.post") as m:
            m.return_value = _fake_response(
                arrow_body=ipc,
                content_type="application/vnd.apache.arrow.stream",
            )
            got = api_post_arrow("/api/v2/scan", {"table_id": "x"})
        assert got.num_rows == 3
        assert got.column_names == ["x"]
  • Step 13.2: Run tests to verify failure

Run: pytest tests/test_v2_client.py -v Expected: FAIL — module doesn't exist.

  • Step 13.3: Implement v2 client

Create cli/v2_client.py:

"""HTTP client helpers for /api/v2/* endpoints (CLI side)."""

from __future__ import annotations
from dataclasses import dataclass
from typing import Any
import io

import requests
import pyarrow as pa

from cli.config import get_server_url, get_pat


@dataclass
class V2ClientError(Exception):
    status_code: int
    body: Any
    message: str = ""

    def __str__(self) -> str:
        return f"HTTP {self.status_code}: {self.message or self.body}"


def _headers() -> dict:
    pat = get_pat()
    return {"Authorization": f"Bearer {pat}"} if pat else {}


def api_get_json(path: str, **params) -> dict:
    url = f"{get_server_url().rstrip('/')}{path}"
    r = requests.get(url, headers=_headers(), params=params or None, timeout=30)
    if r.status_code >= 400:
        body = r.json() if "json" in r.headers.get("content-type", "") else r.text
        raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200])
    return r.json()


def api_post_json(path: str, payload: dict) -> dict:
    url = f"{get_server_url().rstrip('/')}{path}"
    r = requests.post(url, json=payload, headers=_headers(), timeout=120)
    if r.status_code >= 400:
        body = r.json() if "json" in r.headers.get("content-type", "") else r.text
        raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200])
    return r.json()


def api_post_arrow(path: str, payload: dict) -> pa.Table:
    """Post JSON, expect Arrow IPC stream response."""
    url = f"{get_server_url().rstrip('/')}{path}"
    r = requests.post(url, json=payload, headers=_headers(), timeout=600)
    if r.status_code >= 400:
        body = r.json() if "json" in r.headers.get("content-type", "") else r.text
        raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200])
    reader = pa.ipc.open_stream(io.BytesIO(r.content))
    return reader.read_all()

If cli/config.py lacks get_server_url / get_pat, those helpers exist already under different names — adapt to whatever the existing CLI uses (api_get in cli/client.py is the existing helper; mirror its config-loading pattern).

  • Step 13.4: Run tests to verify pass

Run: pytest tests/test_v2_client.py -v Expected: 3 passed.

  • Step 13.5: Commit
git add cli/v2_client.py tests/test_v2_client.py
git commit -m "feat(cli): v2 HTTP client (JSON + Arrow IPC)"

Task 14: Snapshot metadata I/O + flock helper

Backing for da fetch and da snapshot *. Spec §4.2.

Files:

  • Create: cli/snapshot_meta.py

  • Test: tests/test_snapshot_meta.py

  • Step 14.1: Write failing tests

# tests/test_snapshot_meta.py
import json
import pytest
from pathlib import Path

from cli.snapshot_meta import (
    SnapshotMeta,
    write_meta,
    read_meta,
    list_snapshots,
    snapshot_lock,
)


@pytest.fixture
def snap_dir(tmp_path):
    d = tmp_path / "snapshots"
    d.mkdir()
    return d


class TestMetaIO:
    def test_round_trip(self, snap_dir):
        meta = SnapshotMeta(
            name="cz_recent", table_id="bq_view",
            select=["a", "b"], where="a > 1", limit=100, order_by=None,
            fetched_at="2026-04-27T17:30:00Z",
            effective_as_of="2026-04-27T17:30:00Z",
            rows=10, bytes_local=1024,
            estimated_scan_bytes_at_fetch=5_000_000,
            result_hash_md5="abc",
        )
        write_meta(snap_dir, meta)
        got = read_meta(snap_dir, "cz_recent")
        assert got == meta

    def test_read_missing_returns_none(self, snap_dir):
        assert read_meta(snap_dir, "missing") is None

    def test_list_snapshots_empty(self, snap_dir):
        assert list_snapshots(snap_dir) == []

    def test_list_snapshots_with_data(self, snap_dir):
        for name in ("a", "b", "c"):
            (snap_dir / f"{name}.parquet").write_bytes(b"PAR1\\x00\\x00PAR1")
            write_meta(snap_dir, SnapshotMeta(
                name=name, table_id="t", select=None, where=None, limit=None, order_by=None,
                fetched_at="t", effective_as_of="t", rows=0, bytes_local=10,
                estimated_scan_bytes_at_fetch=0, result_hash_md5="",
            ))
        names = sorted(s.name for s in list_snapshots(snap_dir))
        assert names == ["a", "b", "c"]


class TestSnapshotLock:
    def test_lock_is_exclusive(self, snap_dir, tmp_path):
        """Two processes can't both hold the lock at once."""
        import threading, time
        held_at = []
        def worker(label, hold_seconds):
            with snapshot_lock(snap_dir):
                held_at.append((label, time.time()))
                time.sleep(hold_seconds)
                held_at.append((f"{label}-done", time.time()))

        t1 = threading.Thread(target=worker, args=("A", 0.2))
        t2 = threading.Thread(target=worker, args=("B", 0.2))
        t1.start(); time.sleep(0.05); t2.start()
        t1.join(); t2.join()
        # A acquired, A-done, B acquired, B-done — never interleaved
        labels = [x[0] for x in held_at]
        assert labels in (
            ["A", "A-done", "B", "B-done"],
            ["B", "B-done", "A", "A-done"],
        ), f"expected serialized acquisition; got {labels}"
  • Step 14.2: Run tests to verify failure

Run: pytest tests/test_snapshot_meta.py -v Expected: FAIL — module doesn't exist.

  • Step 14.3: Implement snapshot meta + lock

Create cli/snapshot_meta.py:

"""Snapshot sidecar metadata + file lock helpers (spec §4.2)."""

from __future__ import annotations
import contextlib
import fcntl
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Optional


@dataclass
class SnapshotMeta:
    name: str
    table_id: str
    select: Optional[list[str]]
    where: Optional[str]
    limit: Optional[int]
    order_by: Optional[list[str]]
    fetched_at: str               # ISO 8601 UTC
    effective_as_of: str          # ISO 8601 UTC, server-side eval time
    rows: int
    bytes_local: int
    estimated_scan_bytes_at_fetch: int
    result_hash_md5: str


def _meta_path(snap_dir: Path, name: str) -> Path:
    return snap_dir / f"{name}.meta.json"


def write_meta(snap_dir: Path, meta: SnapshotMeta) -> None:
    snap_dir.mkdir(parents=True, exist_ok=True)
    with _meta_path(snap_dir, meta.name).open("w") as f:
        json.dump(asdict(meta), f, indent=2)


def read_meta(snap_dir: Path, name: str) -> Optional[SnapshotMeta]:
    p = _meta_path(snap_dir, name)
    if not p.exists():
        return None
    data = json.loads(p.read_text())
    return SnapshotMeta(**data)


def list_snapshots(snap_dir: Path) -> list[SnapshotMeta]:
    if not snap_dir.exists():
        return []
    out = []
    for meta_file in snap_dir.glob("*.meta.json"):
        try:
            data = json.loads(meta_file.read_text())
            out.append(SnapshotMeta(**data))
        except (json.JSONDecodeError, TypeError):
            continue
    return out


def delete_snapshot(snap_dir: Path, name: str) -> bool:
    """Delete the snapshot's parquet + meta. Returns True if removed, False if missing."""
    parquet = snap_dir / f"{name}.parquet"
    meta = _meta_path(snap_dir, name)
    removed = False
    if parquet.exists():
        parquet.unlink(); removed = True
    if meta.exists():
        meta.unlink(); removed = True
    return removed


@contextlib.contextmanager
def snapshot_lock(snap_dir: Path):
    """Exclusive flock on snap_dir/.lock — serializes snapshot installs.

    Concurrent `da fetch` invocations queue here.
    """
    snap_dir.mkdir(parents=True, exist_ok=True)
    lock_file = snap_dir / ".lock"
    lock_file.touch(exist_ok=True)
    fd = open(lock_file, "r+")
    try:
        fcntl.flock(fd.fileno(), fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd.fileno(), fcntl.LOCK_UN)
        fd.close()
  • Step 14.4: Run tests to verify pass

Run: pytest tests/test_snapshot_meta.py -v Expected: 5 passed.

  • Step 14.5: Commit
git add cli/snapshot_meta.py tests/test_snapshot_meta.py
git commit -m "feat(cli): snapshot metadata sidecar + flock helper"

Task 15: da catalog / da schema / da describe

Spec §4.1. Discovery commands.

Files:

  • Create: cli/commands/catalog.py, cli/commands/schema.py, cli/commands/describe.py

  • Modify: cli/main.py

  • Test: tests/test_cli_catalog.py

  • Step 15.1: Write failing tests

# tests/test_cli_catalog.py
import json
from typer.testing import CliRunner
from unittest.mock import patch
import pytest


def test_da_catalog_json_output(monkeypatch):
    """`da catalog --json` emits the server's JSON verbatim."""
    payload = {
        "tables": [
            {"id": "orders", "name": "orders", "source_type": "keboola",
             "query_mode": "local", "sql_flavor": "duckdb",
             "where_examples": [], "fetch_via": "...", "rough_size_hint": None},
        ],
        "server_time": "2026-04-27T17:30:00Z",
    }
    with patch("cli.commands.catalog.api_get_json", return_value=payload):
        from cli.main import app as cli_app
        runner = CliRunner()
        result = runner.invoke(cli_app, ["catalog", "--json"])
    assert result.exit_code == 0
    out = json.loads(result.stdout)
    assert out["tables"][0]["id"] == "orders"


def test_da_catalog_table_output(monkeypatch):
    payload = {
        "tables": [
            {"id": "orders", "name": "orders", "source_type": "keboola",
             "query_mode": "local", "sql_flavor": "duckdb",
             "where_examples": [], "fetch_via": "...", "rough_size_hint": None},
        ],
        "server_time": "2026-04-27T17:30:00Z",
    }
    with patch("cli.commands.catalog.api_get_json", return_value=payload):
        from cli.main import app as cli_app
        runner = CliRunner()
        result = runner.invoke(cli_app, ["catalog"])
    assert result.exit_code == 0
    assert "orders" in result.stdout
    assert "keboola" in result.stdout
  • Step 15.2: Run tests to verify failure

Run: pytest tests/test_cli_catalog.py -v Expected: FAIL — cli.commands.catalog doesn't exist.

  • Step 15.3: Implement cli/commands/catalog.py
"""`da catalog` — list registered tables (spec §4.1)."""

import json as json_lib
import typer
from cli.v2_client import api_get_json, V2ClientError

catalog_app = typer.Typer(help="List tables visible to you")


@catalog_app.callback(invoke_without_command=True)
def catalog(
    ctx: typer.Context,
    json: bool = typer.Option(False, "--json", help="Emit raw JSON"),
    refresh: bool = typer.Option(False, "--refresh", help="Bypass client-side cache"),
):
    """List tables visible to you (RBAC-filtered)."""
    if ctx.invoked_subcommand is not None:
        return
    try:
        data = api_get_json("/api/v2/catalog", refresh=int(refresh))
    except V2ClientError as e:
        typer.echo(f"Error: catalog fetch failed: {e}", err=True)
        raise typer.Exit(5)

    if json:
        typer.echo(json_lib.dumps(data, indent=2))
        return
    # Human-readable table
    typer.echo(f"{'ID':30s}  {'SOURCE':10s}  {'MODE':8s}  {'FLAVOR':10s}  NAME")
    for t in data.get("tables", []):
        typer.echo(
            f"{t['id']:30s}  {t['source_type']:10s}  {t['query_mode']:8s}  "
            f"{t['sql_flavor']:10s}  {t.get('name', '')}"
        )
  • Step 15.4: Implement cli/commands/schema.py
"""`da schema <table>` — show columns + BQ flavor hints (spec §4.1)."""

import json as json_lib
import typer
from cli.v2_client import api_get_json, V2ClientError

schema_app = typer.Typer(help="Show column metadata for a table")


@schema_app.callback(invoke_without_command=True)
def schema(
    ctx: typer.Context,
    table_id: str = typer.Argument(...),
    json: bool = typer.Option(False, "--json"),
):
    """Show column metadata for a table."""
    if ctx.invoked_subcommand is not None:
        return
    try:
        data = api_get_json(f"/api/v2/schema/{table_id}")
    except V2ClientError as e:
        typer.echo(f"Error: schema fetch failed: {e}", err=True)
        raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2)

    if json:
        typer.echo(json_lib.dumps(data, indent=2))
        return

    flavor = data.get("sql_flavor", "duckdb")
    typer.echo(f"Table: {data['table_id']}  ({data['source_type']} — use {flavor.upper()} SQL dialect)")
    typer.echo("")
    typer.echo(f"{'COLUMN':30s}  {'TYPE':15s}  {'NULL':5s}  DESCRIPTION")
    for c in data.get("columns", []):
        typer.echo(
            f"{c['name']:30s}  {c['type']:15s}  "
            f"{'YES' if c.get('nullable') else 'NO':5s}  {c.get('description', '')}"
        )
    if data.get("partition_by"):
        typer.echo(f"\\nPartition: {data['partition_by']}")
    if data.get("clustered_by"):
        typer.echo(f"Clustered: {', '.join(data['clustered_by'])}")
    if data.get("where_dialect_hints"):
        typer.echo("\\nWHERE dialect hints:")
        for k, v in data["where_dialect_hints"].items():
            typer.echo(f"  {k:25s}  {v}")
  • Step 15.5: Implement cli/commands/describe.py
"""`da describe <table>` — schema + sample rows (spec §4.1)."""

import json as json_lib
import typer
from cli.v2_client import api_get_json, V2ClientError

describe_app = typer.Typer(help="Show schema + sample rows for a table")


@describe_app.callback(invoke_without_command=True)
def describe(
    ctx: typer.Context,
    table_id: str = typer.Argument(...),
    n: int = typer.Option(5, "-n", "--rows", help="Sample rows count"),
    json: bool = typer.Option(False, "--json"),
):
    """Show schema + sample rows for a table."""
    if ctx.invoked_subcommand is not None:
        return
    try:
        sch = api_get_json(f"/api/v2/schema/{table_id}")
        sam = api_get_json(f"/api/v2/sample/{table_id}", n=n)
    except V2ClientError as e:
        typer.echo(f"Error: describe failed: {e}", err=True)
        raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2)

    if json:
        typer.echo(json_lib.dumps({"schema": sch, "sample": sam}, indent=2, default=str))
        return

    # Reuse schema printing
    from cli.commands.schema import schema as schema_cmd
    typer.echo(f"Table: {sch['table_id']}")
    typer.echo("")
    typer.echo("Schema:")
    for c in sch.get("columns", []):
        typer.echo(f"  {c['name']:30s} {c['type']}")
    typer.echo("")
    typer.echo(f"Sample ({len(sam.get('rows', []))} rows):")
    for row in sam.get("rows", []):
        typer.echo(f"  {row}")
  • Step 15.6: Register commands in cli/main.py

Find where other subcommands are added (app.add_typer(...) calls). Add:

from cli.commands.catalog import catalog_app
from cli.commands.schema import schema_app
from cli.commands.describe import describe_app

app.add_typer(catalog_app, name="catalog")
app.add_typer(schema_app, name="schema")
app.add_typer(describe_app, name="describe")

(Adjust based on the actual cli/main.py pattern — could be flat commands instead of typers.)

  • Step 15.7: Run tests to verify pass

Run: pytest tests/test_cli_catalog.py -v Expected: 2 passed.

  • Step 15.8: Commit
git add cli/commands/catalog.py cli/commands/schema.py cli/commands/describe.py cli/main.py tests/test_cli_catalog.py
git commit -m "feat(cli): da catalog/schema/describe — discovery commands"

Task 16: da fetch

Spec §4.2. The headline command.

Files:

  • Create: cli/commands/fetch.py

  • Modify: cli/main.py

  • Test: tests/test_cli_fetch.py

  • Step 16.1: Write failing tests

# tests/test_cli_fetch.py
from typer.testing import CliRunner
from unittest.mock import patch, MagicMock
import pyarrow as pa
import json
import pytest


def _seed_local_dir(tmp_path):
    """Set up the user's agnes-data directory for the CLI to find."""
    (tmp_path / "user" / "duckdb").mkdir(parents=True)
    (tmp_path / "user" / "snapshots").mkdir(parents=True)
    return tmp_path


@pytest.fixture
def cli_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DA_LOCAL_DIR", str(_seed_local_dir(tmp_path)))
    yield tmp_path


class TestDaFetch:
    def test_estimate_only_does_not_create_snapshot(self, cli_env, monkeypatch):
        from cli.main import app as cli_app
        with patch("cli.commands.fetch.api_post_json") as m:
            m.return_value = {
                "estimated_scan_bytes": 1_000_000,
                "estimated_result_rows": 100,
                "estimated_result_bytes": 1_000,
                "bq_cost_estimate_usd": 0.0001,
            }
            runner = CliRunner()
            result = runner.invoke(cli_app, [
                "fetch", "bq_view",
                "--select", "a,b",
                "--where", "a > 1",
                "--limit", "100",
                "--estimate",
            ])
        assert result.exit_code == 0, result.stdout
        # No parquet should be created
        assert not list((cli_env / "user" / "snapshots").glob("*.parquet"))

    def test_fetch_creates_snapshot_with_meta(self, cli_env, monkeypatch):
        from cli.main import app as cli_app
        # Estimate path
        with patch("cli.commands.fetch.api_post_json") as m_est, \
             patch("cli.commands.fetch.api_post_arrow") as m_scan:
            m_est.return_value = {
                "estimated_scan_bytes": 1000,
                "estimated_result_rows": 2,
                "estimated_result_bytes": 100,
                "bq_cost_estimate_usd": 0.0,
            }
            m_scan.return_value = pa.table({"a": [1, 2], "b": ["x", "y"]})
            runner = CliRunner()
            result = runner.invoke(cli_app, [
                "fetch", "bq_view",
                "--select", "a,b",
                "--limit", "10",
                "--no-estimate",
            ])
        assert result.exit_code == 0, result.stdout
        snap = cli_env / "user" / "snapshots" / "bq_view.parquet"
        meta = cli_env / "user" / "snapshots" / "bq_view.meta.json"
        assert snap.exists()
        assert meta.exists()
        assert json.loads(meta.read_text())["rows"] == 2

    def test_fetch_existing_snapshot_without_force_fails(self, cli_env, monkeypatch):
        from cli.main import app as cli_app
        # Pre-create a snapshot
        snap = cli_env / "user" / "snapshots" / "bq_view.parquet"
        snap.write_bytes(b"PAR1\\x00\\x00PAR1")
        meta = cli_env / "user" / "snapshots" / "bq_view.meta.json"
        meta.write_text('{"name": "bq_view", "table_id": "bq_view", "select": null, "where": null, "limit": null, "order_by": null, "fetched_at": "x", "effective_as_of": "x", "rows": 0, "bytes_local": 0, "estimated_scan_bytes_at_fetch": 0, "result_hash_md5": ""}')

        runner = CliRunner()
        result = runner.invoke(cli_app, ["fetch", "bq_view", "--no-estimate"])
        assert result.exit_code == 6, f"expected exit code 6 (snapshot_exists); got {result.exit_code}\\n{result.stdout}"
  • Step 16.2: Run tests to verify failure

Run: pytest tests/test_cli_fetch.py -v Expected: FAIL — cli.commands.fetch doesn't exist.

  • Step 16.3: Implement cli/commands/fetch.py
"""`da fetch` — materialize a filtered subset of a remote table locally (spec §4.2)."""

from __future__ import annotations
import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path

import duckdb
import pyarrow as pa
import pyarrow.parquet as pq
import typer

from cli.snapshot_meta import (
    SnapshotMeta, write_meta, read_meta, snapshot_lock,
)
from cli.v2_client import api_post_json, api_post_arrow, V2ClientError

fetch_app = typer.Typer(help="Fetch a filtered subset of a remote table locally")


def _local_dir() -> Path:
    return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve()


def _print_estimate(d: dict) -> None:
    typer.echo(f"  estimated_scan_bytes:   {d.get('estimated_scan_bytes', 0):>15,} bytes")
    typer.echo(f"  estimated_result_rows:  {d.get('estimated_result_rows', 0):>15,}")
    typer.echo(f"  estimated_result_bytes: {d.get('estimated_result_bytes', 0):>15,} bytes")
    typer.echo(f"  bq_cost_estimate_usd:   $ {d.get('bq_cost_estimate_usd', 0):.4f}")


@fetch_app.callback(invoke_without_command=True)
def fetch(
    ctx: typer.Context,
    table_id: str = typer.Argument(...),
    select: str = typer.Option(None, "--select", help="Comma-separated column list"),
    where: str = typer.Option(None, "--where", help="WHERE predicate (BQ flavor for remote tables)"),
    limit: int = typer.Option(None, "--limit"),
    order_by: str = typer.Option(None, "--order-by", help="Comma-separated"),
    as_name: str = typer.Option(None, "--as", help="Local snapshot name (default: <table_id>)"),
    estimate: bool = typer.Option(False, "--estimate", help="Run dry-run only, do not fetch"),
    no_estimate: bool = typer.Option(False, "--no-estimate", help="Skip the pre-fetch estimate"),
    force: bool = typer.Option(False, "--force", help="Overwrite existing snapshot of the same name"),
):
    """Fetch a filtered subset of a remote table locally."""
    if ctx.invoked_subcommand is not None:
        return

    name = as_name or table_id
    snap_dir = _local_dir() / "user" / "snapshots"
    snap_dir.mkdir(parents=True, exist_ok=True)

    # Build request
    req = {"table_id": table_id}
    if select:
        req["select"] = [c.strip() for c in select.split(",") if c.strip()]
    if where:
        req["where"] = where
    if limit:
        req["limit"] = int(limit)
    if order_by:
        req["order_by"] = [c.strip() for c in order_by.split(",") if c.strip()]

    # Estimate (always shown unless --no-estimate)
    if not no_estimate:
        try:
            est = api_post_json("/api/v2/scan/estimate", req)
        except V2ClientError as e:
            typer.echo(f"Error: estimate failed: {e}", err=True)
            raise typer.Exit(_exit_code_for(e))
        typer.echo(f"Estimate for {table_id}:")
        _print_estimate(est)
        if estimate:
            return

    # Snapshot existence check
    if not force and read_meta(snap_dir, name) is not None:
        existing = read_meta(snap_dir, name)
        typer.echo(
            f"Error: snapshot {name!r} already exists "
            f"(fetched {existing.fetched_at}, {existing.rows:,} rows). "
            f"Pass --force to overwrite, or 'da snapshot refresh {name}' to update in place.",
            err=True,
        )
        raise typer.Exit(6)

    # Fetch
    try:
        table = api_post_arrow("/api/v2/scan", req)
    except V2ClientError as e:
        typer.echo(f"Error: fetch failed: {e}", err=True)
        raise typer.Exit(_exit_code_for(e))

    # Install under flock
    parquet_path = snap_dir / f"{name}.parquet"
    with snapshot_lock(snap_dir):
        pq.write_table(table, parquet_path)
        # Register view in user analytics.duckdb
        local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
        local_db.parent.mkdir(parents=True, exist_ok=True)
        conn = duckdb.connect(str(local_db))
        try:
            conn.execute(
                f'CREATE OR REPLACE VIEW "{name}" AS SELECT * FROM read_parquet(?)',
                [str(parquet_path)],
            )
        finally:
            conn.close()

        # Compute hash + write meta
        result_hash = hashlib.md5(parquet_path.read_bytes()[:1_000_000]).hexdigest()
        now = datetime.now(timezone.utc).isoformat()
        meta = SnapshotMeta(
            name=name, table_id=table_id,
            select=req.get("select"), where=req.get("where"),
            limit=req.get("limit"), order_by=req.get("order_by"),
            fetched_at=now, effective_as_of=now,
            rows=int(table.num_rows),
            bytes_local=parquet_path.stat().st_size,
            estimated_scan_bytes_at_fetch=int(est.get("estimated_scan_bytes", 0)) if not no_estimate else 0,
            result_hash_md5=result_hash,
        )
        write_meta(snap_dir, meta)

    typer.echo(f"Fetched {table.num_rows:,} rows -> {name}")


def _exit_code_for(e: V2ClientError) -> int:
    if e.status_code == 400:
        # Inspect body for 'kind'
        body = e.body if isinstance(e.body, dict) else {}
        if body.get("error") == "validator_rejected":
            return 2
        return 2
    if e.status_code == 401:
        return 7
    if e.status_code == 403:
        return 8
    if e.status_code == 404:
        return 8  # treat unknown table as RBAC-equivalent
    if e.status_code == 429:
        return 3
    if e.status_code >= 500:
        return 5
    return 9
  • Step 16.4: Register in cli/main.py
from cli.commands.fetch import fetch_app
app.add_typer(fetch_app, name="fetch")
  • Step 16.5: Run tests to verify pass

Run: pytest tests/test_cli_fetch.py -v Expected: 3 passed.

  • Step 16.6: Commit
git add cli/commands/fetch.py cli/main.py tests/test_cli_fetch.py
git commit -m "feat(cli): da fetch — materialize filtered remote subset locally"

Task 17: da snapshot list/refresh/drop/prune

Spec §4.2.

Files:

  • Create: cli/commands/snapshot.py

  • Modify: cli/main.py

  • Test: tests/test_cli_snapshot.py

  • Step 17.1: Write failing tests

# tests/test_cli_snapshot.py
from typer.testing import CliRunner
from unittest.mock import patch
import json
import pytest

from cli.snapshot_meta import SnapshotMeta, write_meta


@pytest.fixture
def cli_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DA_LOCAL_DIR", str(tmp_path))
    snap_dir = tmp_path / "user" / "snapshots"
    snap_dir.mkdir(parents=True)
    yield tmp_path


def _seed_meta(tmp_path, name="cz_recent", rows=100):
    snap_dir = tmp_path / "user" / "snapshots"
    parquet = snap_dir / f"{name}.parquet"
    parquet.write_bytes(b"PAR1\\x00\\x00PAR1")
    write_meta(snap_dir, SnapshotMeta(
        name=name, table_id="bq_view", select=None, where=None, limit=None, order_by=None,
        fetched_at="2026-04-27T10:00:00+00:00",
        effective_as_of="2026-04-27T10:00:00+00:00",
        rows=rows, bytes_local=parquet.stat().st_size,
        estimated_scan_bytes_at_fetch=0, result_hash_md5="abc",
    ))


class TestSnapshotList:
    def test_list_empty(self, cli_env):
        from cli.main import app as cli_app
        runner = CliRunner()
        result = runner.invoke(cli_app, ["snapshot", "list"])
        assert result.exit_code == 0


class TestSnapshotDrop:
    def test_drop_removes_files(self, cli_env):
        from cli.main import app as cli_app
        _seed_meta(cli_env, "cz_recent")
        snap_dir = cli_env / "user" / "snapshots"
        assert (snap_dir / "cz_recent.parquet").exists()

        runner = CliRunner()
        result = runner.invoke(cli_app, ["snapshot", "drop", "cz_recent"])
        assert result.exit_code == 0
        assert not (snap_dir / "cz_recent.parquet").exists()
        assert not (snap_dir / "cz_recent.meta.json").exists()

    def test_drop_missing_returns_2(self, cli_env):
        from cli.main import app as cli_app
        runner = CliRunner()
        result = runner.invoke(cli_app, ["snapshot", "drop", "nonexistent"])
        assert result.exit_code != 0
  • Step 17.2: Run tests to verify failure

Run: pytest tests/test_cli_snapshot.py -v Expected: FAIL — module doesn't exist.

  • Step 17.3: Implement cli/commands/snapshot.py
"""`da snapshot list/refresh/drop/prune` (spec §4.2)."""

from __future__ import annotations
import hashlib
import os
import json as json_lib
from datetime import datetime, timedelta, timezone
from pathlib import Path

import duckdb
import pyarrow.parquet as pq
import typer

from cli.snapshot_meta import (
    list_snapshots, read_meta, write_meta, delete_snapshot,
    snapshot_lock, SnapshotMeta,
)
from cli.v2_client import api_post_arrow, V2ClientError

snapshot_app = typer.Typer(help="Manage local snapshots")


def _local_dir() -> Path:
    return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve()


def _snap_dir() -> Path:
    return _local_dir() / "user" / "snapshots"


def _format_size(n: int) -> str:
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n //= 1024
    return f"{n} TB"


@snapshot_app.command("list")
def list_cmd(
    json: bool = typer.Option(False, "--json"),
):
    """List local snapshots."""
    snaps = list_snapshots(_snap_dir())
    if json:
        typer.echo(json_lib.dumps([s.__dict__ for s in snaps], indent=2))
        return
    if not snaps:
        typer.echo("(no snapshots)")
        return
    typer.echo(f"{'NAME':30s}  {'ROWS':>10s}  {'SIZE':>10s}  {'AGE':>10s}  {'TABLE':30s}  WHERE")
    now = datetime.now(timezone.utc)
    for s in sorted(snaps, key=lambda x: x.name):
        try:
            age = now - datetime.fromisoformat(s.fetched_at.replace("Z", "+00:00"))
            age_str = f"{age.days}d" if age.days else f"{int(age.total_seconds() // 3600)}h"
        except (ValueError, TypeError):
            age_str = "?"
        where = (s.where or "")[:40]
        typer.echo(
            f"{s.name:30s}  {s.rows:>10,}  {_format_size(s.bytes_local):>10s}  "
            f"{age_str:>10s}  {s.table_id:30s}  {where}"
        )


@snapshot_app.command("drop")
def drop_cmd(name: str):
    """Delete a snapshot."""
    snap_dir = _snap_dir()
    if read_meta(snap_dir, name) is None:
        typer.echo(f"Error: snapshot {name!r} not found", err=True)
        raise typer.Exit(2)

    with snapshot_lock(snap_dir):
        delete_snapshot(snap_dir, name)
        # Also drop the view from user analytics DB
        local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
        if local_db.exists():
            conn = duckdb.connect(str(local_db))
            try:
                conn.execute(f'DROP VIEW IF EXISTS "{name}"')
            finally:
                conn.close()
    typer.echo(f"Dropped {name}")


@snapshot_app.command("refresh")
def refresh_cmd(
    name: str,
    where: str = typer.Option(None, "--where", help="Override stored WHERE"),
):
    """Re-fetch a snapshot using its stored fetch parameters (spec §4.2)."""
    snap_dir = _snap_dir()
    meta = read_meta(snap_dir, name)
    if meta is None:
        typer.echo(f"Error: snapshot {name!r} not found", err=True)
        raise typer.Exit(2)

    req = {
        "table_id": meta.table_id,
        "select": meta.select,
        "where": where if where else meta.where,
        "limit": meta.limit,
        "order_by": meta.order_by,
    }
    try:
        table = api_post_arrow("/api/v2/scan", req)
    except V2ClientError as e:
        typer.echo(f"Error: refresh failed: {e}", err=True)
        raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2)

    parquet_path = snap_dir / f"{name}.parquet"
    with snapshot_lock(snap_dir):
        pq.write_table(table, parquet_path)
        new_hash = hashlib.md5(parquet_path.read_bytes()[:1_000_000]).hexdigest()
        identical = new_hash == meta.result_hash_md5
        old_rows = meta.rows
        old_bytes = meta.bytes_local
        new_rows = int(table.num_rows)
        new_bytes = parquet_path.stat().st_size
        now = datetime.now(timezone.utc).isoformat()
        new_meta = SnapshotMeta(
            name=name, table_id=meta.table_id,
            select=req.get("select"), where=req.get("where"),
            limit=req.get("limit"), order_by=req.get("order_by"),
            fetched_at=now, effective_as_of=now,
            rows=new_rows, bytes_local=new_bytes,
            estimated_scan_bytes_at_fetch=meta.estimated_scan_bytes_at_fetch,
            result_hash_md5=new_hash,
        )
        write_meta(snap_dir, new_meta)

    typer.echo(f"Refreshed {name}")
    typer.echo(f"  rows:           {old_rows:>10,}  ->  {new_rows:>10,}  ({new_rows - old_rows:+,})")
    typer.echo(f"  bytes_local:    {_format_size(old_bytes)}  ->  {_format_size(new_bytes)}")
    typer.echo(f"  effective_as_of:{meta.effective_as_of}  ->  {now}")
    typer.echo(f"  identical:      {'yes' if identical else 'no'}")


@snapshot_app.command("prune")
def prune_cmd(
    older_than: str = typer.Option(None, "--older-than", help="e.g. 7d, 24h"),
    larger_than: str = typer.Option(None, "--larger-than", help="e.g. 1g, 500m"),
    dry_run: bool = typer.Option(False, "--dry-run"),
):
    """Drop snapshots matching predicates."""
    snap_dir = _snap_dir()
    snaps = list_snapshots(snap_dir)

    matches = []
    for s in snaps:
        keep = True
        if older_than:
            try:
                age = datetime.now(timezone.utc) - datetime.fromisoformat(s.fetched_at.replace("Z", "+00:00"))
                threshold = _parse_duration(older_than)
                if age < threshold:
                    keep = False
            except ValueError:
                pass
        if larger_than:
            threshold = _parse_size(larger_than)
            if s.bytes_local < threshold:
                keep = False
        if not keep and (older_than or larger_than):
            continue
        if older_than or larger_than:
            matches.append(s)
    
    # When BOTH conditions provided, intersection. We've used `keep` to mean "both pass".
    # Simplified: re-compute with explicit AND
    matches = []
    for s in snaps:
        ok = True
        if older_than:
            try:
                age = datetime.now(timezone.utc) - datetime.fromisoformat(s.fetched_at.replace("Z", "+00:00"))
                if age < _parse_duration(older_than):
                    ok = False
            except (ValueError, TypeError):
                ok = False
        if larger_than and s.bytes_local < _parse_size(larger_than):
            ok = False
        if ok:
            matches.append(s)

    for s in matches:
        if dry_run:
            typer.echo(f"would drop: {s.name}  ({_format_size(s.bytes_local)}, {s.fetched_at})")
        else:
            with snapshot_lock(snap_dir):
                delete_snapshot(snap_dir, s.name)
            typer.echo(f"dropped: {s.name}")
    if not matches:
        typer.echo("(no matches)")


def _parse_duration(s: str) -> timedelta:
    s = s.strip().lower()
    if s.endswith("d"):
        return timedelta(days=int(s[:-1]))
    if s.endswith("h"):
        return timedelta(hours=int(s[:-1]))
    if s.endswith("m"):
        return timedelta(minutes=int(s[:-1]))
    raise ValueError(f"unknown duration: {s!r}")


def _parse_size(s: str) -> int:
    s = s.strip().lower()
    multipliers = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}
    if s[-1] in multipliers:
        return int(float(s[:-1]) * multipliers[s[-1]])
    return int(s)
  • Step 17.4: Register in cli/main.py
from cli.commands.snapshot import snapshot_app
app.add_typer(snapshot_app, name="snapshot")
  • Step 17.5: Run tests to verify pass

Run: pytest tests/test_cli_snapshot.py -v Expected: 3 passed.

  • Step 17.6: Commit
git add cli/commands/snapshot.py cli/main.py tests/test_cli_snapshot.py
git commit -m "feat(cli): da snapshot list/refresh/drop/prune"

Task 18: da disk-info

Spec §4.3. Trivial.

Files:

  • Create: cli/commands/disk_info.py

  • Modify: cli/main.py

  • Test: tests/test_cli_disk_info.py

  • Step 18.1: Write failing test

# tests/test_cli_disk_info.py
import os
from typer.testing import CliRunner
import pytest


@pytest.fixture
def cli_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DA_LOCAL_DIR", str(tmp_path))
    snap = tmp_path / "user" / "snapshots"
    snap.mkdir(parents=True)
    yield tmp_path


def test_disk_info_runs_and_reports(cli_env):
    (cli_env / "user" / "snapshots" / "x.parquet").write_bytes(b"A" * 1024)
    from cli.main import app as cli_app
    runner = CliRunner()
    result = runner.invoke(cli_app, ["disk-info"])
    assert result.exit_code == 0
    assert "Snapshots dir" in result.stdout
  • Step 18.2: Run test to verify failure

Run: pytest tests/test_cli_disk_info.py -v Expected: FAIL — module doesn't exist.

  • Step 18.3: Implement cli/commands/disk_info.py
"""`da disk-info` — show snapshot dir disk usage (spec §4.3)."""

import os
import shutil
from pathlib import Path
import typer

disk_info_app = typer.Typer(help="Show snapshot disk usage")


def _local_dir() -> Path:
    return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve()


def _format_size(n: int) -> str:
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n //= 1024
    return f"{n} TB"


@disk_info_app.callback(invoke_without_command=True)
def disk_info(
    ctx: typer.Context,
    json: bool = typer.Option(False, "--json"),
):
    """Show snapshots disk usage."""
    if ctx.invoked_subcommand is not None:
        return
    snap_dir = _local_dir() / "user" / "snapshots"
    used = sum(p.stat().st_size for p in snap_dir.rglob("*") if p.is_file()) if snap_dir.exists() else 0
    count = len(list(snap_dir.glob("*.parquet"))) if snap_dir.exists() else 0
    free = shutil.disk_usage(snap_dir).free if snap_dir.exists() else 0
    quota_gb = int(os.environ.get("AGNES_SNAPSHOT_QUOTA_GB", "10"))

    if json:
        import json as json_lib
        typer.echo(json_lib.dumps({
            "snapshots_dir": str(snap_dir),
            "used_bytes": used, "snapshot_count": count,
            "free_bytes": free, "quota_gb": quota_gb,
        }))
        return

    typer.echo(f"Snapshots dir:    {snap_dir}")
    typer.echo(f"Used by Agnes:    {_format_size(used)} across {count} snapshots")
    typer.echo(f"Free disk:        {_format_size(free)}")
    typer.echo(f"Configured cap:   {quota_gb} GB (set AGNES_SNAPSHOT_QUOTA_GB to override)")
  • Step 18.4: Register in cli/main.py
from cli.commands.disk_info import disk_info_app
app.add_typer(disk_info_app, name="disk-info")
  • Step 18.5: Run test to verify pass

Run: pytest tests/test_cli_disk_info.py -v Expected: 1 passed.

  • Step 18.6: Commit
git add cli/commands/disk_info.py cli/main.py tests/test_cli_disk_info.py
git commit -m "feat(cli): da disk-info — snapshot dir usage report"

Task 19: CLAUDE.md agent rails addendum

Spec §5.1.

Files:

  • Modify: CLAUDE.md

  • Step 19.1: Add the section

Open CLAUDE.md. Find a sensible location (after "Business Metrics" or before "Hybrid Queries"). Add the full section from spec §5.1. The literal markdown is large but verbatim from the spec — copy from docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md lines 437-528 (the "## Querying Agnes data — agent rails" block), and match the existing CLAUDE.md heading style.

  • Step 19.2: Verify tabs/style

Run: grep -n '^## ' CLAUDE.md — confirm new section is at H2 level and ordered sensibly with neighboring sections.

  • Step 19.3: Commit
git add CLAUDE.md
git commit -m "docs(claude.md): agent rails for v2 primitives (catalog -> schema -> fetch)"

Task 20: Standalone skill file agnes-data-querying

Spec §5.2.

Files:

  • Create: cli/skills/agnes-data-querying.md

  • Step 20.1: Write the skill

Mirror the CLAUDE.md addendum but framed as a skill. ~200 lines max. Structure:

---
name: agnes-data-querying
description: Query Agnes data correctly — discovery first (`da catalog` → `da schema``da describe`), then `da fetch` for remote tables, then `da query` locally. Use BigQuery SQL flavor for `--where` on remote tables.
---

# Agnes Data Querying

[full content per spec §5 + a quick-reference BQ flavor card]
  • Step 20.2: Verify path matches existing skill loader convention

Other skills in this repo live at cli/skills/<name>.md — confirm and adjust the spec path if convention is different (e.g. cli/skills/<name>/SKILL.md).

  • Step 20.3: Commit
git add cli/skills/agnes-data-querying.md
git commit -m "docs(skills): agnes-data-querying — agent rails for v2 fetch primitives"

Task 21: CHANGELOG **BREAKING** entry + instance.yaml.example knobs

Spec §6, §10.4. CLAUDE.md changelog discipline.

Files:

  • Modify: CHANGELOG.md

  • Modify: config/instance.yaml.example

  • Step 21.1: Add CHANGELOG entries

Under ## [Unreleased], add:

### Added
- `/api/v2/{catalog,schema,sample,scan,scan/estimate}` — discovery + scoped fetch primitives. See `docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md`.
- `da catalog`, `da schema`, `da describe`, `da fetch`, `da snapshot {list,refresh,drop,prune}`, `da disk-info` — CLI primitives backed by the v2 API.
- `cli/skills/agnes-data-querying.md` — Claude rails skill loaded for Agnes-flavored projects.
- `instance.yaml: api.scan.*` knobs (`max_limit`, `max_result_bytes`, `max_concurrent_per_user`, `max_daily_bytes_per_user`, `bq_cost_per_tb_usd`, `request_timeout_seconds`).

### Changed
- **BREAKING:** BigQuery views (`query_mode='remote'` with `_meta.query_mode='remote'`) are no longer wrapped as DuckDB master views in `analytics.duckdb`. `da query --remote "SELECT * FROM <bq_view>"` no longer resolves the view name; analysts must use `da fetch` to materialize a snapshot or `da query --remote "SELECT * FROM bigquery_query('proj', '<inner BQ SQL>')"` directly. To restore the previous behavior for one release cycle, set `instance.yaml: data_source.bigquery.legacy_wrap_views: true`.
  • Step 21.2: Add config knob defaults to instance.yaml.example

Append a section showing the new knobs with sensible defaults + comments.

  • Step 21.3: Commit
git add CHANGELOG.md config/instance.yaml.example
git commit -m "docs: CHANGELOG + instance.yaml.example for v2 fetch primitives (BREAKING)"

Task 22: E2E verification on dev VM

Spec §11 manual gates. Pure verification — no code changes.

  • Step 22.1: Push branch + wait for image rebuild
git push origin zs/test-bq-e2e
gh run watch --branch zs/test-bq-e2e
  • Step 22.2: Auto-upgrade VM + verify health
ssh foundryai-dev-zsrotyr   # or via gcloud compute ssh
sudo /usr/local/bin/agnes-auto-upgrade.sh
sudo docker exec agnes-app-1 curl -sS http://localhost:8000/api/health \
    | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["commit_sha"], d["schema_version"])'

Expected: commit sha matches the latest local commit; schema_version is current.

  • Step 22.3: Verify v2 endpoints respond
PAT=$(... your pat ...)
curl -sS -H "Authorization: Bearer $PAT" https://<vm-host>/api/v2/catalog | jq '.tables[].id'
curl -sS -H "Authorization: Bearer $PAT" https://<vm-host>/api/v2/schema/<some-bq-table> | jq
curl -sS -H "Authorization: Bearer $PAT" -X POST -H "Content-Type: application/json" \
    -d '{"table_id": "<some-bq-table>", "select": ["..."], "where": "...", "limit": 100}' \
    https://<vm-host>/api/v2/scan/estimate | jq

Expected: each returns 200 with reasonable JSON.

  • Step 22.4: Run da fetch end-to-end

From a laptop with da CLI installed (and pointed at the dev VM):

da catalog --json | jq '.tables[].id'
da schema <bq-table>
da fetch <bq-table> \
    --select <cols> \
    --where "<bq predicate>" \
    --limit 1000 \
    --as test_snap \
    --estimate
da fetch <bq-table> --select <cols> --where "<pred>" --limit 1000 --as test_snap --no-estimate
da query "SELECT COUNT(*) FROM test_snap"
da snapshot list
da snapshot drop test_snap

All commands should succeed. da query on the snapshot returns the expected row count.

  • Step 22.5: Verify Claude rails (manual gate)

Open Claude Code in a fresh session against the Agnes repo. Ask: "Show me the count of rows in <bq-table> for the last 7 days."

Expected agent flow (visible in the conversation):

  1. Run da catalog
  2. Run da schema <bq-table>
  3. Run da fetch <bq-table> --where "..." --limit ... --as ...
  4. Run da query "SELECT COUNT(*) FROM ..."
  5. Report the count

Repeat 2 more times in fresh sessions. Document the transcripts in the PR description.

  • Step 22.6: Open PR
gh pr create --title "v2 fetch primitives + Claude agent rails (#101)" --body "$(cat <<EOF
## Summary
Replaces the BQ-view-wrapping approach with primitives the Claude agent composes:
- `/api/v2/{catalog,schema,sample,scan,scan/estimate}` server endpoints
- `da catalog/schema/describe/fetch/snapshot/disk-info` CLI commands
- CLAUDE.md addendum + standalone skill for agent rails
- BREAKING: BQ view wrap dropped; `legacy_wrap_views` toggle for one release cycle

## Test plan
- [x] All unit tests green
- [x] CI on branch passes
- [x] E2E on dev VM: discover -> estimate -> fetch -> query loop in <2 min
- [x] Three unguided Claude sessions follow the protocol

## Closes
- #101 (BQ view-wrapping outer-query pushdown)

## Spec & plan
- docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md
- docs/superpowers/plans/2026-04-27-claude-fetch-primitives.md
EOF
)"
  • Step 22.7: Mark E2E gate complete in PR description

Update the PR body with:

  • Commit SHAs for each phase
  • Demo recording link or summary
  • Three Claude transcripts confirming agent rails are followed

Self-Review

Spec coverage:

  • §1 motivation, §2 architecture: covered by tasks 7-11 + 12 (drop wrap)
  • §3.0 identifier conventions: enforced via existing validate_identifier + Task 7+8+9 RBAC use of registry id
  • §3.1 catalog: Task 7
  • §3.2 schema: Task 8
  • §3.3 sample: Task 9
  • §3.4 scan: Task 11
  • §3.5 scan/estimate: Task 10
  • §3.6 caching: Task 5 + reuse in Tasks 7/8/9
  • §3.7 validator: Tasks 1-3
  • §3.8 quotas: Task 4 (used in Task 11)
  • §4.1 catalog/schema/describe CLI: Task 15
  • §4.2 fetch + snapshot mgmt: Tasks 16, 17
  • §4.3 disk-info: Task 18
  • §5 CLAUDE.md + skill: Tasks 19, 20
  • §6 migration: Task 12
  • §10 implementation contracts: distributed across tasks (audit shape in Task 11, exit codes in Tasks 16/17, error UX in _exit_code_for helpers, config knobs in Task 21)
  • §11 success criteria: Task 22

Placeholder scan: No "TBD"/"TODO" found.

Type consistency:

  • validate_where(predicate, table_id, schema) — same signature in tasks 1-3 and Task 11.
  • QuotaTracker.acquire(user) / record_bytes(user, n) — same in Task 4 + Task 11.
  • SnapshotMeta dataclass — same fields in Task 14 + Tasks 16-17.
  • ScanRequest pydantic model — same in Tasks 10 + 11.
  • _resolve_schema(conn, user, table_id, project_id) — defined in Task 10, reused in Task 11.

No drift detected.


Total tasks: 22 Estimated effort: ~16-18 person-days (1 dev) / 8-9 days (2 devs in parallel)