# Claude-Driven Fetch Primitives Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Replace the broken BQ-view-wrapping approach (issue #101) with primitive operations the Claude agent composes — `da catalog`, `da schema`, `da fetch`, `da snapshot *`, `da query` — backed by `/api/v2/{catalog,schema,sample,scan,scan/estimate}` server endpoints. Secrets stay server-side; agent does the planning.

**Architecture:** Two-tier query model (laptop DuckDB ↔ server DuckDB ↔ BQ) preserved. New v2 endpoints expose discovery + scoped scan. CLI commands materialize filtered subsets locally as parquet snapshots registered as DuckDB views. Server-side WHERE validator (sqlglot, allow-list-driven) is the security perimeter.

**Tech Stack:** Python 3.13, FastAPI, DuckDB, pyarrow (Arrow IPC over HTTP), sqlglot (server-side WHERE validation), `bigquery_query()` DuckDB BQ extension function, GCE metadata-token auth (#98 cache reused), pytest. No new dependencies beyond sqlglot (already optional in repo).

**Spec:** `docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md`

---

## File structure

**New files (server):**
- `app/api/where_validator.py` — sqlglot-backed WHERE clause validator (§3.7 of spec)
- `app/api/v2_quota.py` — process-local concurrent + daily-byte quota tracker
- `app/api/v2_cache.py` — LRU+TTL cache helper for catalog/schema/sample
- `app/api/v2_arrow.py` — Arrow IPC streaming helper (response builder)
- `app/api/v2_catalog.py` — `GET /api/v2/catalog`
- `app/api/v2_schema.py` — `GET /api/v2/schema/{table_id}`
- `app/api/v2_sample.py` — `GET /api/v2/sample/{table_id}`
- `app/api/v2_scan.py` — `POST /api/v2/scan` + `POST /api/v2/scan/estimate`

**New files (client):**
- `cli/v2_client.py` — Arrow over HTTP client + JSON request helpers
- `cli/snapshot_meta.py` — sidecar JSON I/O + flock helper
- `cli/commands/fetch.py` — `da fetch`
- `cli/commands/snapshot.py` — `da snapshot list/refresh/drop/prune`
- `cli/commands/catalog.py` — `da catalog`
- `cli/commands/schema.py` — `da schema`
- `cli/commands/describe.py` — `da describe`
- `cli/commands/disk_info.py` — `da disk-info`

**New files (tests):**
- `tests/test_where_validator.py` — adversarial corpus (50+ cases)
- `tests/test_v2_quota.py`, `tests/test_v2_cache.py`, `tests/test_v2_arrow.py`
- `tests/test_v2_catalog.py`, `tests/test_v2_schema.py`, `tests/test_v2_sample.py`, `tests/test_v2_scan.py`, `tests/test_v2_scan_estimate.py`
- `tests/test_cli_fetch.py`, `tests/test_cli_snapshot.py`, `tests/test_cli_catalog.py`, `tests/test_cli_schema.py`, `tests/test_cli_describe.py`, `tests/test_cli_disk_info.py`
- `tests/test_snapshot_meta.py`, `tests/test_v2_client.py`

**New files (docs/skill):**
- `cli/skills/agnes-data-querying.md` — agent rails skill (§5.2)

**Modified files:**
- `app/main.py` — register v2 routers
- `cli/main.py` — register new command groups
- `connectors/bigquery/extractor.py` — drop wrap-view code path for VIEW entities + `legacy_wrap_views` toggle
- `tests/test_bigquery_extractor.py` — update tests for legacy toggle
- `CLAUDE.md` — agent rails addendum (§5.1)
- `CHANGELOG.md` — `**BREAKING**` entry under `[Unreleased]`
- `config/instance.yaml.example` — new `api.scan.*` knobs

---

## Task 1: WHERE validator — parser + structural rejects

Foundation for `/api/v2/scan` security perimeter. Spec §3.7 part 1.

**Files:**
- Create: `app/api/where_validator.py`
- Test: `tests/test_where_validator.py`

- [ ] **Step 1.1: Write failing tests for parse + structural rejects**

```python
# tests/test_where_validator.py
"""Adversarial test corpus for the WHERE clause validator (spec §3.7)."""
import pytest

from app.api.where_validator import (
    validate_where,
    WhereValidationError,
    REJECT_NESTED_SELECT,
    REJECT_MULTI_STATEMENT,
    REJECT_DDL_DML,
    REJECT_PARSE,
    REJECT_CROSS_TABLE,
    REJECT_UNKNOWN_FUNCTION,  # used by the tests appended in Task 2
    REJECT_UNKNOWN_COLUMN,  # used by the tests appended in Task 3
)

# A schema-like dict the validator uses to verify column references.
SCHEMA = {
    "event_date": "DATE",
    "country_code": "STRING",
    "session_id": "STRING",
    "amount": "INT64",
}
TABLE_ID = "web_sessions_example"


class TestParse:
    def test_empty_string_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_PARSE

    def test_unparseable_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("SELECT * FROM", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_PARSE


class TestStructural:
    def test_nested_select_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                "country_code IN (SELECT country FROM other_table)",
                TABLE_ID,
                SCHEMA,
            )
        assert e.value.kind == REJECT_NESTED_SELECT

    def test_multi_statement_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("amount = 1; DROP TABLE x", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_MULTI_STATEMENT

    def test_drop_table_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("amount = (DROP TABLE x)", TABLE_ID, SCHEMA)
        assert e.value.kind in (REJECT_DDL_DML, REJECT_PARSE)

    def test_cross_table_reference_rejected(self):
        """Predicates may only reference the target table."""
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                "other_table.id = 1",
                TABLE_ID,
                SCHEMA,
            )
        assert e.value.kind == REJECT_CROSS_TABLE
```

- [ ] **Step 1.2: Run tests to verify failure**

Run: `pytest tests/test_where_validator.py::TestParse tests/test_where_validator.py::TestStructural -v`
Expected: FAIL with `ModuleNotFoundError: No module named 'app.api.where_validator'`

- [ ] **Step 1.3: Implement parser + structural validator**

Create `app/api/where_validator.py`:

```python
"""WHERE clause validator for /api/v2/scan.

Single security perimeter — every analyst-supplied predicate flows through
here before reaching BigQuery. Allow-list-driven; explicit rejection codes
per spec §3.7.
"""
from __future__ import annotations

import logging
from dataclasses import dataclass
from typing import Mapping

import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

logger = logging.getLogger(__name__)

# Rejection kind codes (stable; used by callers + tests + audit log)
REJECT_PARSE = "parse_error"
REJECT_NESTED_SELECT = "nested_select"
REJECT_MULTI_STATEMENT = "multi_statement"
REJECT_DDL_DML = "ddl_or_dml"
REJECT_CROSS_TABLE = "cross_table_reference"
REJECT_UNKNOWN_FUNCTION = "unknown_function"
REJECT_UNKNOWN_COLUMN = "unknown_column"
REJECT_DISALLOWED_NODE = "disallowed_node"


@dataclass
class WhereValidationError(Exception):
    kind: str
    message: str
    detail: dict | None = None

    def __str__(self) -> str:
        return f"[{self.kind}] {self.message}"


# Nodes that imply DDL/DML (rejected outright).
_DDL_DML_NODES = (
    exp.Insert, exp.Update, exp.Delete, exp.Drop, exp.Truncate,
    exp.Alter, exp.Create, exp.Copy, exp.Merge,
)


def validate_where(
    predicate: str,
    table_id: str,
    schema: Mapping[str, str],
) -> exp.Expression:
    """Validate a WHERE-clause fragment.

    Args:
        predicate: SQL fragment (without leading 'WHERE').
        table_id: target table id; cross-table references rejected.
        schema: {column_name: type} for the target table.

    Returns:
        Parsed sqlglot expression tree (caller may re-stringify or inspect).

    Raises:
        WhereValidationError: with .kind set to one of the REJECT_* codes.
    """
    if not predicate or not predicate.strip():
        raise WhereValidationError(REJECT_PARSE, "empty predicate")

    # Multi-statement detection: BQ statements separated by ';' would parse
    # as multiple expressions in sqlglot.parse() (returns a list).
    try:
        statements = sqlglot.parse(f"SELECT 1 FROM t WHERE {predicate}", dialect="bigquery")
    except sqlglot.errors.SqlglotError as e:
        # SqlglotError is the base class of ParseError and TokenError, so
        # tokenizer failures (e.g. an unterminated string) are rejected too.
        raise WhereValidationError(REJECT_PARSE, f"parse failed: {e}") from e

    if statements is None or len(statements) != 1 or statements[0] is None:
        raise WhereValidationError(REJECT_MULTI_STATEMENT, "multi-statement input not allowed")

    select = statements[0]
    where = select.find(exp.Where)
    if where is None:
        raise WhereValidationError(REJECT_PARSE, "no WHERE expression found in parsed input")

    _walk_structural(where, table_id, schema)
    return where


def _walk_structural(node: exp.Expression, table_id: str, schema: Mapping[str, str]) -> None:
    """Walk the WHERE AST and reject disallowed structures."""
    for sub in node.walk():
        # `node.walk()` yields the node itself first; check structural rules.
        if isinstance(sub, exp.Subquery) or (isinstance(sub, exp.Select) and sub is not node):
            raise WhereValidationError(REJECT_NESTED_SELECT, "nested SELECT/subquery not allowed")
        if isinstance(sub, _DDL_DML_NODES):
            raise WhereValidationError(REJECT_DDL_DML, f"DDL/DML node {type(sub).__name__} not allowed")

    # Cross-table reference detection: a column whose qualifier is anything
    # other than the target table_id is rejected. Unqualified columns pass.
    for col in node.find_all(exp.Column):
        qualifier = col.table  # e.g. "other_table" in `other_table.id`
        if qualifier and qualifier.lower() != table_id.lower():
            raise WhereValidationError(
                REJECT_CROSS_TABLE,
                f"column {col.sql()} references table {qualifier!r}, expected {table_id!r}",
            )
```

- [ ] **Step 1.4: Run tests to verify pass**

Run: `pytest tests/test_where_validator.py::TestParse tests/test_where_validator.py::TestStructural -v`
Expected: 6 passed

- [ ] **Step 1.5: Commit**

```bash
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): WHERE clause parser + structural rejects"
```

---

## Task 2: WHERE validator — function allow-list

Spec §3.7 enumerated function set. Reject unknown functions with explicit name in error.

**Files:**
- Modify: `app/api/where_validator.py`
- Modify: `tests/test_where_validator.py`

- [ ] **Step 2.1: Append failing tests**

```python
# tests/test_where_validator.py (append after TestStructural)
# Requires REJECT_UNKNOWN_FUNCTION in the import block at the top of the file.
class TestFunctionAllowList:
    @pytest.mark.parametrize(
        "predicate",
        [
            # Comparison
            "amount = 1",
            "amount != 1",
            "amount IS NULL",
            "amount IS NOT NULL",
            "country_code IN ('CZ', 'SK')",
            "amount BETWEEN 1 AND 100",
            "country_code LIKE 'C%'",
            "country_code NOT LIKE 'X%'",
            # Boolean
            "amount = 1 AND country_code = 'CZ'",
            "amount = 1 OR amount = 2",
            "NOT (amount = 1)",
            # Date/Time
            "event_date > DATE '2026-01-01'",
            "event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
            "EXTRACT(YEAR FROM event_date) = 2026",
            # String
            "STARTS_WITH(country_code, 'C')",
            "REGEXP_CONTAINS(country_code, r'C[ZS]')",
            "LENGTH(country_code) = 2",
            # Math
            "amount > ABS(-5)",
            "amount BETWEEN GREATEST(0, 10) AND LEAST(100, 200)",
            # Cast
            "CAST(country_code AS STRING) = 'CZ'",
            # Conditional
            "IFNULL(country_code, 'XX') = 'CZ'",
            "COALESCE(amount, 0) > 0",
        ],
    )
    def test_allowed_predicate(self, predicate):
        validate_where(predicate, TABLE_ID, SCHEMA)  # must not raise

    @pytest.mark.parametrize(
        "predicate,expected_func",
        [
            ("amount = EXTERNAL_QUERY('connection', 'SELECT 1')", "EXTERNAL_QUERY"),
            ("country_code = SESSION_USER()", "SESSION_USER"),
            ("amount = ML.PREDICT(MODEL m, TABLE t)", "ML.PREDICT"),
            ("amount = OBSCURE_BUILTIN(country_code)", "OBSCURE_BUILTIN"),
            ("amount = ARRAY_AGG(amount)", "ARRAY_AGG"),
            ("amount = ROW_NUMBER() OVER (PARTITION BY country_code)", "ROW_NUMBER"),
        ],
    )
    def test_disallowed_function(self, predicate, expected_func):
        with pytest.raises(WhereValidationError) as e:
            validate_where(predicate, TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_UNKNOWN_FUNCTION
        # The rejected function name must appear in detail or message
        assert expected_func.upper() in str(e.value).upper() or (
            e.value.detail and expected_func.upper() in str(e.value.detail).upper()
        )
```

- [ ] **Step 2.2: Run tests to verify failure**

Run: `pytest tests/test_where_validator.py::TestFunctionAllowList -v`
Expected: FAIL — current validator allows any function (no allow-list yet).

- [ ] **Step 2.3: Add allow-list + function check**

In `app/api/where_validator.py`, after the imports and `_DDL_DML_NODES`:

```python
# v1 BigQuery function allow-list (spec §3.7). Stored as upper-case names.
# Categorized for documentation; merged into one set for membership check.
_ALLOW_FUNCTIONS_COMPARISON = {
    # Operators are AST nodes, not functions; not listed here.
}
_ALLOW_FUNCTIONS_DATETIME = {
    "CURRENT_DATE", "CURRENT_TIMESTAMP", "CURRENT_TIME",
    "DATE", "DATETIME", "TIMESTAMP", "TIME",
    "DATE_ADD", "DATE_SUB", "DATE_DIFF", "DATE_TRUNC",
    "EXTRACT", "FORMAT_DATE", "FORMAT_TIMESTAMP",
    "PARSE_DATE", "PARSE_TIMESTAMP", "UNIX_SECONDS", "UNIX_MILLIS",
}
_ALLOW_FUNCTIONS_STRING = {
    "CONCAT", "LENGTH", "LOWER", "UPPER", "SUBSTR", "SUBSTRING",
    "TRIM", "LTRIM", "RTRIM", "REPLACE",
    "STARTS_WITH", "ENDS_WITH", "CONTAINS_SUBSTR",
    "REGEXP_CONTAINS", "REGEXP_EXTRACT", "SAFE_CAST",
}
_ALLOW_FUNCTIONS_MATH = {
    "ABS", "CEIL", "FLOOR", "ROUND", "MOD", "POWER", "SQRT",
    "LOG", "LN", "EXP", "SIGN", "GREATEST", "LEAST",
}
_ALLOW_FUNCTIONS_CAST = {"CAST"}
_ALLOW_FUNCTIONS_CONDITIONAL = {"IF", "IFNULL", "COALESCE", "NULLIF", "CASE"}

ALLOWED_FUNCTIONS: frozenset[str] = frozenset(
    _ALLOW_FUNCTIONS_DATETIME
    | _ALLOW_FUNCTIONS_STRING
    | _ALLOW_FUNCTIONS_MATH
    | _ALLOW_FUNCTIONS_CAST
    | _ALLOW_FUNCTIONS_CONDITIONAL
)

# CAST target types allowed
_ALLOW_CAST_TYPES = {
    "INT64", "FLOAT64", "NUMERIC", "STRING", "BYTES", "BOOL",
    "DATE", "DATETIME", "TIMESTAMP", "TIME", "DECIMAL", "BIGNUMERIC",
}
```

Then add `_walk_functions()` and call it from `_walk_structural`. Add at the end of `_walk_structural`:

```python
    _walk_functions(node)
```

And the new helpers:

```python
# sqlglot parses some BigQuery functions into canonically named nodes
# (REGEXP_CONTAINS becomes exp.RegexpLike). Map those names back to the
# BigQuery spelling; extend this if the test corpus surfaces other renames,
# since node names can differ between sqlglot versions.
_CANONICAL_TO_BIGQUERY = {"REGEXP_LIKE": "REGEXP_CONTAINS"}


def _function_names(func: exp.Func) -> list[str]:
    """Upper-case SQL names for a function node, primary name first."""
    if isinstance(func, exp.Anonymous):
        # Functions without a typed sqlglot node keep their source spelling.
        return [(func.name or "").upper()]
    # Typed nodes (exp.Length, exp.Greatest, ...): `.name` holds the text of
    # the first argument, not the function, so use the class-level SQL names.
    names = [n.upper() for n in func.sql_names()]
    return [_CANONICAL_TO_BIGQUERY.get(n, n) for n in names]


def _walk_functions(node: exp.Expression) -> None:
    # exp.Window is not an exp.Func subclass, so it needs its own check.
    window = node.find(exp.Window)
    if window is not None:
        raise WhereValidationError(
            REJECT_UNKNOWN_FUNCTION,
            f"window function not allowed: {window.sql()}",
            detail={"function": "WINDOW"},
        )
    for func in node.find_all(exp.Func):
        # Logical connectors (AND/OR/XOR) are operators; some sqlglot versions
        # also class them as exp.Func, so skip them explicitly.
        if isinstance(func, exp.Connector):
            continue
        names = _function_names(func)
        name = names[0]
        if isinstance(func, exp.AggFunc):
            raise WhereValidationError(
                REJECT_UNKNOWN_FUNCTION,
                f"aggregate function not allowed in WHERE: {name}",
                detail={"function": name},
            )
        # Fail closed: a node is accepted only when one of its names is listed.
        if not ALLOWED_FUNCTIONS.intersection(names):
            raise WhereValidationError(
                REJECT_UNKNOWN_FUNCTION,
                f"function not in v1 allow-list: {name}",
                detail={"function": name},
            )
```

- [ ] **Step 2.4: Run tests to verify pass**

Run: `pytest tests/test_where_validator.py -v`
Expected: all previous + new TestFunctionAllowList pass.

- [ ] **Step 2.5: Commit**

```bash
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): function allow-list with explicit reject codes"
```

---

## Task 3: WHERE validator — column existence + identifier-path validation

Reject WHERE predicates that refer to columns not in the target schema. Spec §3.7 identifier-path section.

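
As a rough illustration of the rule this task enforces, the sketch below classifies an identifier path using plain strings. It is a simplified stand-in and not the validator: `classify_identifier` is a hypothetical helper, and the check added in Step 3.3 walks sqlglot `exp.Column` nodes instead of splitting text.

```python
from typing import Mapping


def classify_identifier(identifier: str, table_id: str, schema: Mapping[str, str]) -> str:
    """Classify `col` or `table.col` against a target table (hypothetical helper).

    Returns "ok", "cross_table_reference", or "unknown_column".
    """
    # Split off the leaf column; anything before the last dot is the qualifier.
    qualifier, _, leaf = identifier.rpartition(".")
    if qualifier and qualifier.lower() != table_id.lower():
        return "cross_table_reference"
    # Column names are compared case-insensitively, as in the validator.
    known = {name.lower() for name in schema}
    return "ok" if leaf.lower() in known else "unknown_column"


schema = {"event_date": "DATE", "country_code": "STRING"}
for ident in (
    "country_code",
    "web_sessions_example.country_code",
    "web_sessions_example.bogus_field",
    "other_table.id",
):
    print(ident, "->", classify_identifier(ident, "web_sessions_example", schema))
```

The qualifier is checked before the leaf name, which mirrors `_walk_structural`: a foreign qualifier is reported as a cross-table reference even when the column name is also unknown.
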
**Files:**
- Modify: `app/api/where_validator.py`
- Modify: `tests/test_where_validator.py`

- [ ] **Step 3.1: Append failing tests**

```python
# tests/test_where_validator.py (append)
# Requires REJECT_UNKNOWN_COLUMN in the import block at the top of the file.
class TestColumnExistence:
    def test_known_column_accepted(self):
        validate_where("country_code = 'CZ'", TABLE_ID, SCHEMA)

    def test_unknown_column_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where("nonexistent_field = 'X'", TABLE_ID, SCHEMA)
        assert e.value.kind == REJECT_UNKNOWN_COLUMN
        assert "nonexistent_field" in str(e.value).lower()

    def test_qualified_known_column_accepted(self):
        # Same-table qualifier is allowed
        validate_where(
            f"{TABLE_ID}.country_code = 'CZ'",
            TABLE_ID,
            SCHEMA,
        )

    def test_qualified_unknown_column_rejected(self):
        with pytest.raises(WhereValidationError) as e:
            validate_where(
                f"{TABLE_ID}.bogus_field = 'X'",
                TABLE_ID,
                SCHEMA,
            )
        assert e.value.kind == REJECT_UNKNOWN_COLUMN
```

- [ ] **Step 3.2: Run tests to verify failure**

Run: `pytest tests/test_where_validator.py::TestColumnExistence -v`
Expected: 2 passed, 2 failed. The two known-column tests already pass; the two unknown-column tests fail because the validator does not check columns yet, so nothing is raised.

- [ ] **Step 3.3: Add column-existence check**

Add new helper after `_walk_functions`:

```python
def _walk_columns(node: exp.Expression, schema: Mapping[str, str]) -> None:
    """Reject column references not present in the target table's schema."""
    known = {c.lower() for c in schema}
    for col in node.find_all(exp.Column):
        # `col.name` is the leaf column name (e.g. "country_code" in
        # "tbl.country_code"). For dotted struct fields like "rec.sub.leaf",
        # sqlglot models as nested exp.Dot; v1 only checks top-level names.
        leaf = (col.name or "").lower()
        if leaf and leaf not in known:
            raise WhereValidationError(
                REJECT_UNKNOWN_COLUMN,
                f"column {col.name!r} not in schema for {col.table!r}",
                detail={"column": col.name},
            )
```

Call from `_walk_structural` after `_walk_functions(node)`:

```python
    _walk_functions(node)
    _walk_columns(node, schema)
```

- [ ] **Step 3.4: Run tests to verify pass**

Run: `pytest tests/test_where_validator.py -v`
Expected: all pass (including the new TestColumnExistence).

- [ ] **Step 3.5: Commit**

```bash
git add app/api/where_validator.py tests/test_where_validator.py
git commit -m "feat(validator): column-existence check via target-table schema"
```

---

## Task 4: Process-local quota tracker

Spec §3.8. Per-user concurrent count + daily byte cap, in-memory. Multi-replica caveat documented in spec §9.4 — out of scope.

**Files:**
- Create: `app/api/v2_quota.py`
- Test: `tests/test_v2_quota.py`

- [ ] **Step 4.1: Write failing tests**

```python
# tests/test_v2_quota.py
"""Tests for the process-local v2 scan quota tracker (spec §3.8)."""
from datetime import datetime, timedelta, timezone

import pytest

from app.api.v2_quota import (
    QuotaTracker,
    QuotaExceededError,
    KIND_CONCURRENT,
    KIND_DAILY_BYTES,
)


def make_tracker(max_concurrent=5, max_daily_bytes=100):
    return QuotaTracker(
        max_concurrent_per_user=max_concurrent,
        max_daily_bytes_per_user=max_daily_bytes,
    )


class TestConcurrent:
    def test_acquire_within_cap_succeeds(self):
        q = make_tracker(max_concurrent=3)
        with q.acquire(user="alice"):
            with q.acquire(user="alice"):
                with q.acquire(user="alice"):
                    pass

    def test_acquire_above_cap_raises(self):
        q = make_tracker(max_concurrent=2)
        with q.acquire(user="alice"):
            with q.acquire(user="alice"):
                with pytest.raises(QuotaExceededError) as e:
                    with q.acquire(user="alice"):
                        pass
                assert e.value.kind == KIND_CONCURRENT
                assert e.value.current == 2
                assert e.value.limit == 2

    def test_release_on_context_exit(self):
        q = make_tracker(max_concurrent=1)
        with q.acquire(user="alice"):
            pass
        # Counter dropped on exit; new acquire works
        with q.acquire(user="alice"):
            pass

    def test_release_on_exception(self):
        q = make_tracker(max_concurrent=1)
        try:
            with q.acquire(user="alice"):
                raise RuntimeError("boom")
        except RuntimeError:
            pass
        with q.acquire(user="alice"):
            pass

    def test_per_user_isolation(self):
        q = make_tracker(max_concurrent=1)
        with q.acquire(user="alice"):
            with q.acquire(user="bob"):
                pass


class TestDailyBytes:
    def test_record_within_cap(self):
        q = make_tracker(max_daily_bytes=1000)
        q.record_bytes(user="alice", n=300)
        q.record_bytes(user="alice", n=400)
        assert q.bytes_used_today(user="alice") == 700

    def test_record_above_cap_raises(self):
        q = make_tracker(max_daily_bytes=1000)
        q.record_bytes(user="alice", n=600)
        with pytest.raises(QuotaExceededError) as e:
            q.record_bytes(user="alice", n=500)
        assert e.value.kind == KIND_DAILY_BYTES
        assert e.value.current == 1100  # would-be total
        assert e.value.limit == 1000

    def test_per_user_isolation(self):
        q = make_tracker(max_daily_bytes=100)
        q.record_bytes(user="alice", n=80)
        q.record_bytes(user="bob", n=80)  # bob's bucket independent
        with pytest.raises(QuotaExceededError):
            q.record_bytes(user="alice", n=30)

    def test_reset_on_utc_midnight(self, monkeypatch):
        q = make_tracker(max_daily_bytes=100)
        # Simulate the day boundary by injecting "now"
        d1 = datetime(2026, 4, 27, 23, 0, 0, tzinfo=timezone.utc)
        monkeypatch.setattr("app.api.v2_quota._utcnow", lambda: d1)
        q.record_bytes(user="alice", n=80)
        assert q.bytes_used_today(user="alice") == 80

        d2 = d1 + timedelta(hours=2)  # crosses UTC midnight
        monkeypatch.setattr("app.api.v2_quota._utcnow", lambda: d2)
        assert q.bytes_used_today(user="alice") == 0
        q.record_bytes(user="alice", n=80)  # ok, fresh bucket
```

- [ ] **Step 4.2: Run tests to verify failure**

Run: `pytest tests/test_v2_quota.py -v`
Expected: FAIL with `ModuleNotFoundError: No module named 'app.api.v2_quota'`

- [ ] **Step 4.3: Implement quota tracker**

Create `app/api/v2_quota.py`:

```python
"""Process-local quota tracker for /api/v2/scan (spec §3.8).

In-memory only. Multi-replica deployments effectively multiply caps by N
(documented caveat — see spec §9.4). Future v2 should move to durable storage
if horizontal scale is needed.
"""
from __future__ import annotations

import contextlib
import logging
import threading
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator

logger = logging.getLogger(__name__)

KIND_CONCURRENT = "concurrent_scans"
KIND_DAILY_BYTES = "daily_bytes"


@dataclass
class QuotaExceededError(Exception):
    kind: str
    current: int
    limit: int
    retry_after_seconds: int = 0

    def __str__(self) -> str:
        return f"{self.kind}: {self.current}/{self.limit}"


def _utcnow() -> datetime:  # patched in tests
    return datetime.now(timezone.utc)


def _utc_today() -> str:
    """ISO date string in UTC, used as the daily-bucket key."""
    return _utcnow().strftime("%Y-%m-%d")


class QuotaTracker:
    """Thread-safe quota state.

    Caller wraps each request in `with q.acquire(user)`, and after the BQ
    result lands records bytes via `record_bytes(user, n)`.
    """

    def __init__(self, *, max_concurrent_per_user: int, max_daily_bytes_per_user: int):
        self._max_concurrent = max_concurrent_per_user
        self._max_daily_bytes = max_daily_bytes_per_user
        self._lock = threading.Lock()
        # state: { user_id: { "concurrent": int, "bucket_day": "YYYY-MM-DD", "bytes": int } }
        self._state: dict[str, dict] = {}

    def _ensure_bucket(self, user: str) -> dict:
        today = _utc_today()
        s = self._state.setdefault(user, {"concurrent": 0, "bucket_day": today, "bytes": 0})
        if s["bucket_day"] != today:
            s["bucket_day"] = today
            s["bytes"] = 0
        return s

    @contextlib.contextmanager
    def acquire(self, user: str) -> Iterator[None]:
        with self._lock:
            s = self._ensure_bucket(user)
            if s["concurrent"] >= self._max_concurrent:
                raise QuotaExceededError(
                    kind=KIND_CONCURRENT,
                    current=s["concurrent"],
                    limit=self._max_concurrent,
                )
            s["concurrent"] += 1
        try:
            yield
        finally:
            with self._lock:
                s = self._ensure_bucket(user)
                s["concurrent"] = max(0, s["concurrent"] - 1)

    def record_bytes(self, user: str, n: int) -> None:
        if n <= 0:
            return
        with self._lock:
            s = self._ensure_bucket(user)
            new_total = s["bytes"] + n
            if new_total > self._max_daily_bytes:
                # Surface the would-be total so caller can include it in 429 body.
                raise QuotaExceededError(
                    kind=KIND_DAILY_BYTES,
                    current=new_total,
                    limit=self._max_daily_bytes,
                    retry_after_seconds=_seconds_until_utc_midnight(),
                )
            s["bytes"] = new_total

    def bytes_used_today(self, user: str) -> int:
        with self._lock:
            return self._ensure_bucket(user)["bytes"]


def _seconds_until_utc_midnight() -> int:
    from datetime import timedelta

    now = _utcnow()
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Next midnight = today's midnight + 1 day
    next_midnight = midnight + timedelta(days=1)
    return int((next_midnight - now).total_seconds())
```

- [ ] **Step 4.4: Run tests to verify pass**

Run: `pytest tests/test_v2_quota.py -v`
Expected: 9 passed.

- [ ] **Step 4.5: Commit**

```bash
git add app/api/v2_quota.py tests/test_v2_quota.py
git commit -m "feat(v2): process-local quota tracker (concurrent + daily bytes)"
```

---

## Task 5: LRU+TTL cache helper

Used by catalog/schema/sample endpoints. Spec §3.6.

**Files:**
- Create: `app/api/v2_cache.py`
- Test: `tests/test_v2_cache.py`

- [ ] **Step 5.1: Write failing tests**

```python
# tests/test_v2_cache.py
import pytest
import time

from app.api.v2_cache import TTLCache


class TestTTLCache:
    def test_set_get(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("k", "v")
        assert c.get("k") == "v"

    def test_get_missing_returns_default(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        assert c.get("missing") is None
        assert c.get("missing", default="x") == "x"

    def test_expiry(self, monkeypatch):
        now = [1000.0]
        monkeypatch.setattr("app.api.v2_cache._now", lambda: now[0])
        c = TTLCache(maxsize=10, ttl_seconds=10)
        c.set("k", "v")
        assert c.get("k") == "v"
        now[0] += 11
        assert c.get("k") is None  # expired

    def test_lru_eviction(self):
        c = TTLCache(maxsize=2, ttl_seconds=60)
        c.set("a", 1)
        c.set("b", 2)
        c.set("c", 3)  # should evict 'a' (LRU)
        assert c.get("a") is None
        assert c.get("b") == 2
        assert c.get("c") == 3

    def test_invalidate(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("k", "v")
        c.invalidate("k")
        assert c.get("k") is None

    def test_clear(self):
        c = TTLCache(maxsize=10, ttl_seconds=60)
        c.set("a", 1)
        c.set("b", 2)
        c.clear()
        assert c.get("a") is None
        assert c.get("b") is None
```

- [ ] **Step 5.2: Run tests to verify failure**

Run: `pytest tests/test_v2_cache.py -v`
Expected: FAIL — module doesn't exist.

- [ ] **Step 5.3: Implement TTLCache**

Create `app/api/v2_cache.py`:

```python
"""Simple thread-safe LRU + TTL cache for v2 endpoints."""
from __future__ import annotations

import threading
import time
from collections import OrderedDict
from typing import Any


def _now() -> float:  # patched in tests
    return time.monotonic()


class TTLCache:
    def __init__(self, *, maxsize: int, ttl_seconds: float):
        self._max = maxsize
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._data: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            entry = self._data.get(key)
            if entry is None:
                return default
            expiry, value = entry
            if _now() > expiry:
                del self._data[key]
                return default
            self._data.move_to_end(key)  # mark as recently used
            return value

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            expiry = _now() + self._ttl
            if key in self._data:
                self._data.move_to_end(key)
            self._data[key] = (expiry, value)
            while len(self._data) > self._max:
                self._data.popitem(last=False)

    def invalidate(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)

    def clear(self) -> None:
        with self._lock:
            self._data.clear()
```

- [ ] **Step 5.4: Run tests to verify pass**

Run: `pytest tests/test_v2_cache.py -v`
Expected: 6 passed.

- [ ] **Step 5.5: Commit**

```bash
git add app/api/v2_cache.py tests/test_v2_cache.py
git commit -m "feat(v2): TTLCache helper (LRU + TTL, thread-safe)"
```

---

## Task 6: Arrow IPC streaming response helper

Spec §3.4 step 9. Used by `/api/v2/scan`. Wraps a pyarrow Table or RecordBatchReader as an HTTP streaming response.

**Files:**
- Create: `app/api/v2_arrow.py`
- Test: `tests/test_v2_arrow.py`

- [ ] **Step 6.1: Write failing tests**

```python
# tests/test_v2_arrow.py
import io

import pyarrow as pa
import pytest

from app.api.v2_arrow import arrow_table_to_ipc_bytes, parse_ipc_bytes


def test_round_trip_simple_table():
    src = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    body = arrow_table_to_ipc_bytes(src)
    assert isinstance(body, bytes) and len(body) > 0
    got = parse_ipc_bytes(body)
    assert got.equals(src)


def test_empty_table_round_trip():
    src = pa.table({"a": pa.array([], type=pa.int64())})
    body = arrow_table_to_ipc_bytes(src)
    got = parse_ipc_bytes(body)
    assert got.num_rows == 0
    assert got.schema.equals(src.schema)
```

- [ ] **Step 6.2: Run tests to verify failure**

Run: `pytest tests/test_v2_arrow.py -v`
Expected: FAIL — module doesn't exist.

- [ ] **Step 6.3: Implement Arrow helpers**

Create `app/api/v2_arrow.py`:

```python
"""Arrow IPC serialization helpers for /api/v2/scan responses.

Server side serializes a pyarrow.Table to IPC stream bytes; client side
deserializes back. Content-Type is `application/vnd.apache.arrow.stream`.
"""
from __future__ import annotations

import io

import pyarrow as pa

CONTENT_TYPE = "application/vnd.apache.arrow.stream"


def arrow_table_to_ipc_bytes(table: pa.Table) -> bytes:
    """Serialize a pyarrow.Table to Arrow IPC stream bytes."""
    sink = io.BytesIO()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        for batch in table.to_batches():
            writer.write_batch(batch)
    return sink.getvalue()


def parse_ipc_bytes(data: bytes) -> pa.Table:
    """Deserialize Arrow IPC stream bytes to a pyarrow.Table."""
    reader = pa.ipc.open_stream(io.BytesIO(data))
    return reader.read_all()
```

- [ ] **Step 6.4: Run tests to verify pass**

Run: `pytest tests/test_v2_arrow.py -v`
Expected: 2 passed.

- [ ] **Step 6.5: Commit**

```bash
git add app/api/v2_arrow.py tests/test_v2_arrow.py
git commit -m "feat(v2): Arrow IPC serialization helpers"
```

---

## Task 7: `GET /api/v2/catalog`

Spec §3.1. Lists tables visible to user (RBAC-filtered) with metadata.

**Files:**
- Create: `app/api/v2_catalog.py`
- Modify: `app/main.py`
- Test: `tests/test_v2_catalog.py`

- [ ] **Step 7.1: Write failing tests**

```python
# tests/test_v2_catalog.py
import importlib

import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed_two_tables(conn):
    from src.repositories.table_registry import TableRegistryRepository
    repo = TableRegistryRepository(conn)
    repo.register(
        id="orders", name="orders", source_type="keboola",
        bucket="sales", source_table="orders", query_mode="local",
        is_public=True,
    )
    repo.register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestCatalogShape:
    def test_admin_sees_both_tables(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            ids = {t["id"] for t in data["tables"]}
            assert {"orders", "bq_view"} <= ids
        finally:
            conn.close()

    def test_local_table_has_duckdb_flavor(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            row = next(t for t in data["tables"] if t["id"] == "orders")
            assert row["sql_flavor"] == "duckdb"
            assert row["query_mode"] == "local"
        finally:
            conn.close()

    def test_bq_table_has_bigquery_flavor(self, reload_db):
        from app.api.v2_catalog import build_catalog
        conn = reload_db.get_system_db()
        try:
            _seed_two_tables(conn)
            admin = {"role": "admin", "email": "a@x.com"}
            data = build_catalog(conn, admin)
            row = next(t for t in data["tables"] if t["id"] == "bq_view")
            assert row["sql_flavor"] == "bigquery"
            assert row["query_mode"] == "remote"
            assert "where_examples" in row
            assert "fetch_via" in row
        finally:
            conn.close()
```

- [ ] **Step 7.2: Run tests to verify failure**

Run: `pytest tests/test_v2_catalog.py -v`
Expected: FAIL — module doesn't exist.

- [ ] **Step 7.3: Implement catalog endpoint**

Create `app/api/v2_catalog.py`:

```python
"""GET /api/v2/catalog — list tables visible to caller (spec §3.1)."""
from __future__ import annotations

from datetime import datetime, timezone

from fastapi import APIRouter, Depends
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

router = APIRouter(prefix="/api/v2", tags=["v2"])

_catalog_cache = TTLCache(maxsize=1024, ttl_seconds=300)  # per-user, 5 min


def _flavor_for(source_type: str) -> str:
    return "bigquery" if source_type == "bigquery" else "duckdb"


def _examples_for(source_type: str) -> list[str]:
    if source_type == "bigquery":
        return [
            "event_date > DATE '2026-01-01'",
            "country_code = 'CZ' AND platform = 'web'",
        ]
    return []


def _fetch_hint(table_id: str, source_type: str) -> str:
    if source_type == "bigquery":
        return f"da fetch {table_id} --select --where '' --limit "
    return "already local — query directly via `da query`"


def build_catalog(conn: duckdb.DuckDBPyConnection, user: dict) -> dict:
    cache_key = f"{user.get('email', '?')}|catalog"
    cached = _catalog_cache.get(cache_key)
    if cached is not None:
        return cached

    repo = TableRegistryRepository(conn)
    rows = repo.list_all()
    visible = []
    for r in rows:
        if user.get("role") != "admin" and not can_access_table(user, r["id"], conn):
            continue
        visible.append({
            "id": r["id"],
            "name": r.get("name") or r["id"],
            "description": r.get("description") or "",
            "source_type": r.get("source_type") or "",
            "query_mode": r.get("query_mode") or "local",
            "sql_flavor": _flavor_for(r.get("source_type") or ""),
            "where_examples": _examples_for(r.get("source_type") or ""),
            "fetch_via": _fetch_hint(r["id"], r.get("source_type") or ""),
            "rough_size_hint": None,  # populated by Task 8 schema endpoint when called
        })
    payload = {
        "tables": visible,
        "server_time": datetime.now(timezone.utc).isoformat(),
    }
    _catalog_cache.set(cache_key, payload)
    return payload


@router.get("/catalog")
async def catalog(
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    return build_catalog(conn, user)
```

- [ ] **Step 7.4: Mount in `app/main.py`**

Find the `app.include_router(...)` block (around line 279-287). Add:

```python
from app.api.v2_catalog import router as v2_catalog_router
app.include_router(v2_catalog_router)
```

- [ ] **Step 7.5: Run tests to verify pass**

Run: `pytest tests/test_v2_catalog.py -v`
Expected: 3 passed.

- [ ] **Step 7.6: Commit**

```bash
git add app/api/v2_catalog.py app/main.py tests/test_v2_catalog.py
git commit -m "feat(v2): GET /api/v2/catalog — RBAC-filtered table list"
```

---

## Task 8: `GET /api/v2/schema/{table_id}`

Spec §3.2. Column metadata + BQ flavor hints.

**Files:**
- Create: `app/api/v2_schema.py`
- Modify: `app/main.py`
- Test: `tests/test_v2_schema.py`

- [ ] **Step 8.1: Write failing tests**

```python
# tests/test_v2_schema.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed_bq_table(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestSchemaEndpoint:
    def test_bq_table_returns_columns_and_dialect_hints(self, reload_db, monkeypatch):
        from app.api import v2_schema
        # Stub the BQ schema fetch to avoid hitting real BQ
        monkeypatch.setattr(
            v2_schema, "_fetch_bq_schema",
            lambda project, dataset, table: [
                {"name": "event_date", "type": "DATE", "nullable": False, "description": ""},
                {"name": "country_code", "type": "STRING", "nullable": True, "description": ""},
            ],
        )
        monkeypatch.setattr(v2_schema, "_fetch_bq_table_options",
                            lambda *a: {"partition_by": "event_date", "clustered_by": []})
        conn = reload_db.get_system_db()
        try:
            _seed_bq_table(conn)
            user = {"role": "admin", "email": "a@x.com"}
            data = v2_schema.build_schema(conn, user, "bq_view", project_id="my-proj")
        finally:
            conn.close()
        assert data["table_id"] == "bq_view"
        assert data["sql_flavor"] == "bigquery"
        assert {c["name"] for c in data["columns"]} == {"event_date", "country_code"}
        assert "where_dialect_hints" in data
        assert data["partition_by"] == "event_date"

    def test_unknown_table_raises_404(self, reload_db):
        from app.api.v2_schema import build_schema, NotFound
        conn = reload_db.get_system_db()
        try:
            user = {"role": "admin", "email": "a@x.com"}
            with pytest.raises(NotFound):
                build_schema(conn, user, "missing", project_id="my-proj")
        finally:
            conn.close()
```

- [ ] **Step 8.2: Run tests to verify failure**

Run: `pytest tests/test_v2_schema.py -v`
Expected: FAIL — module doesn't exist.

- [ ] **Step 8.3: Implement schema endpoint**

Create `app/api/v2_schema.py`:

```python
"""GET /api/v2/schema/{table_id} — table column metadata (spec §3.2)."""
from __future__ import annotations

import logging

from fastapi import APIRouter, Depends, HTTPException
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])

_schema_cache = TTLCache(maxsize=512, ttl_seconds=3600)


class NotFound(Exception):
    pass


_BQ_DIALECT_HINTS = {
    "date_literal": "DATE '2026-01-01'",
    "timestamp_literal": "TIMESTAMP '2026-01-01 00:00:00 UTC'",
    "interval_subtract": "DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)",
    "regex": "REGEXP_CONTAINS(field, r'pattern')",
    "cast": "CAST(x AS INT64)",
}


def _bq_string_literal(value: str) -> str:
    """Quote a value as a BigQuery string literal (backslash escaping)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def _fetch_bq_schema(project: str, dataset: str, table: str) -> list[dict]:
    """Fetch column list via INFORMATION_SCHEMA using the DuckDB BQ extension."""
    from connectors.bigquery.auth import get_metadata_token
    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        # Assumption: bigquery_query() takes (project, sql) and offers no bind
        # parameters for the inner BQ statement, so the table name is inlined
        # as an escaped literal. Column descriptions are exposed by
        # COLUMN_FIELD_PATHS rather than COLUMNS, hence the join.
        info_schema = f"{project}.{dataset}.INFORMATION_SCHEMA"
        bq_sql = (
            f"SELECT c.column_name, c.data_type, c.is_nullable, p.description "
            f"FROM `{info_schema}.COLUMNS` AS c "
            f"LEFT JOIN `{info_schema}.COLUMN_FIELD_PATHS` AS p "
            f"ON p.table_name = c.table_name AND p.field_path = c.column_name "
            f"WHERE c.table_name = {_bq_string_literal(table)} "
            f"ORDER BY c.ordinal_position"
        )
        rows = conn.execute(
            "SELECT * FROM bigquery_query(?, ?)",
            [project, bq_sql],
        ).fetchall()
        return [
            {
                "name": r[0],
                "type": r[1],
                "nullable": r[2] == "YES",
                "description": r[3] or "",
            }
            for r in rows
        ]
    finally:
        conn.close()


def _fetch_bq_table_options(project: str, dataset: str, table: str) -> dict:
    """Best-effort fetch of partition/cluster info; returns empty dict on miss."""
    from connectors.bigquery.auth import get_metadata_token
    try:
        token = get_metadata_token()
        conn = duckdb.connect(":memory:")
        try:
            conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
            escaped = token.replace("'", "''")
            conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
            # Partitioning and clustering are reported per column in
            # INFORMATION_SCHEMA.COLUMNS, not in INFORMATION_SCHEMA.TABLES.
            bq_sql = (
                f"SELECT column_name, is_partitioning_column, clustering_ordinal_position "
                f"FROM `{project}.{dataset}.INFORMATION_SCHEMA.COLUMNS` "
                f"WHERE table_name = {_bq_string_literal(table)}"
            )
            rows = conn.execute(
                "SELECT * FROM bigquery_query(?, ?)",
                [project, bq_sql],
            ).fetchall()
            if not rows:
                return {}
            partition_by = next((r[0] for r in rows if r[1] == "YES"), None)
            clustered = sorted((r[2], r[0]) for r in rows if r[2] is not None)
            return {
                "partition_by": partition_by,
                "clustered_by": [name for _, name in clustered],
            }
        finally:
            conn.close()
    except Exception as e:
        logger.warning("BQ table options fetch failed for %s.%s.%s: %s", project, dataset, table, e)
        return {}


def build_schema(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    table_id: str,
    *,
    project_id: str,
) -> dict:
    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise NotFound(table_id)

    if user.get("role") != "admin" and not can_access_table(user, table_id, conn):
        raise PermissionError(table_id)

    # The cache is keyed by table only, so it is consulted after the RBAC
    # check; otherwise a cached payload would be served to unauthorized callers.
    cache_key = f"{table_id}"
    cached = _schema_cache.get(cache_key)
    if cached is not None:
        return cached

    source_type = row.get("source_type") or ""
    if source_type == "bigquery":
        dataset = row.get("bucket") or ""
        source_table = row.get("source_table") or table_id
        columns = _fetch_bq_schema(project_id, dataset, source_table)
        opts = _fetch_bq_table_options(project_id, dataset, source_table)
        payload = {
            "table_id": table_id,
            "source_type": source_type,
            "sql_flavor": "bigquery",
            "columns": columns,
            "partition_by": opts.get("partition_by"),
            "clustered_by": opts.get("clustered_by", []),
            "where_dialect_hints": _BQ_DIALECT_HINTS,
        }
    else:
        # Local source — read schema from the parquet via DuckDB
        from app.utils import get_data_dir
        parquet = (
            get_data_dir() / "extracts" / source_type / "data" / f"{table_id}.parquet"
        )
        local_conn = duckdb.connect(":memory:")
        try:
            cols = local_conn.execute(
                "DESCRIBE SELECT * FROM read_parquet(?)", [str(parquet)]
            ).fetchall()
        finally:
            local_conn.close()
        payload = {
            "table_id": table_id,
            "source_type": source_type,
            "sql_flavor": "duckdb",
            "columns": [
                {"name": c[0], "type": c[1], "nullable": c[2] == "YES", "description": ""}
                for c in cols
            ],
            "partition_by": None,
            "clustered_by": [],
            "where_dialect_hints": {},
        }

    _schema_cache.set(cache_key, payload)
    return payload


@router.get("/schema/{table_id}")
async def schema(
    table_id: str,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return build_schema(conn, user, table_id, project_id=project_id)
    except NotFound:
        raise HTTPException(status_code=404, detail=f"table {table_id!r} not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized for this table")
```

- [ ] **Step 8.4: Mount in `app/main.py`**

Add alongside the catalog router:

```python
from app.api.v2_schema import router as v2_schema_router
app.include_router(v2_schema_router)
```

- [ ] **Step 8.5: Run tests to verify pass**

Run: `pytest tests/test_v2_schema.py -v`
Expected: 2 passed.

- [ ] **Step 8.6: Commit**

```bash
git add app/api/v2_schema.py app/main.py tests/test_v2_schema.py
git commit -m "feat(v2): GET /api/v2/schema/{table_id} — column metadata + BQ hints"
```

---

## Task 9: `GET /api/v2/sample/{table_id}`

Spec §3.3. Returns N sample rows.
**Files:**
- Create: `app/api/v2_sample.py`
- Modify: `app/main.py`
- Test: `tests/test_v2_sample.py`

- [ ] **Step 9.1: Write failing tests**

```python
# tests/test_v2_sample.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestSampleEndpoint:
    def test_returns_n_rows_for_bq_table(self, reload_db, monkeypatch):
        from app.api import v2_sample
        monkeypatch.setattr(
            v2_sample, "_fetch_bq_sample",
            lambda project, dataset, table, n: [
                {"event_date": "2026-04-27", "country_code": "CZ"},
                {"event_date": "2026-04-26", "country_code": "SK"},
            ],
        )
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            data = v2_sample.build_sample(conn, user, "bq_view", n=2, project_id="proj")
        finally:
            conn.close()
        assert data["table_id"] == "bq_view"
        assert len(data["rows"]) == 2

    def test_caps_n_at_100(self, reload_db, monkeypatch):
        from app.api import v2_sample
        captured = {}

        def fake_fetch(project, dataset, table, n):
            captured["n"] = n
            return []

        monkeypatch.setattr(v2_sample, "_fetch_bq_sample", fake_fetch)
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            v2_sample.build_sample(conn, user, "bq_view", n=999, project_id="proj")
        finally:
            conn.close()
        assert captured["n"] == 100
```

- [ ] **Step 9.2: Run tests to verify failure**

Run: `pytest tests/test_v2_sample.py -v`
Expected: FAIL — module doesn't exist.
- [ ] **Step 9.3: Implement sample endpoint**

Create `app/api/v2_sample.py`:

```python
"""GET /api/v2/sample/{table_id}?n=5 — sample rows (spec §3.3)."""
from __future__ import annotations

import logging

from fastapi import APIRouter, Depends, HTTPException, Query
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.v2_cache import TTLCache

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])

_sample_cache = TTLCache(maxsize=512, ttl_seconds=3600)
_MAX_N = 100


def _fetch_bq_sample(project: str, dataset: str, table: str, n: int) -> list[dict]:
    from connectors.bigquery.auth import get_metadata_token
    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        bq_sql = f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(n)}"
        df = conn.execute(
            "SELECT * FROM bigquery_query(?, ?)",
            [project, bq_sql],
        ).fetchdf()
        return df.to_dict(orient="records")
    finally:
        conn.close()


def build_sample(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    table_id: str,
    *,
    n: int,
    project_id: str,
) -> dict:
    n = max(1, min(int(n), _MAX_N))

    repo = TableRegistryRepository(conn)
    row = repo.get(table_id)
    if not row:
        raise FileNotFoundError(table_id)
    if user.get("role") != "admin" and not can_access_table(user, table_id, conn):
        raise PermissionError(table_id)

    # The cache key carries no user identity, so the lookup happens only
    # after the RBAC check has passed; cached rows must not bypass it.
    cache_key = f"{table_id}|{n}"
    cached = _sample_cache.get(cache_key)
    if cached is not None:
        return cached

    source_type = row.get("source_type") or ""
    if source_type == "bigquery":
        rows = _fetch_bq_sample(project_id, row.get("bucket") or "",
                                row.get("source_table") or table_id, n)
    else:
        from app.utils import get_data_dir
        parquet = get_data_dir() / "extracts" / source_type / "data" / f"{table_id}.parquet"
        c = duckdb.connect(":memory:")
        try:
            df = c.execute(
                f"SELECT * FROM read_parquet(?) LIMIT {n}",
                [str(parquet)],
            ).fetchdf()
            rows = df.to_dict(orient="records")
        finally:
            c.close()

    payload = {"table_id": table_id, "rows": rows, "source": source_type}
    _sample_cache.set(cache_key, payload)
    return payload


@router.get("/sample/{table_id}")
async def sample(
    table_id: str,
    n: int = Query(default=5, ge=1, le=_MAX_N),
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return build_sample(conn, user, table_id, n=n, project_id=project_id)
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail=f"table {table_id!r} not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized for this table")
```

- [ ] **Step 9.4: Mount + run tests**

Add `from app.api.v2_sample import router as v2_sample_router; app.include_router(v2_sample_router)` in `app/main.py`.

Run: `pytest tests/test_v2_sample.py -v`. Expected: 2 passed.

- [ ] **Step 9.5: Commit**

```bash
git add app/api/v2_sample.py app/main.py tests/test_v2_sample.py
git commit -m "feat(v2): GET /api/v2/sample/{table_id} — N sample rows"
```

---

## Task 10: `POST /api/v2/scan/estimate`

Spec §3.5. Uses a BQ dry run (`dryRun`) to estimate scan cost without executing the query.
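
As a rough illustration of the cost arithmetic behind this endpoint, the sketch below converts the byte count reported by a dry run into an approximate on-demand cost. The function name `estimate_cost_usd` and the 5.0 USD per TiB default are assumptions for illustration only; in the plan the rate comes from the `bq_cost_per_tb_usd` config value.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def estimate_cost_usd(scan_bytes: int, cost_per_tib_usd: float = 5.0) -> float:
    """Convert dry-run bytes into an approximate on-demand cost (hypothetical helper)."""
    return round((scan_bytes / TIB) * cost_per_tib_usd, 4)


# About 4.4 GB scanned is roughly 0.004 TiB, i.e. around two cents at 5 USD/TiB.
print(estimate_cost_usd(4_400_000_000))
```

Rounding to four decimals keeps small scans from displaying as zero cost while avoiding false precision.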
**Files:**
- Create: `app/api/v2_scan.py` (will be extended in Task 11 for `/scan` proper)
- Modify: `app/main.py`
- Test: `tests/test_v2_scan_estimate.py`

- [ ] **Step 10.1: Write failing test**

```python
# tests/test_v2_scan_estimate.py
import importlib
import pytest


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestScanEstimate:
    def test_returns_scan_bytes_for_bq(self, reload_db, monkeypatch):
        from app.api import v2_scan
        monkeypatch.setattr(
            v2_scan, "_bq_dry_run_bytes",
            lambda project, sql: 4_400_000_000,
        )
        # Stub the schema fetch the validator uses
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
        )
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "select": ["event_date", "country_code"],
                "where": "event_date > DATE '2026-01-01'",
                "limit": 1000000,
            }
            data = v2_scan.estimate(conn, user, req, project_id="proj")
        finally:
            conn.close()
        assert data["estimated_scan_bytes"] == 4_400_000_000
        assert "estimated_result_rows" in data
        assert "bq_cost_estimate_usd" in data
```

- [ ] **Step 10.2: Run test to verify failure**

Run: `pytest tests/test_v2_scan_estimate.py -v`
Expected: FAIL — module doesn't exist.
- [ ] **Step 10.3: Implement estimate endpoint**

Create `app/api/v2_scan.py`:

```python
"""POST /api/v2/scan and POST /api/v2/scan/estimate (spec §3.4 + §3.5)."""
from __future__ import annotations

import logging
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel, Field
import duckdb

from app.auth.dependencies import get_current_user, _get_db
from app.instance_config import get_value
from src.rbac import can_access_table
from src.repositories.table_registry import TableRegistryRepository
from app.api.where_validator import (
    validate_where, WhereValidationError,
)
from app.api.v2_schema import build_schema  # reused for column resolution

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/v2", tags=["v2"])


class ScanRequest(BaseModel):
    table_id: str
    select: Optional[list[str]] = None
    where: Optional[str] = None
    limit: Optional[int] = Field(default=None, ge=1)
    order_by: Optional[list[str]] = None


def _resolve_schema(conn, user, table_id: str, project_id: str) -> dict:
    """Get {column: type} dict for the target table — used by validator + projection check."""
    s = build_schema(conn, user, table_id, project_id=project_id)
    return {c["name"]: c["type"] for c in s.get("columns", [])}


def _bq_dry_run_bytes(project: str, sql: str) -> int:
    """Run a BQ dry-run via the google-cloud-bigquery client and return totalBytesProcessed."""
    from google.cloud import bigquery
    from google.api_core.client_options import ClientOptions
    client = bigquery.Client(
        project=project,
        client_options=ClientOptions(quota_project_id=project),
    )
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    return int(job.total_bytes_processed or 0)


def _build_bq_sql(table_row: dict, project_id: str, req: ScanRequest) -> str:
    # Callers must validate select / where / order_by before building SQL:
    # every part of the request is interpolated into the statement verbatim.
    select_sql = ", ".join(req.select) if req.select else "*"
    table_ref = f"`{project_id}.{table_row.get('bucket') or ''}.{table_row.get('source_table') or req.table_id}`"
    sql = f"SELECT {select_sql} FROM {table_ref}"
    if req.where:
        sql += f" WHERE {req.where}"
    if req.order_by:
        sql += f" ORDER BY {', '.join(req.order_by)}"
    if req.limit:
        sql += f" LIMIT {int(req.limit)}"
    return sql


def estimate(conn, user, raw_request: dict, *, project_id: str) -> dict:
    req = ScanRequest(**raw_request)
    repo = TableRegistryRepository(conn)
    row = repo.get(req.table_id)
    if not row:
        raise FileNotFoundError(req.table_id)
    if user.get("role") != "admin" and not can_access_table(user, req.table_id, conn):
        raise PermissionError(req.table_id)

    schema = _resolve_schema(conn, user, req.table_id, project_id)

    # Validate WHERE first
    if req.where:
        validate_where(req.where, req.table_id, schema)

    # Validate select columns exist
    if req.select:
        unknown = [c for c in req.select if c not in schema]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")

    # Validate ORDER BY terms: a known column, optionally followed by ASC/DESC.
    # Without this check order_by would reach the SQL string unvalidated.
    for term in req.order_by or []:
        parts = term.split()
        if not parts or parts[0] not in schema or len(parts) > 2 or (
            len(parts) == 2 and parts[1].upper() not in ("ASC", "DESC")
        ):
            raise ValueError(f"invalid order_by term: {term!r}")

    if (row.get("source_type") or "") != "bigquery":
        return {
            "table_id": req.table_id,
            "estimated_scan_bytes": 0,
            "estimated_result_rows": None,
            "estimated_result_bytes": None,
            "bq_cost_estimate_usd": 0.0,
        }

    bq_sql = _build_bq_sql(row, project_id, req)
    scan_bytes = _bq_dry_run_bytes(project_id, bq_sql)
    cost_per_tb = float(get_value("api", "scan", "bq_cost_per_tb_usd", default=5.0) or 5.0)
    cost = (scan_bytes / 1_099_511_627_776) * cost_per_tb  # 1 TiB = 2^40

    # Heuristic for result row/byte estimate: approximate the row width as the
    # summed per-type widths of the projected columns (all columns if no select).
    projected = req.select or list(schema)
    avg_row_bytes = max(1, sum(_avg_bytes_for_type(schema[c]) for c in projected))
    rows_est = scan_bytes // avg_row_bytes
    if req.limit:
        rows_est = min(rows_est, req.limit)

    return {
        "table_id": req.table_id,
        "estimated_scan_bytes": int(scan_bytes),
        "estimated_result_rows": int(rows_est),
        "estimated_result_bytes": int(rows_est * avg_row_bytes),
        "bq_cost_estimate_usd": round(cost, 4),
    }


def _avg_bytes_for_type(t: str) -> int:
    t = (t or "").upper()
    if t in ("INT64", "FLOAT64", "DATE", "TIMESTAMP", "DATETIME", "TIME"):
        return 8
    if t == "STRING":
        return 32  # rough average
    if t == "BYTES":
        return 64
    if t == "BOOL":
        return 1
    return 16


@router.post("/scan/estimate")
async def scan_estimate_endpoint(
    raw: dict,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    try:
        return estimate(conn, user, raw, project_id=project_id)
    except WhereValidationError as e:
        raise HTTPException(status_code=400, detail={"error": "validator_rejected", "kind": e.kind, "details": e.detail or {}})
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized")
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail="table not found")
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
```

- [ ] **Step 10.4: Mount + run test**

Add `from app.api.v2_scan import router as v2_scan_router; app.include_router(v2_scan_router)` in `app/main.py`.

Run: `pytest tests/test_v2_scan_estimate.py -v`. Expected: 1 passed.

- [ ] **Step 10.5: Commit**

```bash
git add app/api/v2_scan.py app/main.py tests/test_v2_scan_estimate.py
git commit -m "feat(v2): POST /api/v2/scan/estimate via BQ dryRun"
```

---

## Task 11: `POST /api/v2/scan` — full pipeline

Spec §3.4. Combines the WHERE validator, RBAC, quota, the `max_result_bytes` guard, and Arrow IPC streaming.
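
To make the `max_result_bytes` guard concrete, here is a minimal sketch of one way to decide how many leading rows to keep when a serialized result is too large. The helper name `rows_that_fit` is hypothetical, and the sketch assumes payload size grows roughly linearly with row count, which ignores the fixed schema overhead of an Arrow IPC stream.

```python
def rows_that_fit(total_bytes: int, row_count: int, max_bytes: int) -> int:
    """Return how many leading rows to keep so the payload stays near max_bytes."""
    if total_bytes <= max_bytes:
        return row_count  # nothing to trim
    # Average serialized bytes per row, guarded against division by zero.
    avg_row_bytes = max(1, total_bytes // max(row_count, 1))
    return min(row_count, max_bytes // avg_row_bytes)


# 10 rows serialized to 1,000 bytes with a 450-byte cap: about 100 bytes per row.
print(rows_that_fit(total_bytes=1_000, row_count=10, max_bytes=450))
```

Because the estimate is approximate, a caller relying on a hard cap would want to re-serialize the trimmed table and check its size again.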
**Files:**
- Modify: `app/api/v2_scan.py` (extend with `/scan` endpoint)
- Test: `tests/test_v2_scan.py`

- [ ] **Step 11.1: Write failing tests**

```python
# tests/test_v2_scan.py
import importlib
from unittest.mock import MagicMock
import pyarrow as pa
import pytest

from app.api.v2_arrow import parse_ipc_bytes


@pytest.fixture
def reload_db(tmp_path, monkeypatch):
    monkeypatch.setenv("DATA_DIR", str(tmp_path))
    import src.db as db_module
    importlib.reload(db_module)
    yield db_module


def _seed(conn):
    from src.repositories.table_registry import TableRegistryRepository
    TableRegistryRepository(conn).register(
        id="bq_view", name="bq_view", source_type="bigquery",
        bucket="ds", source_table="bq_view", query_mode="remote",
        is_public=True,
    )


class TestScan:
    def test_returns_arrow_ipc_for_simple_request(self, reload_db, monkeypatch):
        from app.api import v2_scan
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE", "country_code": "STRING"},
        )
        fake_table = pa.table(
            {"event_date": ["2026-04-27"], "country_code": ["CZ"]}
        )
        monkeypatch.setattr(
            v2_scan, "_run_bq_scan",
            lambda *a, **kw: fake_table,
        )
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "select": ["event_date", "country_code"],
                "where": "event_date > DATE '2026-01-01'",
                "limit": 100,
            }
            tracker = v2_scan._build_quota_tracker()
            ipc_bytes = v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
        finally:
            conn.close()
        got = parse_ipc_bytes(ipc_bytes)
        assert got.num_rows == 1
        assert got.column_names == ["event_date", "country_code"]

    def test_quota_concurrent_exceeded_raises_429(self, reload_db, monkeypatch):
        from app.api import v2_scan
        from app.api.v2_quota import QuotaTracker, QuotaExceededError, KIND_CONCURRENT
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE"},
        )
        fake_table = pa.table({"event_date": ["2026-04-27"]})
        monkeypatch.setattr(v2_scan, "_run_bq_scan", lambda *a, **kw: fake_table)
        tracker = QuotaTracker(max_concurrent_per_user=1, max_daily_bytes_per_user=10**12)
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {"table_id": "bq_view", "select": ["event_date"], "limit": 1}
            # Hold one concurrent slot
            with tracker.acquire(user="a@x.com"):
                with pytest.raises(QuotaExceededError) as e:
                    v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
                assert e.value.kind == KIND_CONCURRENT
        finally:
            conn.close()

    def test_validator_rejection_propagates(self, reload_db, monkeypatch):
        from app.api import v2_scan
        from app.api.where_validator import WhereValidationError, REJECT_UNKNOWN_FUNCTION
        monkeypatch.setattr(
            v2_scan, "_resolve_schema",
            lambda *a, **kw: {"event_date": "DATE"},
        )
        tracker = v2_scan._build_quota_tracker()
        conn = reload_db.get_system_db()
        try:
            _seed(conn)
            user = {"role": "admin", "email": "a@x.com"}
            req = {
                "table_id": "bq_view",
                "where": "event_date = NUKE_FN()",
            }
            with pytest.raises(WhereValidationError) as e:
                v2_scan.run_scan(conn, user, req, project_id="proj", quota=tracker)
            assert e.value.kind == REJECT_UNKNOWN_FUNCTION
        finally:
            conn.close()
```

- [ ] **Step 11.2: Run tests to verify failure**

Run: `pytest tests/test_v2_scan.py -v`
Expected: FAIL — `run_scan` and `_run_bq_scan` don't exist yet.
- [ ] **Step 11.3: Extend `app/api/v2_scan.py`**

Append to the file:

```python
import pyarrow as pa
from app.api.v2_arrow import arrow_table_to_ipc_bytes, CONTENT_TYPE
from app.api.v2_quota import QuotaTracker, QuotaExceededError
from fastapi.responses import Response

# Module-level singleton (process-local quota state per spec §3.8)
_quota_singleton: QuotaTracker | None = None


def _build_quota_tracker() -> QuotaTracker:
    """Returns or constructs the process-local quota tracker."""
    global _quota_singleton
    if _quota_singleton is None:
        _quota_singleton = QuotaTracker(
            max_concurrent_per_user=int(get_value("api", "scan", "max_concurrent_per_user", default=5) or 5),
            max_daily_bytes_per_user=int(get_value("api", "scan", "max_daily_bytes_per_user", default=53687091200) or 53687091200),
        )
    return _quota_singleton


def _max_result_bytes() -> int:
    return int(get_value("api", "scan", "max_result_bytes", default=2_147_483_648) or 2_147_483_648)


def _max_limit() -> int:
    return int(get_value("api", "scan", "max_limit", default=10_000_000) or 10_000_000)


def _run_bq_scan(project: str, sql: str) -> pa.Table:
    """Execute SQL via DuckDB BQ extension, return pyarrow Table."""
    from connectors.bigquery.auth import get_metadata_token
    token = get_metadata_token()
    conn = duckdb.connect(":memory:")
    try:
        conn.execute("INSTALL bigquery FROM community; LOAD bigquery;")
        escaped = token.replace("'", "''")
        conn.execute(f"CREATE OR REPLACE SECRET bq_s (TYPE bigquery, ACCESS_TOKEN '{escaped}')")
        # Use bigquery_query() since the SQL is already authored against the BQ jobs API.
        # fetch_arrow_table() is used so the result is always a materialized pa.Table.
        return conn.execute(
            "SELECT * FROM bigquery_query(?, ?)",
            [project, sql],
        ).fetch_arrow_table()
    finally:
        conn.close()


def run_scan(
    conn: duckdb.DuckDBPyConnection,
    user: dict,
    raw_request: dict,
    *,
    project_id: str,
    quota: QuotaTracker,
) -> bytes:
    """Validate → quota → execute → serialize. Returns Arrow IPC bytes.

    Raises: WhereValidationError, QuotaExceededError, FileNotFoundError,
    PermissionError, ValueError
    """
    req = ScanRequest(**raw_request)
    repo = TableRegistryRepository(conn)
    row = repo.get(req.table_id)
    if not row:
        raise FileNotFoundError(req.table_id)
    if user.get("role") != "admin" and not can_access_table(user, req.table_id, conn):
        raise PermissionError(req.table_id)

    if req.limit and req.limit > _max_limit():
        raise ValueError(f"limit {req.limit} exceeds max {_max_limit()}")

    schema = _resolve_schema(conn, user, req.table_id, project_id)
    if req.where:
        validate_where(req.where, req.table_id, schema)
    if req.select:
        unknown = [c for c in req.select if c not in schema]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")
    # ORDER BY is interpolated into the SQL below, so each term must be a
    # known column, optionally followed by ASC or DESC.
    for term in req.order_by or []:
        parts = term.split()
        if not parts or parts[0] not in schema or len(parts) > 2 or (
            len(parts) == 2 and parts[1].upper() not in ("ASC", "DESC")
        ):
            raise ValueError(f"invalid order_by term: {term!r}")

    user_id = user.get("email") or "anon"
    with quota.acquire(user=user_id):
        if (row.get("source_type") or "") != "bigquery":
            # Local source: query parquet directly
            from app.utils import get_data_dir
            parquet = (
                get_data_dir() / "extracts" / row["source_type"] / "data" / f"{req.table_id}.parquet"
            )
            local = duckdb.connect(":memory:")
            try:
                projection = ", ".join(req.select) if req.select else "*"
                sql = f"SELECT {projection} FROM read_parquet(?)"
                if req.where:
                    sql += f" WHERE {req.where}"
                if req.order_by:
                    sql += f" ORDER BY {', '.join(req.order_by)}"
                if req.limit:
                    sql += f" LIMIT {int(req.limit)}"
                table = local.execute(sql, [str(parquet)]).fetch_arrow_table()
            finally:
                local.close()
        else:
            bq_sql = _build_bq_sql(row, project_id, req)
            table = _run_bq_scan(project_id, bq_sql)

        ipc = arrow_table_to_ipc_bytes(table)

        # Enforce max_result_bytes guard (spec §3.4 step 8)
        if len(ipc) > _max_result_bytes():
            # Truncate by keeping only as many leading rows as roughly fit.
            # Simple heuristic: divide the byte cap by the average bytes per row.
            row_count = table.num_rows
            avg = max(1, len(ipc) // max(row_count, 1))
            keep = min(row_count, _max_result_bytes() // max(avg, 1))
            table = table.slice(0, keep)
            ipc = arrow_table_to_ipc_bytes(table)

        # Record bytes for daily quota
        quota.record_bytes(user=user_id, n=len(ipc))
        return ipc


@router.post("/scan")
async def scan_endpoint(
    raw: dict,
    user: dict = Depends(get_current_user),
    conn: duckdb.DuckDBPyConnection = Depends(_get_db),
):
    project_id = get_value("data_source", "bigquery", "project", default="") or ""
    quota = _build_quota_tracker()
    try:
        ipc = run_scan(conn, user, raw, project_id=project_id, quota=quota)
        return Response(content=ipc, media_type=CONTENT_TYPE)
    except WhereValidationError as e:
        raise HTTPException(
            status_code=400,
            detail={"error": "validator_rejected", "kind": e.kind, "details": e.detail or {}},
        )
    except QuotaExceededError as e:
        raise HTTPException(
            status_code=429,
            detail={
                "error": "quota_exceeded",
                "kind": e.kind,
                "current": e.current,
                "limit": e.limit,
                "retry_after_seconds": e.retry_after_seconds,
            },
        )
    except FileNotFoundError:
        raise HTTPException(status_code=404, detail="table not found")
    except PermissionError:
        raise HTTPException(status_code=403, detail="not authorized")
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
```

- [ ] **Step 11.4: Run tests to verify pass**

Run: `pytest tests/test_v2_scan.py -v`
Expected: 3 passed.

- [ ] **Step 11.5: Commit**

```bash
git add app/api/v2_scan.py tests/test_v2_scan.py
git commit -m "feat(v2): POST /api/v2/scan — validator + quota + Arrow IPC pipeline"
```

---

## Task 12: Drop wrap-view code path with `legacy_wrap_views` toggle

Spec §6.1. The wrap view that `connectors/bigquery/extractor.py` creates for VIEW entities is the source of the problem reported in issue #101.
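
The intended behavior after this task can be summarized as a three-way decision on the BQ entity type and the toggle. The sketch below is illustrative only: the function `view_action` and its return labels are hypothetical names, not part of the extractor.

```python
def view_action(entity_type: str, legacy_wrap_views: bool = False) -> str:
    """Describe what the extractor should do for one BQ entity (hypothetical helper)."""
    if entity_type == "BASE TABLE":
        return "direct_view"  # master view referencing the attached BQ table
    if legacy_wrap_views:
        return "wrap_view"    # old bigquery_query() wrapper, kept behind the toggle
    return "meta_only"        # record in _meta only; analysts use `da fetch`


for entity in ("BASE TABLE", "VIEW", "MATERIALIZED_VIEW"):
    print(entity, view_action(entity), view_action(entity, legacy_wrap_views=True))
```

In every case the entity is still recorded in `_meta`; only the creation of the master view differs.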
**Files:** - Modify: `connectors/bigquery/extractor.py` - Modify: `tests/test_bigquery_extractor.py` - [ ] **Step 12.1: Write failing test for the new behavior** Append to `tests/test_bigquery_extractor.py`: ```python class TestDropWrapViewForBQViews: def test_view_entity_does_not_create_master_view_by_default(self, tmp_path, monkeypatch): from connectors.bigquery.extractor import init_extract monkeypatch.setattr("connectors.bigquery.extractor.get_metadata_token", lambda: "tok") monkeypatch.setattr("connectors.bigquery.extractor._detect_table_type", lambda *a, **kw: "VIEW") # Stub BQ extension calls to avoid hitting real BQ real_connect = duckdb.connect def safe_connect(*a, **kw): return _CapturingProxy(real_connect(*a, **kw)) monkeypatch.setattr("connectors.bigquery.extractor.duckdb.connect", safe_connect) # legacy toggle is OFF by default → expect no CREATE VIEW for the BQ view monkeypatch.setattr( "connectors.bigquery.extractor.get_value", lambda *args, default=None, **kw: False if "legacy_wrap_views" in args else default, raising=False, ) result = init_extract( str(tmp_path), "my-project", [{"name": "myview", "bucket": "ds", "source_table": "myview", "description": ""}], ) # Confirm extract.duckdb has _meta + _remote_attach but NO master view for myview c = duckdb.connect(str(tmp_path / "extract.duckdb"), read_only=True) try: views = c.execute( "SELECT view_name FROM duckdb_views() WHERE view_name='myview'" ).fetchall() assert views == [], f"expected no wrap view for VIEW entity by default; got {views}" meta = c.execute("SELECT table_name FROM _meta").fetchall() assert ("myview",) in meta, "_meta must still record the view" finally: c.close() def test_legacy_wrap_views_toggle_restores_old_behavior(self, tmp_path, monkeypatch): from connectors.bigquery.extractor import init_extract monkeypatch.setattr("connectors.bigquery.extractor.get_metadata_token", lambda: "tok") monkeypatch.setattr("connectors.bigquery.extractor._detect_table_type", lambda *a, **kw: "VIEW") 
real_connect = duckdb.connect def safe_connect(*a, **kw): return _CapturingProxy(real_connect(*a, **kw)) monkeypatch.setattr("connectors.bigquery.extractor.duckdb.connect", safe_connect) # legacy toggle ON → should still create the wrap view monkeypatch.setattr( "connectors.bigquery.extractor.get_value", lambda *args, default=None, **kw: True if "legacy_wrap_views" in args else default, raising=False, ) init_extract( str(tmp_path), "my-project", [{"name": "myview", "bucket": "ds", "source_table": "myview", "description": ""}], ) c = duckdb.connect(str(tmp_path / "extract.duckdb"), read_only=True) try: views = c.execute( "SELECT view_name FROM duckdb_views() WHERE view_name='myview'" ).fetchall() assert views == [("myview",)] finally: c.close() ``` - [ ] **Step 12.2: Run tests to verify failure** Run: `pytest tests/test_bigquery_extractor.py::TestDropWrapViewForBQViews -v` Expected: FAIL — current code always emits the wrap view for VIEW entities. - [ ] **Step 12.3: Modify `connectors/bigquery/extractor.py`** Find the section in `init_extract` that emits the wrap view for VIEW entities. Replace the dual-path branch: ```python # OLD: if entity_type == "BASE TABLE": view_sql = (...) # direct ref else: if entity_type not in ("VIEW", "MATERIALIZED_VIEW"): logger.warning(...) bq_inner = ... view_sql = (...) # bigquery_query() wrap conn.execute(view_sql) ``` With: ```python # NEW: only emit wrap view for BASE TABLE; for VIEW types, just record in _meta. from app.instance_config import get_value as _get_value legacy_wrap_views = bool(_get_value("data_source", "bigquery", "legacy_wrap_views", default=False)) if entity_type == "BASE TABLE": view_sql = ( f'CREATE OR REPLACE VIEW "{table_name}" AS ' f'SELECT * FROM bq."{dataset}"."{source_table}"' ) conn.execute(view_sql) elif legacy_wrap_views: # Backwards compatibility — for one release cycle only. 
if entity_type not in ("VIEW", "MATERIALIZED_VIEW"): logger.warning( "Unknown BQ entity type %r for %s.%s.%s — using bigquery_query() path", entity_type, project_id, dataset, source_table, ) bq_inner = f"SELECT * FROM `{project_id}.{dataset}.{source_table}`" bq_inner_escaped = bq_inner.replace("'", "''") view_sql = ( f'CREATE OR REPLACE VIEW "{table_name}" AS ' f"SELECT * FROM bigquery_query('{project_id}', '{bq_inner_escaped}')" ) conn.execute(view_sql) else: # Default: VIEW / MATERIALIZED_VIEW are recorded in _meta but no master view created. # Analyst must use `da fetch` (v2 primitives) to materialize a snapshot locally. logger.info( "Skipping wrap view for %s entity %s.%s.%s — use `da fetch`", entity_type, project_id, dataset, source_table, ) # _meta entry is recorded in ALL branches (existing code below stays as-is) conn.execute( "INSERT INTO _meta VALUES (?, ?, 0, 0, ?, 'remote')", [table_name, tc.get("description", ""), now], ) ``` - [ ] **Step 12.4: Run tests to verify pass** Run: `pytest tests/test_bigquery_extractor.py -v` Expected: ALL pass — including the 2 new TestDropWrapViewForBQViews tests + existing tests (some may need updating; if so, update them to reflect that VIEW now skips the master view by default). If existing `TestViewVsTableTemplates::test_view_uses_bigquery_query_function` breaks, **update it** to enable the legacy toggle in its monkeypatch (per pattern in Step 12.1's second test). - [ ] **Step 12.5: Commit** ```bash git add connectors/bigquery/extractor.py tests/test_bigquery_extractor.py git commit -m "feat(bq): drop wrap-view for VIEW entities by default; legacy toggle behind flag" ``` --- ## Task 13: Arrow over HTTP client + JSON helpers Client-side counterpart to v2 endpoints. Used by all `da` commands that talk to v2 API. 
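The client helpers in this task share one error path: any status of 400 or above becomes an exception carrying the decoded body. A minimal sketch of that path as a single reusable check follows. `ClientError`, `FakeResponse`, and `raise_for_v2_error` are illustrative names, not part of the plan, and the response double only mimics the three attributes the check reads.

```python
import json
from dataclasses import dataclass, field
from typing import Any


class ClientError(Exception):
    """Illustrative stand-in for the plan's V2ClientError."""

    def __init__(self, status_code: int, body: Any):
        super().__init__(f"HTTP {status_code}: {str(body)[:200]}")
        self.status_code = status_code
        self.body = body


@dataclass
class FakeResponse:
    """Hypothetical response double exposing only what the check needs."""

    status_code: int
    text: str = ""
    headers: dict = field(default_factory=dict)

    def json(self) -> Any:
        return json.loads(self.text)


def raise_for_v2_error(resp) -> None:
    """Raise ClientError for 4xx/5xx, decoding the body as JSON when advertised."""
    if resp.status_code < 400:
        return
    content_type = resp.headers.get("content-type", "")
    body = resp.json() if "json" in content_type else resp.text
    raise ClientError(resp.status_code, body)
```

Calling one such check from each helper keeps the GET, POST-JSON, and POST-Arrow paths from drifting apart when the error format changes.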
**Files:** - Create: `cli/v2_client.py` - Test: `tests/test_v2_client.py` - [ ] **Step 13.1: Write failing tests** ```python # tests/test_v2_client.py import json import pyarrow as pa import pytest from unittest.mock import MagicMock, patch from cli.v2_client import ( api_get_json, api_post_arrow, api_post_json, V2ClientError, ) def _fake_response(*, status=200, json_body=None, arrow_body=None, content_type=None): resp = MagicMock() resp.status_code = status if json_body is not None: resp.json.return_value = json_body resp.text = json.dumps(json_body) resp.content = resp.text.encode() if arrow_body is not None: resp.content = arrow_body if content_type: resp.headers = {"content-type": content_type} else: resp.headers = {} return resp class TestApiGetJson: def test_200_returns_parsed_json(self): with patch("cli.v2_client.requests.get") as m: m.return_value = _fake_response(json_body={"hello": "world"}) assert api_get_json("/api/v2/catalog") == {"hello": "world"} def test_4xx_raises_v2clienterror(self): with patch("cli.v2_client.requests.get") as m: m.return_value = _fake_response(status=403, json_body={"detail": "nope"}) with pytest.raises(V2ClientError) as e: api_get_json("/api/v2/catalog") assert e.value.status_code == 403 class TestApiPostArrow: def test_returns_arrow_table(self): from app.api.v2_arrow import arrow_table_to_ipc_bytes ipc = arrow_table_to_ipc_bytes(pa.table({"x": [1, 2, 3]})) with patch("cli.v2_client.requests.post") as m: m.return_value = _fake_response( arrow_body=ipc, content_type="application/vnd.apache.arrow.stream", ) got = api_post_arrow("/api/v2/scan", {"table_id": "x"}) assert got.num_rows == 3 assert got.column_names == ["x"] ``` - [ ] **Step 13.2: Run tests to verify failure** Run: `pytest tests/test_v2_client.py -v` Expected: FAIL — module doesn't exist. 
- [ ] **Step 13.3: Implement v2 client** Create `cli/v2_client.py`: ```python """HTTP client helpers for /api/v2/* endpoints (CLI side).""" from __future__ import annotations from dataclasses import dataclass from typing import Any import io import requests import pyarrow as pa from cli.config import get_server_url, get_pat @dataclass class V2ClientError(Exception): status_code: int body: Any message: str = "" def __str__(self) -> str: return f"HTTP {self.status_code}: {self.message or self.body}" def _headers() -> dict: pat = get_pat() return {"Authorization": f"Bearer {pat}"} if pat else {} def api_get_json(path: str, **params) -> dict: url = f"{get_server_url().rstrip('/')}{path}" r = requests.get(url, headers=_headers(), params=params or None, timeout=30) if r.status_code >= 400: body = r.json() if "json" in r.headers.get("content-type", "") else r.text raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200]) return r.json() def api_post_json(path: str, payload: dict) -> dict: url = f"{get_server_url().rstrip('/')}{path}" r = requests.post(url, json=payload, headers=_headers(), timeout=120) if r.status_code >= 400: body = r.json() if "json" in r.headers.get("content-type", "") else r.text raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200]) return r.json() def api_post_arrow(path: str, payload: dict) -> pa.Table: """Post JSON, expect Arrow IPC stream response.""" url = f"{get_server_url().rstrip('/')}{path}" r = requests.post(url, json=payload, headers=_headers(), timeout=600) if r.status_code >= 400: body = r.json() if "json" in r.headers.get("content-type", "") else r.text raise V2ClientError(status_code=r.status_code, body=body, message=str(body)[:200]) reader = pa.ipc.open_stream(io.BytesIO(r.content)) return reader.read_all() ``` If `cli/config.py` lacks `get_server_url` / `get_pat`, those helpers exist already under different names — adapt to whatever the existing CLI uses (`api_get` in 
`cli/client.py` is the existing helper; mirror its config-loading pattern). - [ ] **Step 13.4: Run tests to verify pass** Run: `pytest tests/test_v2_client.py -v` Expected: 3 passed. - [ ] **Step 13.5: Commit** ```bash git add cli/v2_client.py tests/test_v2_client.py git commit -m "feat(cli): v2 HTTP client (JSON + Arrow IPC)" ``` --- ## Task 14: Snapshot metadata I/O + flock helper Backing for `da fetch` and `da snapshot *`. Spec §4.2. **Files:** - Create: `cli/snapshot_meta.py` - Test: `tests/test_snapshot_meta.py` - [ ] **Step 14.1: Write failing tests** ```python # tests/test_snapshot_meta.py import json import pytest from pathlib import Path from cli.snapshot_meta import ( SnapshotMeta, write_meta, read_meta, list_snapshots, snapshot_lock, ) @pytest.fixture def snap_dir(tmp_path): d = tmp_path / "snapshots" d.mkdir() return d class TestMetaIO: def test_round_trip(self, snap_dir): meta = SnapshotMeta( name="cz_recent", table_id="bq_view", select=["a", "b"], where="a > 1", limit=100, order_by=None, fetched_at="2026-04-27T17:30:00Z", effective_as_of="2026-04-27T17:30:00Z", rows=10, bytes_local=1024, estimated_scan_bytes_at_fetch=5_000_000, result_hash_md5="abc", ) write_meta(snap_dir, meta) got = read_meta(snap_dir, "cz_recent") assert got == meta def test_read_missing_returns_none(self, snap_dir): assert read_meta(snap_dir, "missing") is None def test_list_snapshots_empty(self, snap_dir): assert list_snapshots(snap_dir) == [] def test_list_snapshots_with_data(self, snap_dir): for name in ("a", "b", "c"): (snap_dir / f"{name}.parquet").write_bytes(b"PAR1\\x00\\x00PAR1") write_meta(snap_dir, SnapshotMeta( name=name, table_id="t", select=None, where=None, limit=None, order_by=None, fetched_at="t", effective_as_of="t", rows=0, bytes_local=10, estimated_scan_bytes_at_fetch=0, result_hash_md5="", )) names = sorted(s.name for s in list_snapshots(snap_dir)) assert names == ["a", "b", "c"] class TestSnapshotLock: def test_lock_is_exclusive(self, snap_dir, tmp_path): 
"""Two processes can't both hold the lock at once.""" import threading, time held_at = [] def worker(label, hold_seconds): with snapshot_lock(snap_dir): held_at.append((label, time.time())) time.sleep(hold_seconds) held_at.append((f"{label}-done", time.time())) t1 = threading.Thread(target=worker, args=("A", 0.2)) t2 = threading.Thread(target=worker, args=("B", 0.2)) t1.start(); time.sleep(0.05); t2.start() t1.join(); t2.join() # A acquired, A-done, B acquired, B-done — never interleaved labels = [x[0] for x in held_at] assert labels in ( ["A", "A-done", "B", "B-done"], ["B", "B-done", "A", "A-done"], ), f"expected serialized acquisition; got {labels}" ``` - [ ] **Step 14.2: Run tests to verify failure** Run: `pytest tests/test_snapshot_meta.py -v` Expected: FAIL — module doesn't exist. - [ ] **Step 14.3: Implement snapshot meta + lock** Create `cli/snapshot_meta.py`: ```python """Snapshot sidecar metadata + file lock helpers (spec §4.2).""" from __future__ import annotations import contextlib import fcntl import json from dataclasses import dataclass, asdict from pathlib import Path from typing import Optional @dataclass class SnapshotMeta: name: str table_id: str select: Optional[list[str]] where: Optional[str] limit: Optional[int] order_by: Optional[list[str]] fetched_at: str # ISO 8601 UTC effective_as_of: str # ISO 8601 UTC, server-side eval time rows: int bytes_local: int estimated_scan_bytes_at_fetch: int result_hash_md5: str def _meta_path(snap_dir: Path, name: str) -> Path: return snap_dir / f"{name}.meta.json" def write_meta(snap_dir: Path, meta: SnapshotMeta) -> None: snap_dir.mkdir(parents=True, exist_ok=True) with _meta_path(snap_dir, meta.name).open("w") as f: json.dump(asdict(meta), f, indent=2) def read_meta(snap_dir: Path, name: str) -> Optional[SnapshotMeta]: p = _meta_path(snap_dir, name) if not p.exists(): return None data = json.loads(p.read_text()) return SnapshotMeta(**data) def list_snapshots(snap_dir: Path) -> list[SnapshotMeta]: if not 
snap_dir.exists(): return [] out = [] for meta_file in snap_dir.glob("*.meta.json"): try: data = json.loads(meta_file.read_text()) out.append(SnapshotMeta(**data)) except (json.JSONDecodeError, TypeError): continue return out def delete_snapshot(snap_dir: Path, name: str) -> bool: """Delete the snapshot's parquet + meta. Returns True if removed, False if missing.""" parquet = snap_dir / f"{name}.parquet" meta = _meta_path(snap_dir, name) removed = False if parquet.exists(): parquet.unlink(); removed = True if meta.exists(): meta.unlink(); removed = True return removed @contextlib.contextmanager def snapshot_lock(snap_dir: Path): """Exclusive flock on snap_dir/.lock — serializes snapshot installs. Concurrent `da fetch` invocations queue here. """ snap_dir.mkdir(parents=True, exist_ok=True) lock_file = snap_dir / ".lock" lock_file.touch(exist_ok=True) fd = open(lock_file, "r+") try: fcntl.flock(fd.fileno(), fcntl.LOCK_EX) yield finally: fcntl.flock(fd.fileno(), fcntl.LOCK_UN) fd.close() ``` - [ ] **Step 14.4: Run tests to verify pass** Run: `pytest tests/test_snapshot_meta.py -v` Expected: 5 passed. - [ ] **Step 14.5: Commit** ```bash git add cli/snapshot_meta.py tests/test_snapshot_meta.py git commit -m "feat(cli): snapshot metadata sidecar + flock helper" ``` --- ## Task 15: `da catalog` / `da schema` / `da describe` Spec §4.1. Discovery commands. 
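Keeping the table rendering separate from the HTTP call makes the human-readable output testable without a `CliRunner`. The sketch below assumes the catalog payload shape used in this task's tests (`id`, `source_type`, `query_mode`, `sql_flavor`, `name`); `format_catalog` is a hypothetical helper name.

```python
def format_catalog(payload: dict) -> list[str]:
    """Render a catalog payload as fixed-width text lines (header first)."""
    lines = [f"{'ID':30s} {'SOURCE':10s} {'MODE':8s} {'FLAVOR':10s} NAME"]
    for t in payload.get("tables", []):
        lines.append(
            f"{t['id']:30s} {t['source_type']:10s} {t['query_mode']:8s} "
            f"{t['sql_flavor']:10s} {t.get('name', '')}"
        )
    return lines


if __name__ == "__main__":
    demo = {"tables": [{"id": "orders", "name": "orders", "source_type": "keboola",
                        "query_mode": "local", "sql_flavor": "duckdb"}]}
    print("\n".join(format_catalog(demo)))
```

A command body can then reduce to fetching the payload and echoing each returned line.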
**Files:** - Create: `cli/commands/catalog.py`, `cli/commands/schema.py`, `cli/commands/describe.py` - Modify: `cli/main.py` - Test: `tests/test_cli_catalog.py` - [ ] **Step 15.1: Write failing tests** ```python # tests/test_cli_catalog.py import json from typer.testing import CliRunner from unittest.mock import patch import pytest def test_da_catalog_json_output(monkeypatch): """`da catalog --json` emits the server's JSON verbatim.""" payload = { "tables": [ {"id": "orders", "name": "orders", "source_type": "keboola", "query_mode": "local", "sql_flavor": "duckdb", "where_examples": [], "fetch_via": "...", "rough_size_hint": None}, ], "server_time": "2026-04-27T17:30:00Z", } with patch("cli.commands.catalog.api_get_json", return_value=payload): from cli.main import app as cli_app runner = CliRunner() result = runner.invoke(cli_app, ["catalog", "--json"]) assert result.exit_code == 0 out = json.loads(result.stdout) assert out["tables"][0]["id"] == "orders" def test_da_catalog_table_output(monkeypatch): payload = { "tables": [ {"id": "orders", "name": "orders", "source_type": "keboola", "query_mode": "local", "sql_flavor": "duckdb", "where_examples": [], "fetch_via": "...", "rough_size_hint": None}, ], "server_time": "2026-04-27T17:30:00Z", } with patch("cli.commands.catalog.api_get_json", return_value=payload): from cli.main import app as cli_app runner = CliRunner() result = runner.invoke(cli_app, ["catalog"]) assert result.exit_code == 0 assert "orders" in result.stdout assert "keboola" in result.stdout ``` - [ ] **Step 15.2: Run tests to verify failure** Run: `pytest tests/test_cli_catalog.py -v` Expected: FAIL — `cli.commands.catalog` doesn't exist. 
- [ ] **Step 15.3: Implement `cli/commands/catalog.py`** ```python """`da catalog` — list registered tables (spec §4.1).""" import json as json_lib import typer from cli.v2_client import api_get_json, V2ClientError catalog_app = typer.Typer(help="List tables visible to you") @catalog_app.callback(invoke_without_command=True) def catalog( ctx: typer.Context, json: bool = typer.Option(False, "--json", help="Emit raw JSON"), refresh: bool = typer.Option(False, "--refresh", help="Bypass client-side cache"), ): """List tables visible to you (RBAC-filtered).""" if ctx.invoked_subcommand is not None: return try: data = api_get_json("/api/v2/catalog", refresh=int(refresh)) except V2ClientError as e: typer.echo(f"Error: catalog fetch failed: {e}", err=True) raise typer.Exit(5) if json: typer.echo(json_lib.dumps(data, indent=2)) return # Human-readable table typer.echo(f"{'ID':30s} {'SOURCE':10s} {'MODE':8s} {'FLAVOR':10s} NAME") for t in data.get("tables", []): typer.echo( f"{t['id']:30s} {t['source_type']:10s} {t['query_mode']:8s} " f"{t['sql_flavor']:10s} {t.get('name', '')}" ) ``` - [ ] **Step 15.4: Implement `cli/commands/schema.py`** ```python """`da schema ` — show columns + BQ flavor hints (spec §4.1).""" import json as json_lib import typer from cli.v2_client import api_get_json, V2ClientError schema_app = typer.Typer(help="Show column metadata for a table") @schema_app.callback(invoke_without_command=True) def schema( ctx: typer.Context, table_id: str = typer.Argument(...), json: bool = typer.Option(False, "--json"), ): """Show column metadata for a table.""" if ctx.invoked_subcommand is not None: return try: data = api_get_json(f"/api/v2/schema/{table_id}") except V2ClientError as e: typer.echo(f"Error: schema fetch failed: {e}", err=True) raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2) if json: typer.echo(json_lib.dumps(data, indent=2)) return flavor = data.get("sql_flavor", "duckdb") typer.echo(f"Table: {data['table_id']} 
({data['source_type']} - use {flavor.upper()} SQL dialect)")
    typer.echo("")
    typer.echo(f"{'COLUMN':30s} {'TYPE':15s} {'NULL':5s} DESCRIPTION")
    for c in data.get("columns", []):
        typer.echo(
            f"{c['name']:30s} {c['type']:15s} "
            f"{'YES' if c.get('nullable') else 'NO':5s} {c.get('description', '')}"
        )
    if data.get("partition_by"):
        typer.echo(f"\nPartition: {data['partition_by']}")
    if data.get("clustered_by"):
        typer.echo(f"Clustered: {', '.join(data['clustered_by'])}")
    if data.get("where_dialect_hints"):
        typer.echo("\nWHERE dialect hints:")
        for k, v in data["where_dialect_hints"].items():
            typer.echo(f"  {k:25s} {v}")
```

- [ ] **Step 15.5: Implement `cli/commands/describe.py`**

```python
"""`da describe
` - schema + sample rows (spec §4.1)."""
import json as json_lib

import typer

from cli.v2_client import api_get_json, V2ClientError

describe_app = typer.Typer(help="Show schema + sample rows for a table")


@describe_app.callback(invoke_without_command=True)
def describe(
    ctx: typer.Context,
    table_id: str = typer.Argument(...),
    n: int = typer.Option(5, "-n", "--rows", help="Sample rows count"),
    json: bool = typer.Option(False, "--json"),
):
    """Show schema + sample rows for a table."""
    if ctx.invoked_subcommand is not None:
        return
    try:
        sch = api_get_json(f"/api/v2/schema/{table_id}")
        sam = api_get_json(f"/api/v2/sample/{table_id}", n=n)
    except V2ClientError as e:
        typer.echo(f"Error: describe failed: {e}", err=True)
        raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2)
    if json:
        typer.echo(json_lib.dumps({"schema": sch, "sample": sam}, indent=2, default=str))
        return
    # Compact schema listing followed by the sample rows
    typer.echo(f"Table: {sch['table_id']}")
    typer.echo("")
    typer.echo("Schema:")
    for c in sch.get("columns", []):
        typer.echo(f"  {c['name']:30s} {c['type']}")
    typer.echo("")
    typer.echo(f"Sample ({len(sam.get('rows', []))} rows):")
    for row in sam.get("rows", []):
        typer.echo(f"  {row}")
```

- [ ] **Step 15.6: Register commands in `cli/main.py`**

Find where other subcommands are added (`app.add_typer(...)` calls). Add:

```python
from cli.commands.catalog import catalog_app
from cli.commands.schema import schema_app
from cli.commands.describe import describe_app

app.add_typer(catalog_app, name="catalog")
app.add_typer(schema_app, name="schema")
app.add_typer(describe_app, name="describe")
```

(Adjust based on the actual `cli/main.py` pattern; it may register flat commands instead of typers.)

- [ ] **Step 15.7: Run tests to verify pass**

Run: `pytest tests/test_cli_catalog.py -v`
Expected: 2 passed.
- [ ] **Step 15.8: Commit** ```bash git add cli/commands/catalog.py cli/commands/schema.py cli/commands/describe.py cli/main.py tests/test_cli_catalog.py git commit -m "feat(cli): da catalog/schema/describe — discovery commands" ``` --- ## Task 16: `da fetch` Spec §4.2. The headline command. **Files:** - Create: `cli/commands/fetch.py` - Modify: `cli/main.py` - Test: `tests/test_cli_fetch.py` - [ ] **Step 16.1: Write failing tests** ```python # tests/test_cli_fetch.py from typer.testing import CliRunner from unittest.mock import patch, MagicMock import pyarrow as pa import json import pytest def _seed_local_dir(tmp_path): """Set up the user's agnes-data directory for the CLI to find.""" (tmp_path / "user" / "duckdb").mkdir(parents=True) (tmp_path / "user" / "snapshots").mkdir(parents=True) return tmp_path @pytest.fixture def cli_env(tmp_path, monkeypatch): monkeypatch.setenv("DA_LOCAL_DIR", str(_seed_local_dir(tmp_path))) yield tmp_path class TestDaFetch: def test_estimate_only_does_not_create_snapshot(self, cli_env, monkeypatch): from cli.main import app as cli_app with patch("cli.commands.fetch.api_post_json") as m: m.return_value = { "estimated_scan_bytes": 1_000_000, "estimated_result_rows": 100, "estimated_result_bytes": 1_000, "bq_cost_estimate_usd": 0.0001, } runner = CliRunner() result = runner.invoke(cli_app, [ "fetch", "bq_view", "--select", "a,b", "--where", "a > 1", "--limit", "100", "--estimate", ]) assert result.exit_code == 0, result.stdout # No parquet should be created assert not list((cli_env / "user" / "snapshots").glob("*.parquet")) def test_fetch_creates_snapshot_with_meta(self, cli_env, monkeypatch): from cli.main import app as cli_app # Estimate path with patch("cli.commands.fetch.api_post_json") as m_est, \ patch("cli.commands.fetch.api_post_arrow") as m_scan: m_est.return_value = { "estimated_scan_bytes": 1000, "estimated_result_rows": 2, "estimated_result_bytes": 100, "bq_cost_estimate_usd": 0.0, } m_scan.return_value = pa.table({"a": [1, 
2], "b": ["x", "y"]}) runner = CliRunner() result = runner.invoke(cli_app, [ "fetch", "bq_view", "--select", "a,b", "--limit", "10", "--no-estimate", ]) assert result.exit_code == 0, result.stdout snap = cli_env / "user" / "snapshots" / "bq_view.parquet" meta = cli_env / "user" / "snapshots" / "bq_view.meta.json" assert snap.exists() assert meta.exists() assert json.loads(meta.read_text())["rows"] == 2 def test_fetch_existing_snapshot_without_force_fails(self, cli_env, monkeypatch): from cli.main import app as cli_app # Pre-create a snapshot snap = cli_env / "user" / "snapshots" / "bq_view.parquet" snap.write_bytes(b"PAR1\\x00\\x00PAR1") meta = cli_env / "user" / "snapshots" / "bq_view.meta.json" meta.write_text('{"name": "bq_view", "table_id": "bq_view", "select": null, "where": null, "limit": null, "order_by": null, "fetched_at": "x", "effective_as_of": "x", "rows": 0, "bytes_local": 0, "estimated_scan_bytes_at_fetch": 0, "result_hash_md5": ""}') runner = CliRunner() result = runner.invoke(cli_app, ["fetch", "bq_view", "--no-estimate"]) assert result.exit_code == 6, f"expected exit code 6 (snapshot_exists); got {result.exit_code}\\n{result.stdout}" ``` - [ ] **Step 16.2: Run tests to verify failure** Run: `pytest tests/test_cli_fetch.py -v` Expected: FAIL — `cli.commands.fetch` doesn't exist. 
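`da fetch` also has to turn HTTP failures into process exit codes. A table-driven sketch of that mapping follows, using only the status-to-code pairs that appear in this plan's `_exit_code_for`. The name `exit_code_for_status` is illustrative, and the spec remains the authority on what each code means.

```python
# Exact-match statuses; anything else falls through to the range checks below.
_STATUS_TO_EXIT = {
    400: 2,  # bad request, including validator rejections
    401: 7,  # not authenticated
    403: 8,  # forbidden
    404: 8,  # unknown table, treated like a permission failure
    429: 3,  # quota / rate limit
}


def exit_code_for_status(status_code: int) -> int:
    """Map an HTTP status to a CLI exit code (sketch of the plan's mapping)."""
    if status_code in _STATUS_TO_EXIT:
        return _STATUS_TO_EXIT[status_code]
    if status_code >= 500:
        return 5  # server-side failure
    return 9  # any other client error


if __name__ == "__main__":
    for status in (400, 401, 403, 404, 429, 503, 418):
        print(status, "->", exit_code_for_status(status))
```

A single mapping of this kind could also serve the discovery and snapshot commands, which otherwise repeat an inline conditional.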
- [ ] **Step 16.3: Implement `cli/commands/fetch.py`** ```python """`da fetch` — materialize a filtered subset of a remote table locally (spec §4.2).""" from __future__ import annotations import hashlib import json import os from datetime import datetime, timezone from pathlib import Path import duckdb import pyarrow as pa import pyarrow.parquet as pq import typer from cli.snapshot_meta import ( SnapshotMeta, write_meta, read_meta, snapshot_lock, ) from cli.v2_client import api_post_json, api_post_arrow, V2ClientError fetch_app = typer.Typer(help="Fetch a filtered subset of a remote table locally") def _local_dir() -> Path: return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve() def _print_estimate(d: dict) -> None: typer.echo(f" estimated_scan_bytes: {d.get('estimated_scan_bytes', 0):>15,} bytes") typer.echo(f" estimated_result_rows: {d.get('estimated_result_rows', 0):>15,}") typer.echo(f" estimated_result_bytes: {d.get('estimated_result_bytes', 0):>15,} bytes") typer.echo(f" bq_cost_estimate_usd: $ {d.get('bq_cost_estimate_usd', 0):.4f}") @fetch_app.callback(invoke_without_command=True) def fetch( ctx: typer.Context, table_id: str = typer.Argument(...), select: str = typer.Option(None, "--select", help="Comma-separated column list"), where: str = typer.Option(None, "--where", help="WHERE predicate (BQ flavor for remote tables)"), limit: int = typer.Option(None, "--limit"), order_by: str = typer.Option(None, "--order-by", help="Comma-separated"), as_name: str = typer.Option(None, "--as", help="Local snapshot name (default: )"), estimate: bool = typer.Option(False, "--estimate", help="Run dry-run only, do not fetch"), no_estimate: bool = typer.Option(False, "--no-estimate", help="Skip the pre-fetch estimate"), force: bool = typer.Option(False, "--force", help="Overwrite existing snapshot of the same name"), ): """Fetch a filtered subset of a remote table locally.""" if ctx.invoked_subcommand is not None: return name = as_name or table_id snap_dir = _local_dir() 
/ "user" / "snapshots"
    snap_dir.mkdir(parents=True, exist_ok=True)

    # Build request
    req = {"table_id": table_id}
    if select:
        req["select"] = [c.strip() for c in select.split(",") if c.strip()]
    if where:
        req["where"] = where
    if limit is not None:
        req["limit"] = int(limit)
    if order_by:
        req["order_by"] = [c.strip() for c in order_by.split(",") if c.strip()]

    # Estimate (always shown unless --no-estimate)
    if not no_estimate:
        try:
            est = api_post_json("/api/v2/scan/estimate", req)
        except V2ClientError as e:
            typer.echo(f"Error: estimate failed: {e}", err=True)
            raise typer.Exit(_exit_code_for(e))
        typer.echo(f"Estimate for {table_id}:")
        _print_estimate(est)
    if estimate:
        return

    # Snapshot existence check (read the sidecar once)
    existing = read_meta(snap_dir, name)
    if not force and existing is not None:
        typer.echo(
            f"Error: snapshot {name!r} already exists "
            f"(fetched {existing.fetched_at}, {existing.rows:,} rows). "
            f"Pass --force to overwrite, or 'da snapshot refresh {name}' to update in place.",
            err=True,
        )
        raise typer.Exit(6)

    # Fetch
    try:
        table = api_post_arrow("/api/v2/scan", req)
    except V2ClientError as e:
        typer.echo(f"Error: fetch failed: {e}", err=True)
        raise typer.Exit(_exit_code_for(e))

    # Install under flock
    parquet_path = snap_dir / f"{name}.parquet"
    with snapshot_lock(snap_dir):
        pq.write_table(table, parquet_path)
        # Register view in user analytics.duckdb
        local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb"
        local_db.parent.mkdir(parents=True, exist_ok=True)
        conn = duckdb.connect(str(local_db))
        try:
            # A view stores its query text, so the path is inlined as an escaped
            # string literal rather than bound as a prepared-statement parameter.
            path_literal = str(parquet_path).replace("'", "''")
            conn.execute(
                f'CREATE OR REPLACE VIEW "{name}" AS '
                f"SELECT * FROM read_parquet('{path_literal}')"
            )
        finally:
            conn.close()

    # Compute hash + write meta
    result_hash = hashlib.md5(parquet_path.read_bytes()[:1_000_000]).hexdigest()
    now = datetime.now(timezone.utc).isoformat()
    meta = SnapshotMeta(
        name=name,
        table_id=table_id,
        select=req.get("select"),
        where=req.get("where"),
        limit=req.get("limit"),
        order_by=req.get("order_by"),
        fetched_at=now,
        effective_as_of=now,
rows=int(table.num_rows), bytes_local=parquet_path.stat().st_size, estimated_scan_bytes_at_fetch=int(est.get("estimated_scan_bytes", 0)) if not no_estimate else 0, result_hash_md5=result_hash, ) write_meta(snap_dir, meta) typer.echo(f"Fetched {table.num_rows:,} rows -> {name}") def _exit_code_for(e: V2ClientError) -> int: if e.status_code == 400: # Inspect body for 'kind' body = e.body if isinstance(e.body, dict) else {} if body.get("error") == "validator_rejected": return 2 return 2 if e.status_code == 401: return 7 if e.status_code == 403: return 8 if e.status_code == 404: return 8 # treat unknown table as RBAC-equivalent if e.status_code == 429: return 3 if e.status_code >= 500: return 5 return 9 ``` - [ ] **Step 16.4: Register in `cli/main.py`** ```python from cli.commands.fetch import fetch_app app.add_typer(fetch_app, name="fetch") ``` - [ ] **Step 16.5: Run tests to verify pass** Run: `pytest tests/test_cli_fetch.py -v` Expected: 3 passed. - [ ] **Step 16.6: Commit** ```bash git add cli/commands/fetch.py cli/main.py tests/test_cli_fetch.py git commit -m "feat(cli): da fetch — materialize filtered remote subset locally" ``` --- ## Task 17: `da snapshot list/refresh/drop/prune` Spec §4.2. 
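Both `da fetch` and `da snapshot refresh` rewrite a snapshot's parquet file while holding the lock. The lock serializes writers, but a reader that does not take the lock, or a crash mid-write, could still leave a half-written file visible. One common mitigation, sketched here as an assumption rather than something the spec requires, is to write to a temporary file in the same directory and swap it in with `os.replace`. `atomic_write_bytes` is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write data so that readers see either the old file or the new one."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Same directory means same filesystem, so os.replace is an atomic rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    demo = Path(tempfile.mkdtemp()) / "snapshots" / "demo.parquet"
    atomic_write_bytes(demo, b"first")
    atomic_write_bytes(demo, b"second")
    print(demo.read_bytes())
```

The same idea applies to parquet output by pointing `pq.write_table` at the temporary path and calling `os.replace` afterwards; the `.tmp` suffix keeps in-flight files out of the `*.parquet` globs used elsewhere in this plan.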
**Files:** - Create: `cli/commands/snapshot.py` - Modify: `cli/main.py` - Test: `tests/test_cli_snapshot.py` - [ ] **Step 17.1: Write failing tests** ```python # tests/test_cli_snapshot.py from typer.testing import CliRunner from unittest.mock import patch import json import pytest from cli.snapshot_meta import SnapshotMeta, write_meta @pytest.fixture def cli_env(tmp_path, monkeypatch): monkeypatch.setenv("DA_LOCAL_DIR", str(tmp_path)) snap_dir = tmp_path / "user" / "snapshots" snap_dir.mkdir(parents=True) yield tmp_path def _seed_meta(tmp_path, name="cz_recent", rows=100): snap_dir = tmp_path / "user" / "snapshots" parquet = snap_dir / f"{name}.parquet" parquet.write_bytes(b"PAR1\\x00\\x00PAR1") write_meta(snap_dir, SnapshotMeta( name=name, table_id="bq_view", select=None, where=None, limit=None, order_by=None, fetched_at="2026-04-27T10:00:00+00:00", effective_as_of="2026-04-27T10:00:00+00:00", rows=rows, bytes_local=parquet.stat().st_size, estimated_scan_bytes_at_fetch=0, result_hash_md5="abc", )) class TestSnapshotList: def test_list_empty(self, cli_env): from cli.main import app as cli_app runner = CliRunner() result = runner.invoke(cli_app, ["snapshot", "list"]) assert result.exit_code == 0 class TestSnapshotDrop: def test_drop_removes_files(self, cli_env): from cli.main import app as cli_app _seed_meta(cli_env, "cz_recent") snap_dir = cli_env / "user" / "snapshots" assert (snap_dir / "cz_recent.parquet").exists() runner = CliRunner() result = runner.invoke(cli_app, ["snapshot", "drop", "cz_recent"]) assert result.exit_code == 0 assert not (snap_dir / "cz_recent.parquet").exists() assert not (snap_dir / "cz_recent.meta.json").exists() def test_drop_missing_returns_2(self, cli_env): from cli.main import app as cli_app runner = CliRunner() result = runner.invoke(cli_app, ["snapshot", "drop", "nonexistent"]) assert result.exit_code != 0 ``` - [ ] **Step 17.2: Run tests to verify failure** Run: `pytest tests/test_cli_snapshot.py -v` Expected: FAIL — module doesn't 
exist.

- [ ] **Step 17.3: Implement `cli/commands/snapshot.py`**

```python
"""`da snapshot list/refresh/drop/prune` (spec §4.2)."""
from __future__ import annotations

import hashlib
import json as json_lib
import os
from datetime import datetime, timedelta, timezone
from pathlib import Path

import duckdb
import pyarrow.parquet as pq
import typer

from cli.snapshot_meta import (
    list_snapshots,
    read_meta,
    write_meta,
    delete_snapshot,
    snapshot_lock,
    SnapshotMeta,
)
from cli.v2_client import api_post_arrow, V2ClientError

snapshot_app = typer.Typer(help="Manage local snapshots")


def _local_dir() -> Path:
    return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve()


def _snap_dir() -> Path:
    return _local_dir() / "user" / "snapshots"


def _format_size(n: int) -> str:
    # Float division keeps the fractional part that the ".1f" format displays.
    size = float(n)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


@snapshot_app.command("list")
def list_cmd(
    json: bool = typer.Option(False, "--json"),
):
    """List local snapshots."""
    snaps = list_snapshots(_snap_dir())
    if json:
        typer.echo(json_lib.dumps([s.__dict__ for s in snaps], indent=2))
        return
    if not snaps:
        typer.echo("(no snapshots)")
        return
    typer.echo(f"{'NAME':30s} {'ROWS':>10s} {'SIZE':>10s} {'AGE':>10s} {'TABLE':30s} WHERE")
    now = datetime.now(timezone.utc)
    for s in sorted(snaps, key=lambda x: x.name):
        try:
            age = now - datetime.fromisoformat(s.fetched_at.replace("Z", "+00:00"))
            age_str = f"{age.days}d" if age.days else f"{int(age.total_seconds() // 3600)}h"
        except (ValueError, TypeError):
            age_str = "?"
where = (s.where or "")[:40] typer.echo( f"{s.name:30s} {s.rows:>10,} {_format_size(s.bytes_local):>10s} " f"{age_str:>10s} {s.table_id:30s} {where}" ) @snapshot_app.command("drop") def drop_cmd(name: str): """Delete a snapshot.""" snap_dir = _snap_dir() if read_meta(snap_dir, name) is None: typer.echo(f"Error: snapshot {name!r} not found", err=True) raise typer.Exit(2) with snapshot_lock(snap_dir): delete_snapshot(snap_dir, name) # Also drop the view from user analytics DB local_db = _local_dir() / "user" / "duckdb" / "analytics.duckdb" if local_db.exists(): conn = duckdb.connect(str(local_db)) try: conn.execute(f'DROP VIEW IF EXISTS "{name}"') finally: conn.close() typer.echo(f"Dropped {name}") @snapshot_app.command("refresh") def refresh_cmd( name: str, where: str = typer.Option(None, "--where", help="Override stored WHERE"), ): """Re-fetch a snapshot using its stored fetch parameters (spec §4.2).""" snap_dir = _snap_dir() meta = read_meta(snap_dir, name) if meta is None: typer.echo(f"Error: snapshot {name!r} not found", err=True) raise typer.Exit(2) req = { "table_id": meta.table_id, "select": meta.select, "where": where if where else meta.where, "limit": meta.limit, "order_by": meta.order_by, } try: table = api_post_arrow("/api/v2/scan", req) except V2ClientError as e: typer.echo(f"Error: refresh failed: {e}", err=True) raise typer.Exit(5 if e.status_code >= 500 else 8 if e.status_code == 403 else 2) parquet_path = snap_dir / f"{name}.parquet" with snapshot_lock(snap_dir): pq.write_table(table, parquet_path) new_hash = hashlib.md5(parquet_path.read_bytes()[:1_000_000]).hexdigest() identical = new_hash == meta.result_hash_md5 old_rows = meta.rows old_bytes = meta.bytes_local new_rows = int(table.num_rows) new_bytes = parquet_path.stat().st_size now = datetime.now(timezone.utc).isoformat() new_meta = SnapshotMeta( name=name, table_id=meta.table_id, select=req.get("select"), where=req.get("where"), limit=req.get("limit"), order_by=req.get("order_by"), 
        fetched_at=now,
        effective_as_of=now,
        rows=new_rows,
        bytes_local=new_bytes,
        estimated_scan_bytes_at_fetch=meta.estimated_scan_bytes_at_fetch,
        result_hash_md5=new_hash,
    )
    write_meta(snap_dir, new_meta)
    typer.echo(f"Refreshed {name}")
    typer.echo(f"  rows:            {old_rows:>10,} -> {new_rows:>10,} ({new_rows - old_rows:+,})")
    typer.echo(f"  bytes_local:     {_format_size(old_bytes)} -> {_format_size(new_bytes)}")
    typer.echo(f"  effective_as_of: {meta.effective_as_of} -> {now}")
    typer.echo(f"  identical:       {'yes' if identical else 'no'}")


@snapshot_app.command("prune")
def prune_cmd(
    older_than: str = typer.Option(None, "--older-than", help="e.g. 7d, 24h"),
    larger_than: str = typer.Option(None, "--larger-than", help="e.g. 1g, 500m"),
    dry_run: bool = typer.Option(False, "--dry-run"),
):
    """Drop snapshots matching predicates (both must hold when both are given)."""
    if not (older_than or larger_than):
        # Without a predicate every snapshot would match; refuse rather than drop all.
        typer.echo("Error: pass --older-than and/or --larger-than", err=True)
        raise typer.Exit(2)
    snap_dir = _snap_dir()
    snaps = list_snapshots(snap_dir)
    now = datetime.now(timezone.utc)
    matches = []
    for s in snaps:
        ok = True
        if older_than:
            try:
                age = now - datetime.fromisoformat(s.fetched_at.replace("Z", "+00:00"))
                if age < _parse_duration(older_than):
                    ok = False
            except (ValueError, TypeError):
                # Unparseable timestamp or duration: never treat as a match.
                ok = False
        if larger_than and s.bytes_local < _parse_size(larger_than):
            ok = False
        if ok:
            matches.append(s)
    for s in matches:
        if dry_run:
            typer.echo(f"would drop: {s.name} ({_format_size(s.bytes_local)}, {s.fetched_at})")
        else:
            with snapshot_lock(snap_dir):
                delete_snapshot(snap_dir, s.name)
            typer.echo(f"dropped: {s.name}")
    if not matches:
        typer.echo("(no matches)")


def _parse_duration(s: str) -> timedelta:
    s = s.strip().lower()
    if s.endswith("d"):
        return timedelta(days=int(s[:-1]))
    if s.endswith("h"):
        return timedelta(hours=int(s[:-1]))
    if s.endswith("m"):
        return timedelta(minutes=int(s[:-1]))
    raise ValueError(f"unknown duration: {s!r}")


def _parse_size(s: str) -> int:
    s = s.strip().lower()
    multipliers = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3, "t": 1024 ** 4}
    if s and s[-1] in multipliers:
        return int(float(s[:-1]) * multipliers[s[-1]])
    return int(s)
```

- [ ] **Step 17.4: Register in `cli/main.py`**

```python
from cli.commands.snapshot import snapshot_app

app.add_typer(snapshot_app, name="snapshot")
```

- [ ] **Step 17.5: Run tests to verify pass**

Run: `pytest tests/test_cli_snapshot.py -v`
Expected: 3 passed.

- [ ] **Step 17.6: Commit**

```bash
git add cli/commands/snapshot.py cli/main.py tests/test_cli_snapshot.py
git commit -m "feat(cli): da snapshot list/refresh/drop/prune"
```

---

## Task 18: `da disk-info`

Spec §4.3. Trivial.

**Files:**
- Create: `cli/commands/disk_info.py`
- Modify: `cli/main.py`
- Test: `tests/test_cli_disk_info.py`

- [ ] **Step 18.1: Write failing test**

```python
# tests/test_cli_disk_info.py
from typer.testing import CliRunner
import pytest


@pytest.fixture
def cli_env(tmp_path, monkeypatch):
    monkeypatch.setenv("DA_LOCAL_DIR", str(tmp_path))
    snap = tmp_path / "user" / "snapshots"
    snap.mkdir(parents=True)
    yield tmp_path


def test_disk_info_runs_and_reports(cli_env):
    (cli_env / "user" / "snapshots" / "x.parquet").write_bytes(b"A" * 1024)
    from cli.main import app as cli_app
    runner = CliRunner()
    result = runner.invoke(cli_app, ["disk-info"])
    assert result.exit_code == 0
    assert "Snapshots dir" in result.stdout
```

- [ ] **Step 18.2: Run test to verify failure**

Run: `pytest tests/test_cli_disk_info.py -v`
Expected: FAIL — module doesn't exist.

- [ ] **Step 18.3: Implement `cli/commands/disk_info.py`**

```python
"""`da disk-info` — show snapshot dir disk usage (spec §4.3)."""
import json
import os
import shutil
from pathlib import Path

import typer

disk_info_app = typer.Typer(help="Show snapshot disk usage")


def _local_dir() -> Path:
    return Path(os.environ.get("DA_LOCAL_DIR", ".")).resolve()


def _format_size(n: int) -> str:
    # Divide as a float so fractional sizes survive (1536 bytes -> "1.5 KB").
    size = float(n)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


@disk_info_app.callback(invoke_without_command=True)
def disk_info(
    ctx: typer.Context,
    # Named `as_json` so the flag does not shadow the `json` module.
    as_json: bool = typer.Option(False, "--json"),
):
    """Show snapshots disk usage."""
    if ctx.invoked_subcommand is not None:
        return
    snap_dir = _local_dir() / "user" / "snapshots"
    used = sum(p.stat().st_size for p in snap_dir.rglob("*") if p.is_file()) if snap_dir.exists() else 0
    count = len(list(snap_dir.glob("*.parquet"))) if snap_dir.exists() else 0
    free = shutil.disk_usage(snap_dir).free if snap_dir.exists() else 0
    quota_gb = int(os.environ.get("AGNES_SNAPSHOT_QUOTA_GB", "10"))
    if as_json:
        typer.echo(json.dumps({
            "snapshots_dir": str(snap_dir),
            "used_bytes": used,
            "snapshot_count": count,
            "free_bytes": free,
            "quota_gb": quota_gb,
        }))
        return
    typer.echo(f"Snapshots dir:    {snap_dir}")
    typer.echo(f"Used by Agnes:    {_format_size(used)} across {count} snapshots")
    typer.echo(f"Free disk:        {_format_size(free)}")
    typer.echo(f"Configured cap:   {quota_gb} GB (set AGNES_SNAPSHOT_QUOTA_GB to override)")
```

- [ ] **Step 18.4: Register in `cli/main.py`**

```python
from cli.commands.disk_info import disk_info_app
app.add_typer(disk_info_app, name="disk-info")
```

- [ ] **Step 18.5: Run test to verify pass**

Run: `pytest tests/test_cli_disk_info.py -v`
Expected: 1 passed.

- [ ] **Step 18.6: Commit**

```bash
git add cli/commands/disk_info.py cli/main.py tests/test_cli_disk_info.py
git commit -m "feat(cli): da disk-info — snapshot dir usage report"
```

---

## Task 19: CLAUDE.md agent rails addendum

Spec §5.1.

**Files:**
- Modify: `CLAUDE.md`

- [ ] **Step 19.1: Add the section**

Open `CLAUDE.md`. Find a sensible location (after "Business Metrics" or before "Hybrid Queries"). Add the full section from spec §5.1. The literal markdown is large but verbatim from the spec: copy it from `docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md` lines 437-528 (the "## Querying Agnes data — agent rails" block), and match the existing CLAUDE.md heading style.

- [ ] **Step 19.2: Verify tabs/style**

Run: `grep -n '^## ' CLAUDE.md` and confirm the new section is at H2 level and ordered sensibly with neighboring sections.

- [ ] **Step 19.3: Commit**

```bash
git add CLAUDE.md
git commit -m "docs(claude.md): agent rails for v2 primitives (catalog -> schema -> fetch)"
```

---

## Task 20: Standalone skill file `agnes-data-querying`

Spec §5.2.

**Files:**
- Create: `cli/skills/agnes-data-querying.md`

- [ ] **Step 20.1: Write the skill**

Mirror the CLAUDE.md addendum but framed as a skill. ~200 lines max.
Structure:

```markdown
---
name: agnes-data-querying
description: Query Agnes data correctly — discovery first (`da catalog` → `da schema` → `da describe`), then `da fetch` for remote tables, then `da query` locally. Use BigQuery SQL flavor for `--where` on remote tables.
---

# Agnes Data Querying

[full content per spec §5 + a quick-reference BQ flavor card]
```

- [ ] **Step 20.2: Verify path matches existing skill loader convention**

Other skills in this repo live at `cli/skills/.md` — confirm and adjust the spec path if convention is different (e.g. `cli/skills//SKILL.md`).

- [ ] **Step 20.3: Commit**

```bash
git add cli/skills/agnes-data-querying.md
git commit -m "docs(skills): agnes-data-querying — agent rails for v2 fetch primitives"
```

---

## Task 21: CHANGELOG `**BREAKING**` entry + `instance.yaml.example` knobs

Spec §6, §10.4. CLAUDE.md changelog discipline.

**Files:**
- Modify: `CHANGELOG.md`
- Modify: `config/instance.yaml.example`

- [ ] **Step 21.1: Add CHANGELOG entries**

Under `## [Unreleased]`, add:

```markdown
### Added
- `/api/v2/{catalog,schema,sample,scan,scan/estimate}` — discovery + scoped fetch primitives. See `docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md`.
- `da catalog`, `da schema`, `da describe`, `da fetch`, `da snapshot {list,refresh,drop,prune}`, `da disk-info` — CLI primitives backed by the v2 API.
- `cli/skills/agnes-data-querying.md` — Claude rails skill loaded for Agnes-flavored projects.
- `instance.yaml: api.scan.*` knobs (`max_limit`, `max_result_bytes`, `max_concurrent_per_user`, `max_daily_bytes_per_user`, `bq_cost_per_tb_usd`, `request_timeout_seconds`).

### Changed
- **BREAKING:** BigQuery views (`query_mode='remote'` with `_meta.query_mode='remote'`) are no longer wrapped as DuckDB master views in `analytics.duckdb`.
  `da query --remote "SELECT * FROM "` no longer resolves the view name; analysts must use `da fetch` to materialize a snapshot or `da query --remote "SELECT * FROM bigquery_query('proj', '')"` directly. To restore the previous behavior for one release cycle, set `instance.yaml: data_source.bigquery.legacy_wrap_views: true`.
```

- [ ] **Step 21.2: Add config knob defaults to `instance.yaml.example`**

Append a section showing the new knobs with sensible defaults + comments.

- [ ] **Step 21.3: Commit**

```bash
git add CHANGELOG.md config/instance.yaml.example
git commit -m "docs: CHANGELOG + instance.yaml.example for v2 fetch primitives (BREAKING)"
```

---

## Task 22: E2E verification on dev VM

Spec §11 manual gates. Pure verification — no code changes.

- [ ] **Step 22.1: Push branch + wait for image rebuild**

```bash
git push origin zs/test-bq-e2e
# `gh run watch` takes a run ID, not a branch, so look up the newest run for the branch first.
gh run watch "$(gh run list --branch zs/test-bq-e2e --limit 1 --json databaseId --jq '.[0].databaseId')"
```

- [ ] **Step 22.2: Auto-upgrade VM + verify health**

```bash
ssh foundryai-dev-zsrotyr   # or via gcloud compute ssh
sudo /usr/local/bin/agnes-auto-upgrade.sh
sudo docker exec agnes-app-1 curl -sS http://localhost:8000/api/health \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); print(d["commit_sha"], d["schema_version"])'
```

Expected: commit sha matches the latest local commit; schema_version is current.

- [ ] **Step 22.3: Verify v2 endpoints respond**

```bash
PAT=$(... your pat ...)
curl -sS -H "Authorization: Bearer $PAT" https:///api/v2/catalog | jq '.tables[].id'
curl -sS -H "Authorization: Bearer $PAT" https:///api/v2/schema/ | jq
curl -sS -H "Authorization: Bearer $PAT" -X POST -H "Content-Type: application/json" \
  -d '{"table_id": "", "select": ["..."], "where": "...", "limit": 100}' \
  https:///api/v2/scan/estimate | jq
```

Expected: each returns 200 with reasonable JSON.
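
The same gate can be scripted so it is repeatable across rebuilds. The following is a minimal sketch using only the standard library, not part of the plan's deliverables: `build_v2_request`, `check_json_response`, and `smoke_check` are hypothetical helper names, and the base URL, token, table id, and estimate payload are all supplied by the caller because the plan leaves them unspecified.

```python
import json
import urllib.request
from typing import Optional

V2_PREFIX = "/api/v2"  # path prefix taken from the endpoints listed in this plan


def build_v2_request(
    base_url: str, path: str, token: str, payload: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a v2 request; POST when a JSON payload is given."""
    url = base_url.rstrip("/") + V2_PREFIX + path
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "POST" if payload is not None else "GET"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def check_json_response(status: int, body: bytes) -> dict:
    """Raise AssertionError unless the response is HTTP 200 with a JSON object body."""
    assert status == 200, f"expected 200, got {status}"
    parsed = json.loads(body)
    assert isinstance(parsed, dict), "expected a JSON object"
    return parsed


def smoke_check(base_url: str, token: str, table_id: str, estimate_payload: dict) -> None:
    """Hit catalog, schema, and scan/estimate once each (performs real HTTP calls)."""
    requests = [
        build_v2_request(base_url, "/catalog", token),
        build_v2_request(base_url, f"/schema/{table_id}", token),
        build_v2_request(base_url, "/scan/estimate", token, estimate_payload),
    ]
    for req in requests:
        # urlopen raises HTTPError on 4xx/5xx, so a failing endpoint stops the loop.
        with urllib.request.urlopen(req, timeout=30) as resp:
            check_json_response(resp.status, resp.read())
        print("ok", req.get_method(), req.full_url)
```

Calling `smoke_check(base_url, token, table_id, payload)` with the dev VM's URL and a PAT would exercise the three JSON endpoints; `/api/v2/scan` is left out because it streams Arrow IPC rather than JSON.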

- [ ] **Step 22.4: Run `da fetch` end-to-end**

From a laptop with `da` CLI installed (and pointed at the dev VM):

```bash
da catalog --json | jq '.tables[].id'
da schema
da fetch \
  --select \
  --where "" \
  --limit 1000 \
  --as test_snap \
  --estimate
da fetch --select --where "" --limit 1000 --as test_snap --no-estimate
da query "SELECT COUNT(*) FROM test_snap"
da snapshot list
da snapshot drop test_snap
```

All commands should succeed. `da query` on the snapshot returns the expected row count.

- [ ] **Step 22.5: Verify Claude rails (manual gate)**

Open Claude Code in a fresh session against the Agnes repo. Ask: "Show me the count of rows in `` for the last 7 days."

Expected agent flow (visible in the conversation):

1. Run `da catalog`
2. Run `da schema `
3. Run `da fetch --where "..." --limit ... --as ...`
4. Run `da query "SELECT COUNT(*) FROM ..."`
5. Report the count

Repeat 2 more times in fresh sessions. Document the transcripts in the PR description.

- [ ] **Step 22.6: Open PR**

```bash
gh pr create --title "v2 fetch primitives + Claude agent rails (#101)" --body "$(cat < estimate -> fetch -> query loop in <2 min
- [x] Three unguided Claude sessions follow the protocol

## Closes
- #101 (BQ view-wrapping outer-query pushdown)

## Spec & plan
- docs/superpowers/specs/2026-04-27-claude-fetch-primitives-design.md
- docs/superpowers/plans/2026-04-27-claude-fetch-primitives.md
EOF
)"
```

- [ ] **Step 22.7: Mark E2E gate complete in PR description**

Update the PR body with:
- Commit SHAs for each phase
- Demo recording link or summary
- Three Claude transcripts confirming agent rails are followed

---

## Self-Review

**Spec coverage:**
- §1 motivation, §2 architecture: covered by tasks 7-11 + 12 (drop wrap)
- §3.0 identifier conventions: enforced via existing `validate_identifier` + Task 7+8+9 RBAC use of registry id
- §3.1 catalog: Task 7
- §3.2 schema: Task 8
- §3.3 sample: Task 9
- §3.4 scan: Task 11
- §3.5 scan/estimate: Task 10
- §3.6 caching: Task 5 + reuse in Tasks 7/8/9
- §3.7 validator: Tasks 1-3
- §3.8 quotas: Task 4 (used in Task 11)
- §4.1 catalog/schema/describe CLI: Task 15
- §4.2 fetch + snapshot mgmt: Tasks 16, 17
- §4.3 disk-info: Task 18
- §5 CLAUDE.md + skill: Tasks 19, 20
- §6 migration: Task 12
- §10 implementation contracts: distributed across tasks (audit shape in Task 11, exit codes in Tasks 16/17, error UX in `_exit_code_for` helpers, config knobs in Task 21)
- §11 success criteria: Task 22

**Placeholder scan:** No "TBD"/"TODO" found.

**Type consistency:**
- `validate_where(predicate, table_id, schema)` — same signature in tasks 1-3 and Task 11.
- `QuotaTracker.acquire(user)` / `record_bytes(user, n)` — same in Task 4 + Task 11.
- `SnapshotMeta` dataclass — same fields in Task 14 + Tasks 16-17.
- `ScanRequest` pydantic model — same in Tasks 10 + 11.
- `_resolve_schema(conn, user, table_id, project_id)` — defined in Task 10, reused in Task 11.

No drift detected.

---

**Total tasks: 22**
**Estimated effort: ~16-18 person-days (1 dev) / 8-9 days (2 devs in parallel)**