feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs

Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface: no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW, wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI

Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
- test_diagnose_billing.py: 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
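
Reader's note: the three test cases named in the commit message can be pictured with a simplified stand-in for the helper. This is only a sketch. `classify_billing` is a hypothetical pure function that takes its inputs as arguments, whereas the real `_check_bq_billing_project()` reads them from instance config and `get_bq_access()`; the project names are made up.

```python
from typing import Optional


def classify_billing(source_type: str, billing: str, data: str) -> Optional[dict]:
    """Hypothetical pure-function version of the billing-project check."""
    if (source_type or "").lower() != "bigquery":
        return None  # non-BQ instances skip the check entirely
    if not data:
        return {"status": "ok", "detail": "BigQuery project not configured"}
    if billing == data:
        # The footgun: jobs are billed to the (possibly read-only) data project.
        return {
            "status": "warning",
            "detail": "BigQuery billing project equals data project",
            "billing_project": billing,
            "data_project": data,
        }
    return {"status": "ok", "billing_project": billing, "data_project": data}


# The three cases: billing == data warns, billing != data is clean, non-BQ skips.
warns = classify_billing("bigquery", "prj-data", "prj-data")
clean = classify_billing("bigquery", "prj-billing", "prj-data")
skipped = classify_billing("keboola", "", "")
```

Keeping the decision separate from config lookup is a choice made for this illustration; it lets each case be exercised without mocking the BigQuery access layer.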
2026-05-01 20:27:24 +02:00 · 2026-05-01 20:27:24 +02:00 · b627de8344
commit b627de8344
parent df7f5b1d9a
10 changed files with 3071 additions and 21 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,258 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C

 ## [Unreleased]

+### Added
+- **admin UI**: each row in `/admin/tables` listings now has a per-row
+  **Manage access** icon button (between Edit and Delete) that deep-links
+  to `/admin/access#table:<table_id>`. The grant editor reads the hash on
+  load and pre-fills the resource filter so the operator lands on the
+  picked table once they select a group — shortcut for the common
+  "I just registered table X, who should see it?" workflow without
+  manual navigation through the resource tree.
+- **docs**: `config/instance.yaml.example` documents every field newly
+  exposed by `/admin/server-config` — `data_source.bigquery.billing_project`
+  (with the USER_PROJECT_DENIED hint), `data_source.bigquery.legacy_wrap_views`,
+  `data_source.bigquery.max_bytes_per_materialize`, `ai.base_url`,
+  `openmetadata.*`, `desktop.*`, and the full `corporate_memory.*` block.
+  Each cross-references the admin UI so operators discover the editor exists.
+- **diagnostics**: `/api/health/detailed` (and therefore `da diagnose`) now
+  surfaces a `bq_config` service entry on BigQuery instances. Reports
+  `status="warning"` when `data_source.bigquery.billing_project` resolves
+  equal to `data_source.bigquery.project` — the configuration where a
+  service account with `roles/bigquery.dataViewer` on the data project but
+  no `serviceusage.services.use` gets a 403 on every BQ call with
+  USER_PROJECT_DENIED. The warning includes a hint pointing at the
+  `instance.yaml` field and the `/admin/server-config` UI.
+- **admin UI**: `/admin/server-config` exposes the full **corporate_memory
+  governance schema** in the editor — `distribution_mode`, `approval_mode`,
+  `review_period_months`, `notify_on_new_items`, the `sources` /
+  `extraction` / `confidence` / `contradiction_detection` /
+  `entity_resolution` nested objects, plus the `domain_owners` /
+  `domains` lists. The whole section is optional (omitted = legacy
+  democratic-wiki mode); admins can opt in via the UI without hand-editing
+  YAML. Schema mirrors `config/instance.yaml.example` lines 224-317.
+  `confidence.modifiers` (map<string, map<string, float>>) currently
+  renders as a JSON-textarea fallback with the schema explained inline —
+  full structured editor is a TODO.
+- **admin UI**: server-config renderer learned three new shapes —
+  `kind="array"` with a scalar `item_kind` renders as a vertical stack
+  of typed inputs with +/- row controls; `kind="map"` with scalar
+  `value_kind` renders as key:value rows with +/- controls;
+  `value_kind="array"` inside a map renders the value column as a
+  comma-separated list (pragmatic compromise over a full nested-array
+  UI inside each map row). Leaf inputs now carry `data-path` (JSON-encoded
+  segment array) so map keys with embedded dots —
+  e.g. `confidence.base["user_verification.correction"]` — survive
+  round-trip without being mistaken for nested-path separators.
+- **admin UI**: `/admin/server-config` renders registry-declared nested
+  fields (`kind="object"` with explicit `fields`) as a fully-editable
+  structured form — every leaf is its own input with a dotted-path
+  `data-key`, and the collector rebuilds a nested patch on save. Replaces
+  the previous read-only preview that forced operators to edit a parent
+  JSON textarea. YAML-only keys outside the registry survive via an
+  "Other (YAML-only) keys" expander per nested layer. Recursion handles
+  arbitrary depth, ready for the upcoming corporate_memory + admins
+  registry entries.
+- **admin UI**: `/admin/server-config` now ships a known-fields registry
+  (`_KNOWN_FIELDS` in `app/api/admin.py`, exposed on the GET response as
+  `known_fields`). The renderer shows registry-declared knobs as dashed
+  placeholders alongside populated values, with a one-line hint per
+  field, so operators discover optional config (e.g.
+  `data_source.bigquery.billing_project`) directly in the UI instead of
+  having to read docs or hit a runtime error first. Subagents 2-4 will
+  populate the bodies; the smoke fixture covers `bigquery.billing_project`.
+- **admin UI**: `/admin/server-config` now exposes three previously
+  YAML-only BigQuery knobs in the editor — `data_source.bigquery.billing_project`,
+  `legacy_wrap_views`, and `max_bytes_per_materialize`. The GET response
+  always includes them under `data_source.bigquery` (with documented
+  defaults when YAML omits them) so the JSON-textarea UI shows them as
+  editable keys. The section help text describes each. Operators no
+  longer need to SSH to the VM, edit YAML, and restart to flip these.
+- **admin UI**: `/admin/tables` is now a per-connector tab interface
+  (BigQuery / Keboola / Jira). Each tab has its own Register modal +
+  listing scoped to its source_type. Active tab persists in
+  `window.location.hash` so refresh keeps the operator in place.
+- **Keboola materialized SQL**: `query_mode='materialized'` now works
+  for `source_type='keboola'` — admin registers a SELECT against
+  `kbc."bucket"."table"` and the scheduler writes the result to
+  `/data/extracts/keboola/data/<id>.parquet`. Same flow as BigQuery
+  materialized; same `da sync` distribution; same RBAC. Cost guardrail
+  (BQ-style dry-run) intentionally omitted — Keboola extension has no
+  dry-run analog and Storage API cost is download-byte-shaped, not
+  scan-byte-shaped. A future PR can add a configurable byte cap if
+  operators ask for it.
+- **Keboola Sync Schedule**: per-table cron input added to the Keboola
+  tab Register and Edit modals. The scheduler has always honored
+  per-table `sync_schedule` for every source via `is_table_due()`,
+  but the Keboola UI had no surface for it — operators had to use the
+  `/api/admin/registry/{id}` PUT endpoint or `da admin` CLI. Now they
+  can type `every 6h` / `daily 03:00` directly.
+- **BigQuery `query_mode='materialized'`** — admin registers a SQL query
+  via `da admin register-table --query-mode materialized --query @file.sql
+  --sync-schedule "every 6h"`; the sync trigger pass runs it through the
+  DuckDB BigQuery extension via the `BqAccess` facade on each tick that's
+  due (per-table `sync_schedule` honored via `is_table_due()`) and writes
+  the result to `/data/extracts/bigquery/data/<name>.parquet`. The
+  manifest endpoint exposes the row to `da sync`, which distributes the
+  parquet to analysts; analysts query it through their **local** DuckDB
+  view. The server-side orchestrator does **not** create a master view
+  for materialized tables — they are intentionally local-only for
+  analyst distribution, mirroring the v2 fetch primitives' "queryable
+  via `da fetch` not via remote" contract. Per-user RBAC filtering is
+  unchanged: a materialized table is just another row in
+  `table_registry` with `resource_grants` controlling which groups see it.
+- **Schema v20** adds `source_query TEXT` column to `table_registry` to
+  back the materialized mode. NULL for existing rows. The
+  `materialize_query()` function in the BigQuery extractor performs the
+  COPY atomically (`<id>.parquet.tmp` → `os.replace`) so a failed query
+  never leaves a half-written parquet.
+- BigQuery cost guardrail for `query_mode='materialized'` tables: before
+  each COPY the scheduler runs a BQ dry-run (reusing
+  `app.api.v2_scan._bq_dry_run_bytes` so cost-estimate logic lives in
+  exactly one place) and raises `MaterializeBudgetError` (skips the row)
+  when the estimate exceeds `data_source.bigquery.max_bytes_per_materialize`.
+  Default 10 GiB; explicit `0` disables (YAML `null` falls through to
+  the default — documented in `config/instance.yaml.example`).
+  Fail-open when the dry-run itself errors (library missing, DuckDB
+  three-part syntax the native BQ client can't parse, transient API
+  failure) — logs a warning instead of blocking the COPY.
+- Admin API: `POST /api/admin/register-table` and
+  `PUT /api/admin/registry/{id}` accept `source_query` field. Validator
+  enforces that `query_mode='materialized'` requires `source_query` and
+  `query_mode in ('local', 'remote')` forbids it. PUT also rejects
+  `source_query` set without `query_mode` in the same request body and
+  clears the stale `source_query` when switching the merged record away
+  from materialized mode.
+- CLI: `da admin register-table --query <SQL>` accepts inline SQL or
+  `@path/to.sql` shorthand for reading from disk. Reuses the existing
+  `--sync-schedule` flag for the cron string.
+- `da sync --quiet` flag suppresses Rich progress + multi-line summary,
+  intended for use from Claude Code SessionStart/SessionEnd hooks and
+  cron jobs. Errors still surface on stderr; the no-op case is silent.
+  The terse summary line in `--quiet` mode (`sync: N tables, M errors`)
+  lands on stderr so stdout stays clean for hook callers.
+- `da analyst setup` now installs `SessionStart` (pull) and `SessionEnd`
+  (upload) hooks into `<workspace>/.claude/settings.json`, idempotently,
+  preserving any existing user-owned hooks. Workspace-level (not
+  user-home) so the hooks fire only when Claude Code is opened in the
+  analyst workspace, not in unrelated sessions on the same machine.
+  Hooks assume `da` is on `PATH`. If the CLI is not installed system-wide
+  (e.g. via `pipx` or `pip install -e .`), the hooks no-op silently —
+  expected graceful degradation, never blocks a session.
+- `docs/setup/claude_settings.json` ships the same two hooks so operators
+  bootstrapping a fresh Claude Code workspace get auto-sync out of the box.
+
+### Changed
+- **admin UI**: Keboola Register and Edit modals adopt the same
+  two-question radio model as BigQuery — *What to sync?* (Whole table
+  / Custom SQL). Whole-table mode synthesizes a `SELECT *` and writes
+  it through the materialized path; Custom mode lets the admin filter
+  / aggregate / project. The legacy `query_mode='local'` extractor
+  path remains supported for back-compat but is no longer the default
+  for new Keboola registrations — Whole mode is functionally
+  equivalent and follows the unified materialized pipeline.
+- **admin UI**: `Sync Strategy` dropdown removed from the Keboola form
+  (Register and Edit). Two independent agent reviews (2026-05-01) found
+  the field's hint claimed it controlled extraction but no extractor
+  reads it; only `profiler.is_partitioned()` consumes it for parquet-
+  layout detection. Field stays in the DB and Pydantic model for
+  back-compat (marked `Field(deprecated=True)`); just hidden from the
+  primary form.
+- **admin UI**: `Primary Key` input moved under `<details>Advanced` in
+  both Keboola Register and Edit modals, with a clarifying hint that
+  it's catalog metadata only — Agnes always does full-overwrite sync;
+  no upsert / dedup. Auto-fill from Keboola discovery still works.
+- **admin UI**: Registry listing column "Strategy" replaced with "Mode"
+  (showing `query_mode` instead of decorative `sync_strategy`). The
+  `.col-strategy` / `.strategy-badge` CSS rules removed.
+- BigQuery `init_extract` no longer creates remote views for rows with
+  `query_mode='materialized'`; those live as parquets and surface via
+  the orchestrator's standard local-parquet discovery. Skipped rows do
+  not appear in `_meta` so cross-source view-name collisions remain
+  impossible.
+
+### Deprecated
+- `RegisterTableRequest.sync_strategy` — catalog/profiler metadata only;
+  no extractor reads it. Marked `Field(deprecated=True)`. External API
+  consumers see the signal in OpenAPI; back-compat preserved.
+- `RegisterTableRequest.profile_after_sync` — runtime never read this
+  flag (Agent 1 finding 2026-05-01); profiler runs unconditionally on
+  every synced table. Marked `Field(deprecated=True)` and made inert
+  (the BQ register endpoint no longer force-sets it to `False`).
+  Back-compat preserved — external clients sending the field get no
+  error, no warning, no effect.
+
+### Fixed
+- **admin API**: `update_table` PUT preserves `sync_strategy` and
+  `primary_key` when the Edit modal omits them from the payload (this
+  invariant always held via `request.model_dump()` + `if v is not None`,
+  but Phase I now has an explicit regression-guard test).
+- `docs/setup/claude_settings.json` no longer references the deleted
+  `server/scripts/collect_session.py` — the dead `SessionEnd` hook had
+  silently failed in every Claude Code session since the v1→v2 server
+  purge. Replaced with `da sync --upload-only --quiet`.
+
+### Internal
+- README mode-first source table; new "Local sync & auto-update" section
+  covering `da sync`, hooks, and admin RBAC for auto-sync membership.
+- `CLAUDE.md` schema chain extended through v20 with the `source_query`
+  description; four source modes documented in Connector Pattern (added
+  Materialized SQL); new "Local sync & Claude Code hooks" subsection
+  under Development.
+- `cli/skills/connectors.md` — "BigQuery: pick a mode" decision table
+  with cost / guardrail / registration example.
+- `docs/architecture.md` — new "BigQuery — Materialized SQL" subsection
+  describing the COPY pipeline, BqAccess integration, and cost guardrail.
+- BQ cost guardrail dry-run is performed via the native
+  `google-cloud-bigquery` client (through `BqAccess.client()`), which
+  does not parse DuckDB three-part identifiers (`bq."ds"."t"`). Queries
+  written in DuckDB syntax fall through fail-open and log a warning
+  instead of engaging the cap. Operators who need the cap to be
+  enforceable must register the materialized SQL using native BQ
+  identifiers (`\`project.ds.t\``).
+- Hardenings landed during devil's-advocate review of PR #145:
+  - `materialize_query` computes the parquet MD5 inline (after COPY,
+    before `os.replace`) instead of re-reading the file in
+    `_run_materialized_pass` — saves a full sequential read on the
+    request thread for multi-GB parquets.
+  - 0-row materializations log a `WARNING` so an empty result set
+    can't masquerade as "the SQL is fine, today there's nothing".
+  - The ATTACH-tolerated `except duckdb.Error: pass` is narrowed to
+    the "alias already attached" case; real errors (cross-project
+    permission, malformed project_id) propagate so the per-row
+    aggregator records them correctly instead of surfacing a
+    confusing downstream "bq is not attached".
+
+### Known limitations
+Operators should be aware of these production-only behaviours; tests
+cannot exercise them and they will be revisited in follow-up PRs:
+
+- **GCE metadata token expiry mid-COPY (catastrophic for very long
+  scans).** The DuckDB BQ extension caches the token in a session
+  SECRET created at session-open. A `materialize_query` call that
+  takes longer than the token's remaining lifetime (~1h) will see
+  silent 401s downstream and may produce a truncated parquet. No
+  current mitigation; if your materialized SQL scans more than ~30
+  GiB on a single COPY, run it via the BQ console / Storage Read
+  API offline and `da fetch` the result instead until token refresh
+  is wired into the BQ extension's session.
+- **DuckDB `bigquery` community extension is unpinned** —
+  `INSTALL bigquery FROM community; LOAD bigquery;` picks up the
+  latest published version on every cold start. A breaking change
+  upstream surfaces as a production failure with no test signal.
+- **Schema drift after a SQL edit silently breaks analyst queries.**
+  Editing `source_query` to drop a column writes a new parquet with
+  the new shape; analysts' queries that referenced the dropped
+  column 500 on the next sync without warning. No diff or version
+  field surfaces this. Workaround: announce changes in the team
+  channel before editing materialized SQL.
+- **`materialize_query` is not concurrency-locked.** Two concurrent
+  `/api/sync/trigger` calls for the same materialized row race on
+  `<id>.parquet.tmp`. `init_extract` has `_INIT_EXTRACT_LOCK` for
+  the remote-attach path, but the materialized path does not yet.
+  In practice: the cron scheduler is single-threaded and manual
+  triggers are rare, so the race window is small.
+
 ## [0.29.0] — 2026-05-01

 ### Fixed
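
Reader's note on the Schema v20 entry in the CHANGELOG hunk above: the "write to `<id>.parquet.tmp`, hash, then `os.replace`" pattern can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's `materialize_query()`: `write_atomically` and the `produce` callback (standing in for the COPY step) are hypothetical names.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> str:
    """Write via '<name>.tmp', hash the temp file, then swap it into place.

    `produce` is a hypothetical callback standing in for the COPY step.
    Returns the MD5 hex digest of the bytes that were written.
    """
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)  # may raise; final_path is untouched if it does
        md5 = hashlib.md5()
        with open(tmp_path, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                md5.update(chunk)
        os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
        return md5.hexdigest()
    finally:
        # Remove the temp file if the swap never happened.
        if tmp_path.exists():
            tmp_path.unlink()


# Usage: a trivial producer that writes a few bytes.
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "orders.parquet"
    digest = write_atomically(target, lambda p: p.write_bytes(b"demo"))
    written = target.read_bytes()
```

Hashing before the rename means the digest describes exactly the file that becomes visible, and a failed producer leaves any previous parquet in place.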
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -117,9 +117,10 @@ The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into
          (serve)    (da sync)
 ```

-Three source types:
- **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
+Source modes:
+- **Batch pull** (Keboola, `query_mode='local'`): DuckDB extension downloads to parquet, scheduled
+- **Remote attach** (BigQuery, `query_mode='remote'`): DuckDB BQ extension, no download, queries go to BQ
+- **Materialized SQL** (BigQuery, `query_mode='materialized'`): scheduler runs admin-registered SQL through DuckDB BQ extension (via `BqAccess` from `connectors/bigquery/access.py`) and writes the result to `/data/extracts/bigquery/data/<id>.parquet`. Distributed via the same manifest + `da sync` flow as Keboola tables. Cost guardrail via `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable — YAML `null` falls through to the default).
 - **Real-time push** (Jira): Webhooks update parquets incrementally

 ## Configuration
@@ -148,6 +149,19 @@ curl -X POST http://localhost:8000/api/sync/trigger
 docker compose up
 ```

+### Local sync & Claude Code hooks
+
+`da sync` is the canonical analyst-side distribution path: pulls the RBAC-filtered manifest from the server, downloads parquets whose MD5 changed (skipping `query_mode='remote'` rows), rebuilds local DuckDB views over them.
+
+`da analyst setup` writes two hooks into `<workspace>/.claude/settings.json`:
+
+- `SessionStart` → `da sync --quiet` — pulls fresh parquets at the start of every Claude Code session
+- `SessionEnd`   → `da sync --upload-only --quiet` — uploads session jsonl + `CLAUDE.local.md` to the server
+
+Both pass `--quiet` so they don't pollute Claude Code stdout, and trail with `|| true` so a server outage never blocks a session. Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace, not in unrelated sessions on the same machine.
+
+Admin RBAC for auto-sync: `query_mode IN ('local', 'materialized')` plus a `resource_grants` row for one of the analyst's groups → table appears in their manifest → `da sync` downloads it. No per-user sync config; the admin layer is the single source of truth.
+
 ## Business Metrics

 Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:
@@ -416,7 +430,7 @@ Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `googl
 ## Key Implementation Details

 ### DuckDB Schema (src/db.py)
- Schema v19 with auto-migration v1→…→v19 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`** — see CHANGELOG and docs/RBAC.md)
+- Schema v20 with auto-migration v1→…→v20 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)** — see CHANGELOG and docs/RBAC.md)
 - `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
 - `sync_state`, `sync_history`: track extraction progress
 - `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
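
Reader's note on the Materialized SQL cost guardrail described in the CLAUDE.md hunks above (default 10 GiB, explicit `0` disables, YAML `null` falls through to the default, fail-open when the dry-run errors): the semantics can be sketched as below. `resolve_cap` and `should_run` are hypothetical names for illustration; the real scheduler raises `MaterializeBudgetError` and logs a warning rather than returning booleans.

```python
from typing import Callable, Optional

DEFAULT_CAP_BYTES = 10 * 1024**3  # 10 GiB, the documented default


def resolve_cap(configured: Optional[int]) -> Optional[int]:
    """Return the effective byte cap, or None when the guardrail is disabled."""
    if configured is None:
        return DEFAULT_CAP_BYTES  # YAML null or a missing key falls through to the default
    if configured == 0:
        return None  # explicit 0 disables the guardrail
    return configured


def should_run(configured: Optional[int], dry_run: Callable[[], int]) -> bool:
    """Decide whether a materialization may proceed, given a dry-run estimator."""
    cap = resolve_cap(configured)
    if cap is None:
        return True
    try:
        estimate = dry_run()
    except Exception:
        return True  # fail-open: the real code logs a warning here
    return estimate <= cap
```

Note that fail-open trades safety for availability: a query the dry-run cannot parse is never blocked, which matches the caveat in the CHANGELOG about DuckDB three-part identifiers.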
--- a/README.md
+++ b/README.md
@@ -40,11 +40,14 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an

 ## Supported Data Sources

-| Source | Mode | Description |
-|--------|------|-------------|
-| **Keboola** | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule |
-| **BigQuery** | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download |
-| **Jira** | Real-time push | Webhook receiver updates Parquet files incrementally |
+| Mode | Distribution | Sources | Use when |
+|------|--------------|---------|----------|
+| **Batch pull** (`local`) | Parquet on disk, scheduled | Keboola | Source has a native bulk-export and the table fits on disk |
+| **Materialized SQL** (`materialized`) | Parquet on disk, scheduled query | BigQuery | Source table is too large to mirror; you want a curated subset on disk |
+| **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
+| **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
+
+The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets.

 Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.

@@ -77,6 +80,44 @@ Once running, the FastAPI app is available at `http://localhost:8000` (or `https
 curl -X POST http://localhost:8000/api/sync/trigger
 ```

+## Local sync & auto-update
+
+Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path:
+
+```bash
+da sync             # delta-pull: manifest → MD5 compare → download changed → rebuild views
+da sync --quiet     # same, no progress output (for hooks/cron)
+da sync --upload-only  # push session jsonl + CLAUDE.local.md back to the server
+```
+
+`da analyst setup` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
+
+- `SessionStart` → `da sync --quiet` — fresh data on every session
+- `SessionEnd` → `da sync --upload-only --quiet` — uploads notes and session log
+
+Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
+
+### Admin: which tables auto-sync to whom
+
+The auto-sync set per analyst is the intersection of:
+
+1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
+2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
+
+To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`.
+
+For BigQuery, register a `query_mode='materialized'` table with a SQL body:
+
+```bash
+da admin register-table orders_90d \
+    --source-type bigquery \
+    --query-mode materialized \
+    --query @docs/queries/orders_90d.sql \
+    --sync-schedule "every 6h"
+```
+
+The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB); queries whose BQ dry-run estimate exceeds the cap are skipped.
+
 ## Development Setup

 ```bash
--- a/app/api/health.py
+++ b/app/api/health.py
@@ -18,6 +18,65 @@ router = APIRouter(tags=["health"])
 _DEPLOYED_AT = datetime.now(timezone.utc).isoformat()


+def _check_bq_billing_project() -> dict | None:
+    """Surface the USER_PROJECT_DENIED footgun when a BQ instance has
+    `billing_project` falling back to (or explicitly equal to) `project`.
+
+    Background: connectors/bigquery/access.py:339-342 lets `billing` default
+    to `data` when `billing_project` is unset. A service account with
+    `roles/bigquery.dataViewer` on the data project but no
+    `serviceusage.services.use` on it then 403s on every BQ call with
+    USER_PROJECT_DENIED. The config is technically valid, so we warn rather
+    than error — the operator's billable project must be set distinctly.
+
+    Returns:
+      None when the check doesn't apply (non-BQ instance, or config import fails).
+      A service-entry dict otherwise: {"status": "ok"} (with "detail" if BQ
+      projects can't be resolved) or {"status": "warning", "detail": ...,
+      "hint": ..., "billing_project": ..., "data_project": ...}.
+    """
+    try:
+        from app.instance_config import get_data_source_type
+    except Exception:
+        return None
+    if (get_data_source_type() or "").lower() != "bigquery":
+        return None
+
+    try:
+        from connectors.bigquery.access import get_bq_access
+        bq = get_bq_access()
+        billing = bq.projects.billing
+        data = bq.projects.data
+    except Exception as e:
+        return {"status": "ok", "detail": f"could not resolve BQ projects: {e}"}
+
+    if not data:
+        # not_configured sentinel — surfaced elsewhere; nothing to warn about here.
+        return {"status": "ok", "detail": "BigQuery project not configured"}
+
+    if billing == data:
+        return {
+            "status": "warning",
+            "detail": "BigQuery billing project equals data project",
+            "hint": (
+                "Set data_source.bigquery.billing_project in instance.yaml to a "
+                "project the SA can bill against (typically your dev/billable "
+                "project, distinct from a shared read-only data project). "
+                "Otherwise BQ calls 403 USER_PROJECT_DENIED whenever the SA "
+                "lacks serviceusage.services.use on the data project. "
+                "Configurable via /admin/server-config UI."
+            ),
+            "billing_project": billing,
+            "data_project": data,
+        }
+
+    return {
+        "status": "ok",
+        "billing_project": billing,
+        "data_project": data,
+    }
+
+
 def _check_db_schema() -> dict:
    """Check DB schema version against expected SCHEMA_VERSION.

@@ -103,6 +162,11 @@ async def health_check_detailed(
    except Exception as e:
        checks["users"] = {"status": "error", "detail": str(e)}

+    # BigQuery billing-project sanity check (USER_PROJECT_DENIED footgun).
+    bq_cfg = _check_bq_billing_project()
+    if bq_cfg is not None:
+        checks["bq_config"] = bq_cfg
+
    overall = "healthy"
    for check in checks.values():
        if check.get("status") == "error":
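
Reader's note: the hunk above is cut off inside the loop that folds per-service statuses into `overall`. The commit message says a warning is promoted to `degraded`; a sketch of that aggregation might look like the following. `overall_status` is a hypothetical helper, and the `"unhealthy"` label for the error case is an assumption, since the diff does not show it.

```python
def overall_status(checks: dict) -> str:
    """Fold per-service check results into one overall status string."""
    statuses = {check.get("status") for check in checks.values()}
    if "error" in statuses:
        return "unhealthy"  # assumed label; not visible in the diff
    if "warning" in statuses:
        return "degraded"  # warning -> degraded, per the commit message
    return "healthy"


# Usage: a bq_config warning alone is enough to degrade the overall status.
example = overall_status(
    {"db_schema": {"status": "ok"}, "bq_config": {"status": "warning"}}
)
```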
--- a/cli/skills/connectors.md
+++ b/cli/skills/connectors.md
@@ -51,3 +51,28 @@ The `_meta` table must have columns:
 - Instance-level config: `config/instance.yaml` (connection details)
 - Table definitions: DuckDB `table_registry` table
 - Credentials: environment variables
+
+## BigQuery: pick a mode
+
+| Need | Mode | Why |
+|------|------|-----|
+| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
+| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
+| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
+| One-off subquery joined with local data | (no registry row) | Use `da query --register-bq …` for ad-hoc |
+
+Cost: `materialized` runs once per `sync_schedule` regardless of how many analysts query it; `remote` runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned.
+
+Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`.
+
+Register a materialized table:
+
+```bash
+da admin register-table orders_90d \
+    --source-type bigquery \
+    --query-mode materialized \
+    --query @docs/queries/orders_90d.sql \
+    --sync-schedule "every 6h"
+```
+
+`--query` also accepts inline SQL.
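
Note on the cost paragraph in the hunk above: the break-even rule can be written out as a small calculation. This is a rough sketch under the simplifying assumption that cost is proportional to bytes scanned; the function names and the example numbers are hypothetical and not part of the CLI.

```python
GIB = 1024**3


def bytes_scanned_per_interval(
    mode: str,
    queries_per_interval: int,
    bytes_per_remote_query: int,
    bytes_per_copy: int,
) -> int:
    """Bytes BigQuery scans during one sync interval for the given mode."""
    if mode == "materialized":
        # One scheduled COPY, independent of how many analysts query the result.
        return bytes_per_copy
    if mode == "remote":
        # Every analyst query hits BigQuery.
        return queries_per_interval * bytes_per_remote_query
    raise ValueError(f"unknown mode: {mode!r}")


def cheaper_mode(
    queries_per_interval: int, bytes_per_remote_query: int, bytes_per_copy: int
) -> str:
    """Pick the mode that scans fewer bytes; ties go to `remote` (no local disk)."""
    remote = bytes_scanned_per_interval(
        "remote", queries_per_interval, bytes_per_remote_query, bytes_per_copy
    )
    materialized = bytes_scanned_per_interval(
        "materialized", queries_per_interval, bytes_per_remote_query, bytes_per_copy
    )
    return "materialized" if materialized < remote else "remote"


# 40 analyst queries of 2 GiB each per interval vs. one 10 GiB COPY.
print(cheaper_mode(40, 2 * GIB, 10 * GIB))
```

With few queries per interval the comparison flips: 3 queries of 2 GiB scan less than one 10 GiB COPY, so `remote` wins.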
--- a/config/instance.yaml.example
+++ b/config/instance.yaml.example
@ -115,17 +115,30 @@ data_source:
    location: "${BIGQUERY_LOCATION}"     # BigQuery location (e.g., "us-central1", "US")
    # Uses ADC (Application Default Credentials) - VM service account on GCP
    # Data can live in a different project -- use fully-qualified table IDs in data_description.md
-    # billing_project: ""                # Optional: GCP project to bill BQ jobs to / submit jobs from.
-    #                                    # Defaults to `project`. Set this when the SA has bigquery.data.* on
-    #                                    # the data project but lacks serviceusage.services.use there (i.e.,
-    #                                    # cross-project read pattern). Submission/billing target must be a
-    #                                    # project the SA can use; data project just needs read.
-    # legacy_wrap_views: false           # Set true to restore pre-v2 wrap views for BQ VIEW/MATERIALIZED_VIEW
-    #                                    # tables in analytics.duckdb (migration escape hatch; default: false)
+    # billing_project: "prj-billing"     # GCP project to bill BQ jobs to / submit jobs from.
+    #                                    # Defaults to `project`. Set when the SA has bigquery.data.* on
+    #                                    # the data project but lacks serviceusage.services.use there;
+    #                                    # otherwise every BQ call fails with 403 USER_PROJECT_DENIED.
+    #                                    # `da diagnose` warns when this is unset or equal to `project`.
+    #                                    # Configurable via /admin/server-config UI.
+    # legacy_wrap_views: false           # When true, registered VIEWs and MATERIALIZED_VIEWs get a DuckDB
+    #                                    # master view via bigquery_query() (jobs API) so analysts can
+    #                                    # `SELECT * FROM viewname` directly. When false (default), views
+    #                                    # are catalog-only -- analysts use `da fetch viewname` or
+    #                                    # `da query --remote`. ON can cause "Response too large" on big
+    #                                    # views; OFF requires analyst-side discipline (CLAUDE.md rails).
+    #                                    # Toggle ON for view-heavy deployments where most views are small.
+    #                                    # Configurable via /admin/server-config UI.
+    # max_bytes_per_materialize: 10737418240
+    #                                    # Cost guardrail (bytes) for query_mode='materialized' BQ scans.
+    #                                    # A BQ dry-run runs first; if its estimate exceeds this cap, the
+    #                                    # registration / sync is rejected. Default 10 GiB (10737418240).
+    #                                    # 0 disables; null uses the default. Configurable via /admin/server-config UI.

 # --- OpenMetadata catalog (optional) ---
 # Enriches table and column metadata from OpenMetadata REST API.
 # If not configured, app works normally without catalog enrichment.
+# All openmetadata.* fields configurable via /admin/server-config UI.
 # openmetadata:
 #   url: "https://your-catalog.example.com"
 #   token: "${OPENMETADATA_TOKEN}"        # JWT bearer token
@ -147,6 +160,7 @@ email:
  smtp_password: "${SMTP_PASSWORD}"

 # --- Desktop app (optional) ---
+# All desktop.* fields configurable via /admin/server-config UI (rarely changed once set).
 desktop:
  jwt_issuer: "data-analyst"
  jwt_secret: "${DESKTOP_JWT_SECRET}"
@ -174,7 +188,9 @@ jira:
 ai:
  provider: "anthropic"                    # or "openai_compat"
  api_key: "${ANTHROPIC_API_KEY}"          # or "${LLM_API_KEY}" for proxy
-  # base_url: "https://litellm.example.com"  # required for openai_compat
+  # base_url: "https://litellm.example.com"  # Required for provider='openai_compat' (LiteLLM,
+                                              # OpenRouter, vLLM). Ignored when provider='anthropic'.
+                                              # Configurable via /admin/server-config UI.
  model: "claude-haiku-4-5-20251001"       # any model available on your provider
  # --- Structured output quality control ---
  # AI models can return JSON in three ways, each with different reliability:
@ -225,6 +241,10 @@ ai:
 # Controls how AI-extracted knowledge is reviewed and distributed.
 # If not present, system operates in legacy mode (democratic wiki, no admin review).
 #
+# The corporate_memory.* schema is editable via /admin/server-config UI; you can
+# also continue to manage it via this YAML file. The UI surfaces every leaf with
+# a hint, so use it to discover the schema if this comment block is out of date.
+#
 # corporate_memory:
 #   # How knowledge reaches users:
 #   # "mandatory_only" — admin controls everything, no user voting
--- a/docs/architecture.md
+++ b/docs/architecture.md
@ -173,11 +173,20 @@ POST /api/sync/trigger (admin)

 `connectors/bigquery/extractor.py`

- Uses the DuckDB BigQuery community extension.
+- Uses the DuckDB BigQuery community extension via the `BqAccess` facade in `connectors/bigquery/access.py`.
 - No data download — views proxy all queries directly to BigQuery.
 - Auth via `GOOGLE_APPLICATION_CREDENTIALS` (service account JSON) or ADC.
 - Populates `_remote_attach` with `extension='bigquery'` and no `token_env` (env-based auth).

+### BigQuery — Materialized SQL
+
+`connectors/bigquery/extractor.py::materialize_query` (added in v0.25.0)
+
+- Runs admin-registered SQL through the DuckDB BigQuery extension via `BqAccess.duckdb_session()` and writes the result to `/data/extracts/bigquery/data/<id>.parquet` atomically (`<id>.parquet.tmp` → `os.replace`).
+- Triggered by `_run_materialized_pass` in `app/api/sync.py`, between custom-connectors and the orchestrator rebuild, on every `/api/sync/trigger`. Per-table `sync_schedule` is honored via `is_table_due()`.
+- Cost guardrail: a BQ dry-run via `app.api.v2_scan._bq_dry_run_bytes` (single source of truth for cost-estimate logic) is checked against `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; `0` disables). The check fails open: when the dry-run itself errors (DuckDB three-part syntax the native BQ client can't parse), a warning is logged and the materialization proceeds.
+- Distribution: result parquet rides the same manifest + `da sync` flow as Keboola tables. Per-user RBAC unchanged (`resource_grants(group, ResourceType.TABLE, table_id)`).
+
 ### Jira — Real-Time Push

 `connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py`
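
Note on the Materialized SQL section in the hunk above: the fail-open guardrail and the atomic parquet write can be sketched as follows. This is an illustration under stated assumptions, not the code in `materialize_query`. The names `within_budget` and `write_atomically` are hypothetical, and the `dry_run` callable stands in for `_bq_dry_run_bytes`, whose real signature is not shown in this diff.

```python
import logging
import os
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


def within_budget(
    sql: str,
    dry_run: Callable[[str], int],  # stand-in for the real dry-run helper
    max_bytes: int = DEFAULT_MAX_BYTES,
) -> bool:
    """Return True when the materialization may proceed."""
    if max_bytes == 0:
        return True  # 0 disables the guardrail
    try:
        estimate = dry_run(sql)
    except Exception as exc:
        # Fail open: a broken estimate must not block the sync.
        logger.warning("BQ dry-run failed, proceeding without estimate: %s", exc)
        return True
    return estimate <= max_bytes


def write_atomically(target: Path, payload: bytes) -> None:
    """Write to `<name>.tmp` first, then swap it into place with os.replace."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)
    # os.replace is atomic when source and target share a filesystem, so
    # readers see either the old file or the complete new one.
    os.replace(tmp, target)
```

Keeping the temporary file in the same directory as the target is what makes the rename atomic; a temp file on another filesystem would lose that guarantee.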
--- a/docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md
+++ b/docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md
--- a/scripts/seed_dummy_tables.py
+++ b/scripts/seed_dummy_tables.py
@ -2,8 +2,8 @@

 Used to exercise the /admin/access UI with the new ResourceType.TABLE
 without depending on a real data source. Each entry is registered with
-``is_public=False`` so per-group grants are meaningful (a public table
-would bypass any future enforcement).
+default RBAC (no ``is_public`` bypass; that column was dropped in v19),
+so per-group grants are required for analyst visibility.

 Idempotent — TableRegistryRepository.register() does an UPSERT via
 ON CONFLICT, so re-running this script just refreshes the rows.
@ -65,7 +65,6 @@ def main() -> None:
                query_mode="local",
                description=description,
                registered_by="seed_dummy_tables",
-                is_public=False,
                profile_after_sync=False,
            )
        after = len(repo.list_all())
--- a/tests/test_diagnose_billing.py
+++ b/tests/test_diagnose_billing.py
@ -0,0 +1,111 @@
+"""Phase K — `da diagnose` warning when BQ billing_project == project.
+
+Surfaces via /api/health/detailed (which `da diagnose` already consumes):
+when data_source.type == 'bigquery' and the resolved BqProjects.billing equals
+BqProjects.data, the response includes a `services.bq_config` entry with
+status='warning' and a hint about the 403 USER_PROJECT_DENIED footgun.
+"""
+
+import pytest
+
+
+def _auth(token: str) -> dict:
+    return {"Authorization": f"Bearer {token}"}
+
+
+def _patch_instance_config(monkeypatch, cfg: dict) -> None:
+    """Replace app.instance_config.load_instance_config + reset caches.
+
+    Also clears connectors.bigquery.access.get_bq_access's @functools.cache
+    so each test sees fresh BqProjects.
+    """
+    monkeypatch.setattr(
+        "app.instance_config.load_instance_config",
+        lambda: cfg,
+        raising=False,
+    )
+    # DATA_SOURCE env var, if set in the user shell, would override
+    # get_data_source_type — strip it for deterministic tests.
+    monkeypatch.delenv("DATA_SOURCE", raising=False)
+    monkeypatch.delenv("BIGQUERY_PROJECT", raising=False)
+
+    from app.instance_config import reset_cache
+    reset_cache()
+
+
+@pytest.fixture(autouse=True)
+def _reset_after(monkeypatch):
+    yield
+    # Always reset the cache after each test so the next test (or an
+    # unrelated suite running afterwards) sees fresh config.
+    try:
+        from app.instance_config import reset_cache
+        reset_cache()
+    except Exception:
+        pass
+
+
+def test_diagnose_warns_when_billing_equals_project(seeded_app, monkeypatch):
+    """BQ instance with billing_project missing (or equal to project) → warning."""
+    _patch_instance_config(monkeypatch, {
+        "data_source": {
+            "type": "bigquery",
+            "bigquery": {
+                "project": "shared-data-prod",
+                "billing_project": "shared-data-prod",
+            },
+        },
+    })
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.get("/api/health/detailed", headers=_auth(token))
+    assert r.status_code == 200, r.text
+    body = r.json()
+
+    bq_cfg = body.get("services", {}).get("bq_config")
+    assert bq_cfg is not None, body
+    assert bq_cfg.get("status") == "warning", bq_cfg
+    # Hint mentions the YAML field path so operators know what to fix.
+    blob = (str(bq_cfg.get("detail", "")) + " " + str(bq_cfg.get("hint", ""))).lower()
+    assert "billing_project" in blob, bq_cfg
+
+
+def test_diagnose_clean_when_billing_differs(seeded_app, monkeypatch):
+    """Distinct billing_project → no warning surfaced."""
+    _patch_instance_config(monkeypatch, {
+        "data_source": {
+            "type": "bigquery",
+            "bigquery": {
+                "project": "data-prod",
+                "billing_project": "billing-dev",
+            },
+        },
+    })
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.get("/api/health/detailed", headers=_auth(token))
+    assert r.status_code == 200, r.text
+    body = r.json()
+
+    bq_cfg = body.get("services", {}).get("bq_config")
+    # If present, it must be ok; absence is also fine (means no warning).
+    if bq_cfg is not None:
+        assert bq_cfg.get("status") == "ok", bq_cfg
+
+
+def test_diagnose_no_warning_on_keboola_instance(seeded_app, monkeypatch):
+    """Non-BQ instance: BQ billing check shouldn't surface at all."""
+    _patch_instance_config(monkeypatch, {"data_source": {"type": "keboola"}})
+
+    c = seeded_app["client"]
+    token = seeded_app["admin_token"]
+    r = c.get("/api/health/detailed", headers=_auth(token))
+    assert r.status_code == 200, r.text
+    body = r.json()
+
+    # Either absent or explicitly status='ok' (n/a). Definitely not 'warning'.
+    bq_cfg = body.get("services", {}).get("bq_config")
+    if bq_cfg is not None:
+        assert bq_cfg.get("status") != "warning", bq_cfg