feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs

Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
  - New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
  - Non-BQ instances (Keboola-only, etc.) skip the check.
  - Implementation hooks into the existing /api/health/detailed surface — no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
  - data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
  - data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
  - data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW — wasn't documented in .example before)
  - ai.base_url: provider list + UI hint
  - openmetadata + desktop: 'configurable via /admin/server-config UI' headers
  - corporate_memory: leading note that the schema is editable via UI

Other docs:
  - CHANGELOG.md: comprehensive Unreleased section
  - CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
  - README.md: mode-first source table summary
  - docs/architecture.md: per-connector tab UI mention
  - cli/skills/connectors.md: bootstrap rails (parallel to #154)
  - docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
  - scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
  - test_diagnose_billing.py — 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
ZdenekSrotyr 2026-05-01 20:27:24 +02:00
parent df7f5b1d9a
commit b627de8344
10 changed files with 3071 additions and 21 deletions


@ -10,6 +10,258 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
## [Unreleased]
### Added
- **admin UI**: each row in `/admin/tables` listings now has a per-row
**Manage access** icon button (between Edit and Delete) that deep-links
to `/admin/access#table:<table_id>`. The grant editor reads the hash on
load and pre-fills the resource filter so the operator lands on the
picked table once they select a group — shortcut for the common
"I just registered table X, who should see it?" workflow without
manual navigation through the resource tree.
- **docs**: `config/instance.yaml.example` documents every field newly
exposed by `/admin/server-config`: `data_source.bigquery.billing_project`
(with the USER_PROJECT_DENIED hint), `data_source.bigquery.legacy_wrap_views`,
`data_source.bigquery.max_bytes_per_materialize`, `ai.base_url`,
`openmetadata.*`, `desktop.*`, and the full `corporate_memory.*` block.
Each cross-references the admin UI so operators discover the editor exists.
- **diagnostics**: `/api/health/detailed` (and therefore `da diagnose`) now
surfaces a `bq_config` service entry on BigQuery instances. Reports
`status="warning"` when `data_source.bigquery.billing_project` resolves
equal to `data_source.bigquery.project` — the configuration where a
service account with `roles/bigquery.dataViewer` on the data project but
no `serviceusage.services.use` 403s every BQ call with
USER_PROJECT_DENIED. The warning includes a hint pointing at the
`instance.yaml` field and the `/admin/server-config` UI.
- **admin UI**: `/admin/server-config` exposes the full **corporate_memory
governance schema** in the editor — `distribution_mode`, `approval_mode`,
`review_period_months`, `notify_on_new_items`, the `sources` /
`extraction` / `confidence` / `contradiction_detection` /
`entity_resolution` nested objects, plus the `domain_owners` /
`domains` lists. The whole section is optional (omitted = legacy
democratic-wiki mode); admins can opt in via the UI without hand-editing
YAML. Schema mirrors `config/instance.yaml.example` lines 224-317.
`confidence.modifiers` (map<string, map<string, float>>) currently
renders as a JSON-textarea fallback with the schema explained inline —
full structured editor is a TODO.
- **admin UI**: server-config renderer learned three new shapes —
`kind="array"` with a scalar `item_kind` renders as a vertical stack
of typed inputs with +/- row controls; `kind="map"` with scalar
`value_kind` renders as key:value rows with +/- controls;
`value_kind="array"` inside a map renders the value column as a
comma-separated list (pragmatic compromise over a full nested-array
UI inside each map row). Leaf inputs now carry `data-path` (JSON-encoded
segment array) so map keys with embedded dots —
e.g. `confidence.base["user_verification.correction"]` — survive
round-trip without being mistaken for nested-path separators.
- **admin UI**: `/admin/server-config` renders registry-declared nested
fields (`kind="object"` with explicit `fields`) as a fully-editable
structured form — every leaf is its own input with a dotted-path
`data-key`, and the collector rebuilds a nested patch on save. Replaces
the previous read-only preview that forced operators to edit a parent
JSON textarea. YAML-only keys outside the registry survive via an
"Other (YAML-only) keys" expander per nested layer. Recursion handles
arbitrary depth, ready for the upcoming corporate_memory + admins
registry entries.
- **admin UI**: `/admin/server-config` now ships a known-fields registry
(`_KNOWN_FIELDS` in `app/api/admin.py`, exposed on the GET response as
`known_fields`). The renderer shows registry-declared knobs as dashed
placeholders alongside populated values, with a one-line hint per
field, so operators discover optional config (e.g.
`data_source.bigquery.billing_project`) directly in the UI instead of
having to read docs or hit a runtime error first. Subagents 2-4 will
populate the bodies; the smoke fixture covers `bigquery.billing_project`.
- **admin UI**: `/admin/server-config` now exposes three previously
YAML-only BigQuery knobs in the editor — `data_source.bigquery.billing_project`,
`legacy_wrap_views`, and `max_bytes_per_materialize`. The GET response
always includes them under `data_source.bigquery` (with documented
defaults when YAML omits them) so the JSON-textarea UI shows them as
editable keys. The section help text describes each. Operators no
longer need to SSH to the VM, edit YAML, and restart to flip these.
- **admin UI**: `/admin/tables` is now a per-connector tab interface
(BigQuery / Keboola / Jira). Each tab has its own Register modal +
listing scoped to its source_type. Active tab persists in
`window.location.hash` so refresh keeps the operator in place.
- **Keboola materialized SQL**: `query_mode='materialized'` now works
for `source_type='keboola'` — admin registers a SELECT against
`kbc."bucket"."table"` and the scheduler writes the result to
`/data/extracts/keboola/data/<id>.parquet`. Same flow as BigQuery
materialized; same `da sync` distribution; same RBAC. Cost guardrail
(BQ-style dry-run) intentionally omitted — Keboola extension has no
dry-run analog and Storage API cost is download-byte-shaped, not
scan-byte-shaped. A future PR can add a configurable byte cap if
operators ask for it.
- **Keboola Sync Schedule**: per-table cron input added to the Keboola
tab Register and Edit modals. The scheduler has always honored
per-table `sync_schedule` for every source via `is_table_due()`,
but the Keboola UI had no surface for it — operators had to use the
`/api/admin/registry/{id}` PUT endpoint or `da admin` CLI. Now they
can type `every 6h` / `daily 03:00` directly.
- **BigQuery `query_mode='materialized'`** — admin registers a SQL query
via `da admin register-table --query-mode materialized --query @file.sql
--sync-schedule "every 6h"`; the sync trigger pass runs it through the
DuckDB BigQuery extension via the `BqAccess` facade on each tick that's
due (per-table `sync_schedule` honored via `is_table_due()`) and writes
the result to `/data/extracts/bigquery/data/<name>.parquet`. The
manifest endpoint exposes the row to `da sync`, which distributes the
parquet to analysts; analysts query it through their **local** DuckDB
view. The server-side orchestrator does **not** create a master view
for materialized tables — they are intentionally local-only for
analyst distribution, mirroring the v2 fetch primitives' "queryable
via `da fetch` not via remote" contract. Per-user RBAC filtering is
unchanged: a materialized table is just another row in
`table_registry` with `resource_grants` controlling which groups see it.
- **Schema v20** adds `source_query TEXT` column to `table_registry` to
back the materialized mode. NULL for existing rows. The
`materialize_query()` function in the BigQuery extractor performs the
COPY atomically (`<id>.parquet.tmp` → `os.replace`) so a failed query
never leaves a half-written parquet.
- BigQuery cost guardrail for `query_mode='materialized'` tables: before
each COPY the scheduler runs a BQ dry-run (reusing
`app.api.v2_scan._bq_dry_run_bytes` so cost-estimate logic lives in
exactly one place) and raises `MaterializeBudgetError` (skips the row)
when the estimate exceeds `data_source.bigquery.max_bytes_per_materialize`.
Default 10 GiB; explicit `0` disables (YAML `null` falls through to
the default — documented in `config/instance.yaml.example`).
Fail-open when the dry-run itself errors (library missing, DuckDB
three-part syntax the native BQ client can't parse, transient API
failure) — logs a warning instead of blocking the COPY.
- Admin API: `POST /api/admin/register-table` and
`PUT /api/admin/registry/{id}` accept `source_query` field. Validator
enforces that `query_mode='materialized'` requires `source_query` and
`query_mode in ('local', 'remote')` forbids it. PUT also rejects
`source_query` set without `query_mode` in the same request body and
clears the stale `source_query` when switching the merged record away
from materialized mode.
- CLI: `da admin register-table --query <SQL>` accepts inline SQL or
`@path/to.sql` shorthand for reading from disk. Reuses the existing
`--sync-schedule` flag for the cron string.
- `da sync --quiet` flag suppresses Rich progress + multi-line summary,
intended for use from Claude Code SessionStart/SessionEnd hooks and
cron jobs. Errors still surface on stderr; the no-op case is silent.
The terse summary line in `--quiet` mode (`sync: N tables, M errors`)
lands on stderr so stdout stays clean for hook callers.
- `da analyst setup` now installs `SessionStart` (pull) and `SessionEnd`
(upload) hooks into `<workspace>/.claude/settings.json`, idempotently,
preserving any existing user-owned hooks. Workspace-level (not
user-home) so the hooks fire only when Claude Code is opened in the
analyst workspace, not in unrelated sessions on the same machine.
Hooks assume `da` is on `PATH`. If the CLI is not installed system-wide
(e.g. via `pipx` or `pip install -e .`), the hooks no-op silently —
expected graceful degradation, never blocks a session.
- `docs/setup/claude_settings.json` ships the same two hooks so operators
bootstrapping a fresh Claude Code workspace get auto-sync out of the box.
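
The cost-guardrail rules listed above (default cap, explicit `0` disables, YAML `null` falls through to the default, fail-open when the dry-run errors) can be sketched roughly as follows. This is a minimal illustration, not the shipped implementation: `resolve_cap`, `check_budget`, and the `dry_run` callable are hypothetical names, and only `MaterializeBudgetError` and the 10 GiB default come from the entries above.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("materialize")

DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


class MaterializeBudgetError(Exception):
    """Raised when the dry-run estimate exceeds the configured cap."""


def resolve_cap(configured: Optional[int]) -> Optional[int]:
    """Map the YAML value to an effective cap; None means 'no cap'."""
    if configured is None:  # key omitted or YAML null -> default
        return DEFAULT_MAX_BYTES
    if configured == 0:  # explicit 0 disables the guardrail
        return None
    return configured


def check_budget(
    sql: str,
    configured: Optional[int],
    dry_run: Callable[[str], int],  # hypothetical: returns estimated bytes
) -> None:
    """Raise MaterializeBudgetError if the estimate exceeds the cap."""
    cap = resolve_cap(configured)
    if cap is None:
        return
    try:
        estimate = dry_run(sql)
    except Exception as exc:  # fail-open: warn and let the COPY proceed
        logger.warning("dry-run failed, skipping cost check: %s", exc)
        return
    if estimate > cap:
        raise MaterializeBudgetError(
            f"estimated {estimate} bytes exceeds cap of {cap} bytes"
        )
```

In this sketch a caller would catch `MaterializeBudgetError` per row and skip that row, which matches the "skips the row" behaviour described above.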
### Changed
- **admin UI**: Keboola Register and Edit modals adopt the same
two-question radio model as BigQuery — *What to sync?* (Whole table
/ Custom SQL). Whole-table mode synthesizes a `SELECT *` and writes
it through the materialized path; Custom mode lets the admin filter
/ aggregate / project. The legacy `query_mode='local'` extractor
path remains supported for back-compat but is no longer the default
for new Keboola registrations — Whole mode is functionally
equivalent and follows the unified materialized pipeline.
- **admin UI**: `Sync Strategy` dropdown removed from the Keboola form
(Register and Edit). Two independent agent reviews (2026-05-01) found
the field's hint claimed it controlled extraction but no extractor
reads it; only `profiler.is_partitioned()` consumes it for parquet-
layout detection. Field stays in the DB and Pydantic model for
back-compat (marked `Field(deprecated=True)`); just hidden from the
primary form.
- **admin UI**: `Primary Key` input moved under `<details>Advanced` in
both Keboola Register and Edit modals, with a clarifying hint that
it's catalog metadata only — Agnes always does full-overwrite sync;
no upsert / dedup. Auto-fill from Keboola discovery still works.
- **admin UI**: Registry listing column "Strategy" replaced with "Mode"
(showing `query_mode` instead of decorative `sync_strategy`). The
`.col-strategy` / `.strategy-badge` CSS rules removed.
- BigQuery `init_extract` no longer creates remote views for rows with
`query_mode='materialized'`; those live as parquets and surface via
the orchestrator's standard local-parquet discovery. Skipped rows do
not appear in `_meta` so cross-source view-name collisions remain
impossible.
### Deprecated
- `RegisterTableRequest.sync_strategy` — catalog/profiler metadata only;
no extractor reads it. Marked `Field(deprecated=True)`. External API
consumers see the signal in OpenAPI; back-compat preserved.
- `RegisterTableRequest.profile_after_sync` — runtime never read this
flag (Agent 1 finding 2026-05-01); profiler runs unconditionally on
every synced table. Marked `Field(deprecated=True)` and made inert
(the BQ register endpoint no longer force-sets it to `False`).
Back-compat preserved — external clients sending the field get no
error, no warning, no effect.
### Fixed
- **admin API**: `update_table` PUT preserves `sync_strategy` and
`primary_key` when the Edit modal omits them from the payload (this
invariant always held via `request.model_dump()` + `if v is not None`,
but Phase I now has an explicit regression-guard test).
- `docs/setup/claude_settings.json` no longer references the deleted
`server/scripts/collect_session.py` — the dead `SessionEnd` hook had
silently failed in every Claude Code session since the v1→v2 server
purge. Replaced with `da sync --upload-only --quiet`.
### Internal
- README mode-first source table; new "Local sync & auto-update" section
covering `da sync`, hooks, and admin RBAC for auto-sync membership.
- `CLAUDE.md` schema chain extended through v20 with the `source_query`
description; four source modes documented in Connector Pattern (added
Materialized SQL); new "Local sync & Claude Code hooks" subsection
under Development.
- `cli/skills/connectors.md` — "BigQuery: pick a mode" decision table
with cost / guardrail / registration example.
- `docs/architecture.md` — new "BigQuery — Materialized SQL" subsection
describing the COPY pipeline, BqAccess integration, and cost guardrail.
- BQ cost guardrail dry-run is performed via the native
`google-cloud-bigquery` client (through `BqAccess.client()`), which
does not parse DuckDB three-part identifiers (`bq."ds"."t"`). Queries
written in DuckDB syntax fall through fail-open and log a warning
instead of engaging the cap. Operators who need the cap to be
enforceable must register the materialized SQL using native BQ
identifiers (`\`project.ds.t\``).
- Hardenings landed during devil's-advocate review of PR #145:
- `materialize_query` computes the parquet MD5 inline (after COPY,
before `os.replace`) instead of re-reading the file in
`_run_materialized_pass` — saves a full sequential read on the
request thread for multi-GB parquets.
- 0-row materializations log a `WARNING` so an empty result set
can't masquerade as "the SQL is fine, today there's nothing".
- The ATTACH-tolerated `except duckdb.Error: pass` is narrowed to
the "alias already attached" case; real errors (cross-project
permission, malformed project_id) propagate so the per-row
aggregator records them correctly instead of surfacing a
confusing downstream "bq is not attached".
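
The write-then-swap pattern behind `materialize_query` (write to `<id>.parquet.tmp`, hash, then `os.replace`) might look something like the sketch below. It assumes a hypothetical `produce` callable standing in for the DuckDB COPY, and `write_atomically` is an illustrative name rather than a function in the codebase.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> str:
    """Write to '<name>.tmp', hash it, swap it into place, return the MD5."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)  # hypothetical stand-in for the COPY step
        digest = hashlib.md5()
        with open(tmp_path, "rb") as fh:
            # Hash in 1 MiB chunks so large files are never held in memory.
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
        # Atomic when source and destination share a filesystem.
        os.replace(tmp_path, final_path)
    except Exception:
        # A failed query must not leave a half-written file behind.
        tmp_path.unlink(missing_ok=True)
        raise
    return digest.hexdigest()
```

Because the swap happens only after the temporary file is complete and hashed, a reader of `final_path` sees either the previous parquet or the new one, never a partial write.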
### Known limitations
Operators should be aware of these production-only behaviours; tests
cannot exercise them and they will be revisited in follow-up PRs:
- **GCE metadata token expiry mid-COPY (catastrophic for very long
scans).** The DuckDB BQ extension caches the token in a session
SECRET created at session-open. A `materialize_query` call that
takes longer than the token's remaining lifetime (~1h) will see
silent 401s downstream and may produce a truncated parquet. No
current mitigation; if your materialized SQL scans more than ~30
GiB on a single COPY, run it via the BQ console / Storage Read
API offline and `da fetch` the result instead until token refresh
is wired into the BQ extension's session.
- **DuckDB `bigquery` community extension is unpinned.**
`INSTALL bigquery FROM community; LOAD bigquery;` picks up the
latest published version on every cold start. A breaking change
upstream surfaces as a production failure with no test signal.
- **Schema drift after a SQL edit silently breaks analyst queries.**
Editing `source_query` to drop a column writes a new parquet with
the new shape; analysts' queries that referenced the dropped
column 500 on the next sync without warning. No diff or version
field surfaces this. Workaround: announce changes in the team
channel before editing materialized SQL.
- **`materialize_query` is not concurrency-locked.** Two concurrent
`/api/sync/trigger` calls for the same materialized row race on
`<id>.parquet.tmp`. `init_extract` has `_INIT_EXTRACT_LOCK` for
the remote-attach path, but the materialized path does not yet.
In practice: the cron scheduler is single-threaded and manual
triggers are rare, so the race window is small.
## [0.29.0] — 2026-05-01
### Fixed


@ -117,9 +117,10 @@ The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into
(serve) (da sync)
```
Three source types:
- **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
Source modes:
- **Batch pull** (Keboola, `query_mode='local'`): DuckDB extension downloads to parquet, scheduled
- **Remote attach** (BigQuery, `query_mode='remote'`): DuckDB BQ extension, no download, queries go to BQ
- **Materialized SQL** (BigQuery, `query_mode='materialized'`): scheduler runs admin-registered SQL through DuckDB BQ extension (via `BqAccess` from `connectors/bigquery/access.py`) and writes the result to `/data/extracts/bigquery/data/<id>.parquet`. Distributed via the same manifest + `da sync` flow as Keboola tables. Cost guardrail via `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable — YAML `null` falls through to the default).
- **Real-time push** (Jira): Webhooks update parquets incrementally
## Configuration
@ -148,6 +149,19 @@ curl -X POST http://localhost:8000/api/sync/trigger
docker compose up
```
### Local sync & Claude Code hooks
`da sync` is the canonical analyst-side distribution path: pulls the RBAC-filtered manifest from the server, downloads parquets whose MD5 changed (skipping `query_mode='remote'` rows), rebuilds local DuckDB views over them.
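
As a rough sketch of the delta-pull decision described above, the function below picks which manifest rows need a download. The manifest field names (`id`, `query_mode`, `md5`) and the function name are assumptions for illustration; the real manifest shape may differ.

```python
from typing import Dict, List, TypedDict


class ManifestRow(TypedDict):
    # Hypothetical manifest shape, for illustration only.
    id: str
    query_mode: str
    md5: str


def rows_to_download(
    manifest: List[ManifestRow], local_md5: Dict[str, str]
) -> List[str]:
    """Return ids whose parquet is missing locally or has a different MD5."""
    return [
        row["id"]
        for row in manifest
        # Remote rows have no parquet to download.
        if row["query_mode"] != "remote"
        # A missing local entry yields None, which never equals a digest.
        and local_md5.get(row["id"]) != row["md5"]
    ]
```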
`da analyst setup` writes two hooks into `<workspace>/.claude/settings.json`:
- `SessionStart` → `da sync --quiet`: pulls fresh parquets at the start of every Claude Code session
- `SessionEnd` → `da sync --upload-only --quiet`: uploads session jsonl + `CLAUDE.local.md` to the server
Both pass `--quiet` so they don't pollute Claude Code stdout, and trail with `|| true` so a server outage never blocks a session. Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace, not in unrelated sessions on the same machine.
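
An idempotent hook install that preserves user-owned hooks could be sketched along these lines. The nested `hooks` → event → `hooks` → `command` layout is an assumption about the `settings.json` shape, and `install_hooks` is a hypothetical name, not the function `da analyst setup` actually uses.

```python
import copy
from typing import Any, Dict

# Commands as described above; the trailing '|| true' keeps a server
# outage from blocking the session.
HOOK_COMMANDS = {
    "SessionStart": "da sync --quiet || true",
    "SessionEnd": "da sync --upload-only --quiet || true",
}


def install_hooks(settings: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of settings with the two hooks added if absent."""
    merged = copy.deepcopy(settings)  # never mutate the caller's dict
    hooks = merged.setdefault("hooks", {})
    for event, command in HOOK_COMMANDS.items():
        groups = hooks.setdefault(event, [])
        already_installed = any(
            hook.get("command") == command
            for group in groups
            for hook in group.get("hooks", [])
        )
        if not already_installed:
            # Append, so existing user-owned hooks stay first and untouched.
            groups.append({"hooks": [{"type": "command", "command": command}]})
    return merged
```

Running the function twice yields the same result as running it once, which is the property the setup command relies on.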
Admin RBAC for auto-sync: `query_mode IN ('local', 'materialized')` plus a `resource_grants` row for one of the analyst's groups → table appears in their manifest → `da sync` downloads it. No per-user sync config; the admin layer is the single source of truth.
## Business Metrics
Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:
@ -416,7 +430,7 @@ Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `googl
## Key Implementation Details
### DuckDB Schema (src/db.py)
- Schema v19 with auto-migration v1→…→v19 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`** — see CHANGELOG and docs/RBAC.md)
- Schema v20 with auto-migration v1→…→v20 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)** — see CHANGELOG and docs/RBAC.md)
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
- `sync_state`, `sync_history`: track extraction progress
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.


@ -40,11 +40,14 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
## Supported Data Sources
| Source | Mode | Description |
|--------|------|-------------|
| **Keboola** | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule |
| **BigQuery** | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download |
| **Jira** | Real-time push | Webhook receiver updates Parquet files incrementally |
| Mode | Distribution | Sources | Use when |
|------|--------------|---------|----------|
| **Batch pull** (`local`) | Parquet on disk, scheduled | Keboola | Source has a native bulk-export and the table fits on disk |
| **Materialized SQL** (`materialized`) | Parquet on disk, scheduled query | BigQuery | Source table is too large to mirror; you want a curated subset on disk |
| **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
| **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets.
Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
@ -77,6 +80,44 @@ Once running, the FastAPI app is available at `http://localhost:8000` (or `https
curl -X POST http://localhost:8000/api/sync/trigger
```
## Local sync & auto-update
Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path:
```bash
da sync # delta-pull: manifest → MD5 compare → download changed → rebuild views
da sync --quiet # same, no progress output (for hooks/cron)
da sync --upload-only # push session jsonl + CLAUDE.local.md back to the server
```
`da analyst setup` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
- `SessionStart` → `da sync --quiet`: fresh data on every session
- `SessionEnd` → `da sync --upload-only --quiet`: uploads notes and session log
Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
### Admin: which tables auto-sync to whom
The auto-sync set per analyst is the intersection of:
1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
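
The intersection above can be expressed as a small set computation. This is only a sketch: the input shapes (a table-to-mode map and a table-to-groups map) and the name `auto_sync_set` are hypothetical, chosen to mirror the two conditions rather than the actual query against `table_registry` and `resource_grants`.

```python
from typing import Dict, Iterable, Set

SYNCABLE_MODES = {"local", "materialized"}


def auto_sync_set(
    query_modes: Dict[str, str],  # table_id -> query_mode
    grants: Dict[str, Set[str]],  # table_id -> groups granted access
    analyst_groups: Iterable[str],
) -> Set[str]:
    """Tables that have a parquet on disk AND are granted to the analyst."""
    groups = set(analyst_groups)
    on_disk = {t for t, mode in query_modes.items() if mode in SYNCABLE_MODES}
    granted = {t for t, allowed in grants.items() if allowed & groups}
    return on_disk & granted
```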
To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`.
For BigQuery, register a `query_mode='materialized'` table with a SQL body:
```bash
da admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--sync-schedule "every 6h"
```
The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB); queries whose BQ dry-run estimate exceeds this cap are skipped.
## Development Setup
```bash


@ -18,6 +18,65 @@ router = APIRouter(tags=["health"])
_DEPLOYED_AT = datetime.now(timezone.utc).isoformat()
def _check_bq_billing_project() -> dict | None:
    """Surface the USER_PROJECT_DENIED footgun when a BQ instance has
    `billing_project` falling back to (or explicitly equal to) `project`.

    Background: connectors/bigquery/access.py:339-342 lets `billing` default
    to `data` when `billing_project` is unset. A service account with
    `roles/bigquery.dataViewer` on the data project but no
    `serviceusage.services.use` on it then 403s on every BQ call with
    USER_PROJECT_DENIED. The config is technically valid, so we warn rather
    than error; the operator's billable project must be set distinctly.

    Returns:
        None when the check doesn't apply (non-BQ instance, or BQ deps missing).
        A service-entry dict otherwise: {"status": "ok"} or
        {"status": "warning", "detail": ..., "hint": ..., "billing_project": ...,
        "data_project": ...}.
    """
    try:
        from app.instance_config import get_data_source_type
    except Exception:
        return None
    if (get_data_source_type() or "").lower() != "bigquery":
        return None
    try:
        from connectors.bigquery.access import get_bq_access
        bq = get_bq_access()
        billing = bq.projects.billing
        data = bq.projects.data
    except Exception as e:
        return {"status": "ok", "detail": f"could not resolve BQ projects: {e}"}
    if not data:
        # not_configured sentinel; surfaced elsewhere, nothing to warn about here.
        return {"status": "ok", "detail": "BigQuery project not configured"}
    if billing == data:
        return {
            "status": "warning",
            "detail": "BigQuery billing project equals data project",
            "hint": (
                "Set data_source.bigquery.billing_project in instance.yaml to a "
                "project the SA can bill against (typically your dev/billable "
                "project, distinct from a shared read-only data project). "
                "Otherwise BQ calls 403 USER_PROJECT_DENIED whenever the SA "
                "lacks serviceusage.services.use on the data project. "
                "Configurable via /admin/server-config UI."
            ),
            "billing_project": billing,
            "data_project": data,
        }
    return {
        "status": "ok",
        "billing_project": billing,
        "data_project": data,
    }


def _check_db_schema() -> dict:
    """Check DB schema version against expected SCHEMA_VERSION.
@ -103,6 +162,11 @@ async def health_check_detailed(
    except Exception as e:
        checks["users"] = {"status": "error", "detail": str(e)}

    # BigQuery billing-project sanity check (USER_PROJECT_DENIED footgun).
    bq_cfg = _check_bq_billing_project()
    if bq_cfg is not None:
        checks["bq_config"] = bq_cfg

    overall = "healthy"
    for check in checks.values():
        if check.get("status") == "error":


@ -51,3 +51,28 @@ The `_meta` table must have columns:
- Instance-level config: `config/instance.yaml` (connection details)
- Table definitions: DuckDB `table_registry` table
- Credentials: environment variables
## BigQuery: pick a mode
| Need | Mode | Why |
|------|------|-----|
| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
| One-off subquery joined with local data | (no registry row) | Use `da query --register-bq …` for ad-hoc |
Cost: `materialized` runs once per `sync_schedule` regardless of how many analysts query it; `remote` runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned.
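
To make the break-even concrete, here is a small worked comparison. The numbers are purely illustrative, and `cheaper_mode` is a hypothetical helper that compares bytes scanned per day only; it ignores latency, disk footprint, and pricing tiers.

```python
GIB = 1024**3


def cheaper_mode(
    queries_per_day: int,
    bytes_per_query: int,
    copies_per_day: int,
    bytes_per_copy: int,
) -> str:
    """Compare daily bytes scanned: remote (per query) vs materialized (per COPY)."""
    remote_bytes = queries_per_day * bytes_per_query
    materialized_bytes = copies_per_day * bytes_per_copy
    return "materialized" if materialized_bytes < remote_bytes else "remote"


# "every 6h" means 4 COPY runs per day; assume each scans 5 GiB (20 GiB/day).
# 30 analyst queries at 1 GiB each would scan 30 GiB/day remotely.
busy = cheaper_mode(30, 1 * GIB, 4, 5 * GIB)
# With only 3 queries per day, remote scans 3 GiB/day and wins.
quiet = cheaper_mode(3, 1 * GIB, 4, 5 * GIB)
```

Under these assumed figures `busy` comes out as `"materialized"` and `quiet` as `"remote"`.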
Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`.
Register a materialized table:
```bash
da admin register-table orders_90d \
--source-type bigquery \
--query-mode materialized \
--query @docs/queries/orders_90d.sql \
--sync-schedule "every 6h"
```
`--query` also accepts inline SQL.


@ -115,17 +115,30 @@ data_source:
location: "${BIGQUERY_LOCATION}" # BigQuery location (e.g., "us-central1", "US")
# Uses ADC (Application Default Credentials) - VM service account on GCP
# Data can live in a different project -- use fully-qualified table IDs in data_description.md
# billing_project: "" # Optional: GCP project to bill BQ jobs to / submit jobs from.
# # Defaults to `project`. Set this when the SA has bigquery.data.* on
# # the data project but lacks serviceusage.services.use there (i.e.,
# # cross-project read pattern). Submission/billing target must be a
# # project the SA can use; data project just needs read.
# legacy_wrap_views: false # Set true to restore pre-v2 wrap views for BQ VIEW/MATERIALIZED_VIEW
# # tables in analytics.duckdb (migration escape hatch; default: false)
# billing_project: "prj-billing" # GCP project to bill BQ jobs to / submit jobs from.
# # Defaults to `project`. Set when the SA has bigquery.data.* on
# # the data project but lacks serviceusage.services.use there.
# # Mismatch -> every BQ call 403 USER_PROJECT_DENIED.
# # `da diagnose` warns when this falls back to `project`.
# # Configurable via /admin/server-config UI.
# legacy_wrap_views: false # When true, registered VIEWs and MATERIALIZED_VIEWs get a DuckDB
# # master view via bigquery_query() (jobs API) so analysts can
# # `SELECT * FROM viewname` directly. When false (default), views
# # are catalog-only -- analysts use `da fetch viewname` or
# # `da query --remote`. ON can cause "Response too large" on big
# # views; OFF requires analyst-side discipline (CLAUDE.md rails).
# # Toggle ON for view-heavy deployments where most views are small.
# # Configurable via /admin/server-config UI.
# max_bytes_per_materialize: 10737418240
# # Cost guardrail (bytes) for query_mode='materialized' BQ scans.
# # Dry-run check before running; exceeding -> registration / sync
# # rejected. Default 10 GiB (10737418240). Set 0 to disable.
# # null falls through to default. Configurable via /admin/server-config UI.
# --- OpenMetadata catalog (optional) ---
# Enriches table and column metadata from OpenMetadata REST API.
# If not configured, app works normally without catalog enrichment.
# All openmetadata.* fields configurable via /admin/server-config UI.
# openmetadata:
# url: "https://your-catalog.example.com"
# token: "${OPENMETADATA_TOKEN}" # JWT bearer token
@@ -147,6 +160,7 @@ email:
smtp_password: "${SMTP_PASSWORD}"
# --- Desktop app (optional) ---
# All desktop.* fields configurable via /admin/server-config UI (rarely changed once set).
desktop:
jwt_issuer: "data-analyst"
jwt_secret: "${DESKTOP_JWT_SECRET}"
@@ -174,7 +188,9 @@ jira:
ai:
provider: "anthropic" # or "openai_compat"
api_key: "${ANTHROPIC_API_KEY}" # or "${LLM_API_KEY}" for proxy
# base_url: "https://litellm.example.com" # required for openai_compat
# base_url: "https://litellm.example.com" # Required for provider='openai_compat' (LiteLLM,
# OpenRouter, vLLM). Ignored when provider='anthropic'.
# Configurable via /admin/server-config UI.
model: "claude-haiku-4-5-20251001" # any model available on your provider
# --- Structured output quality control ---
# AI models can return JSON in three ways, each with different reliability:
@@ -225,6 +241,10 @@ ai:
# Controls how AI-extracted knowledge is reviewed and distributed.
# If not present, system operates in legacy mode (democratic wiki, no admin review).
#
# The corporate_memory.* schema is editable via /admin/server-config UI; you can
# also continue to manage it via this YAML file. The UI surfaces every leaf with
# a hint, so use it to discover the schema if this comment block has aged.
#
# corporate_memory:
# # How knowledge reaches users:
# # "mandatory_only" — admin controls everything, no user voting

@@ -173,11 +173,20 @@ POST /api/sync/trigger (admin)
`connectors/bigquery/extractor.py`
- Uses the DuckDB BigQuery community extension.
- Uses the DuckDB BigQuery community extension via the `BqAccess` facade in `connectors/bigquery/access.py`.
- No data download — views proxy all queries directly to BigQuery.
- Auth via `GOOGLE_APPLICATION_CREDENTIALS` (service account JSON) or ADC.
- Populates `_remote_attach` with `extension='bigquery'` and no `token_env` (env-based auth).
### BigQuery — Materialized SQL
`connectors/bigquery/extractor.py::materialize_query` (added in v0.25.0)
- Runs admin-registered SQL through the DuckDB BigQuery extension via `BqAccess.duckdb_session()` and writes the result to `/data/extracts/bigquery/data/<id>.parquet` atomically (`<id>.parquet.tmp` → `os.replace`).
- Triggered by `_run_materialized_pass` in `app/api/sync.py` between custom-connectors and orchestrator rebuild on every `/api/sync/trigger`. Per-table `sync_schedule` honored via `is_table_due()`.
- Cost guardrail: a BQ dry-run via `app.api.v2_scan._bq_dry_run_bytes` (single source of truth for cost-estimate logic) is compared against `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; `0` disables). Fails open when the dry-run errors (DuckDB three-part syntax the native BQ client can't parse): logs a warning and proceeds.
- Distribution: result parquet rides the same manifest + `da sync` flow as Keboola tables. Per-user RBAC unchanged (`resource_grants(group, ResourceType.TABLE, table_id)`).
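The guardrail and atomic-write behavior described in the bullets above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `materialize_query`: `check_budget`, `write_atomically`, `MaterializeBudgetExceeded`, and the injected `dry_run` callable are hypothetical names, and the parquet payload is passed in as plain bytes rather than produced by DuckDB.

```python
import logging
import os
from pathlib import Path
from typing import Callable, Optional

log = logging.getLogger(__name__)

DEFAULT_MAX_BYTES = 10 * 1024 ** 3  # 10 GiB, the documented default


class MaterializeBudgetExceeded(RuntimeError):
    """Hypothetical error raised when the dry-run estimate exceeds the cap."""


def check_budget(sql: str, dry_run: Callable[[str], int],
                 max_bytes: Optional[int] = None) -> None:
    """Reject the query if its estimated scan exceeds the cap; fail open on errors."""
    cap = DEFAULT_MAX_BYTES if max_bytes is None else max_bytes  # null -> default
    if cap == 0:
        return  # 0 disables the guardrail
    try:
        estimate = dry_run(sql)
    except Exception as exc:
        # Fail open: the estimate is unavailable, so log a warning and proceed.
        log.warning("dry-run failed, proceeding without estimate: %s", exc)
        return
    if estimate > cap:
        raise MaterializeBudgetExceeded(
            f"estimated {estimate} bytes exceeds cap of {cap} bytes"
        )


def write_atomically(target: Path, payload: bytes) -> None:
    """Write to '<name>.tmp' first, then swap it into place with os.replace."""
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, target)  # readers never observe a half-written file
```

Writing to a temporary sibling and renaming keeps the swap on one filesystem, which is what lets `os.replace` be atomic on POSIX systems.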
### Jira — Real-Time Push
`connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py`

@@ -2,8 +2,8 @@
Used to exercise the /admin/access UI with the new ResourceType.TABLE
without depending on a real data source. Each entry is registered with
``is_public=False`` so per-group grants are meaningful (a public table
would bypass any future enforcement).
default RBAC (no `is_public` bypass; that column was dropped in v19),
so per-group grants are required for analyst visibility.
Idempotent: TableRegistryRepository.register() does an UPSERT via
ON CONFLICT, so re-running this script just refreshes the rows.
@@ -65,7 +65,6 @@ def main() -> None:
query_mode="local",
description=description,
registered_by="seed_dummy_tables",
is_public=False,
profile_after_sync=False,
)
after = len(repo.list_all())

@@ -0,0 +1,111 @@
"""Phase K — `da diagnose` warning when BQ billing_project == project.
Surfaces via /api/health/detailed (which `da diagnose` already consumes):
when data_source.type == 'bigquery' and the resolved BqProjects.billing equals
BqProjects.data, the response includes a `services.bq_config` entry with
status='warning' and a hint about the 403 USER_PROJECT_DENIED footgun.
"""
import pytest


def _auth(token: str) -> dict:
    return {"Authorization": f"Bearer {token}"}


def _patch_instance_config(monkeypatch, cfg: dict) -> None:
    """Replace app.instance_config.load_instance_config + reset caches.

    Also clears connectors.bigquery.access.get_bq_access's @functools.cache
    so each test sees fresh BqProjects.
    """
    monkeypatch.setattr(
        "app.instance_config.load_instance_config",
        lambda: cfg,
        raising=False,
    )
    # DATA_SOURCE env var, if set in the user shell, would override
    # get_data_source_type — strip it for deterministic tests.
    monkeypatch.delenv("DATA_SOURCE", raising=False)
    monkeypatch.delenv("BIGQUERY_PROJECT", raising=False)
    from app.instance_config import reset_cache
    reset_cache()


@pytest.fixture(autouse=True)
def _reset_after(monkeypatch):
    yield
    # Always reset the cache after each test so the next test (or an
    # unrelated suite running afterwards) sees fresh config.
    try:
        from app.instance_config import reset_cache
        reset_cache()
    except Exception:
        pass


def test_diagnose_warns_when_billing_equals_project(seeded_app, monkeypatch):
    """BQ instance with billing_project missing (or equal to project) → warning."""
    _patch_instance_config(monkeypatch, {
        "data_source": {
            "type": "bigquery",
            "bigquery": {
                "project": "shared-data-prod",
                "billing_project": "shared-data-prod",
            },
        },
    })
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get("/api/health/detailed", headers=_auth(token))
    assert r.status_code == 200, r.text
    body = r.json()
    bq_cfg = body.get("services", {}).get("bq_config")
    assert bq_cfg is not None, body
    assert bq_cfg.get("status") == "warning", bq_cfg
    # Hint mentions the YAML field path so operators know what to fix.
    blob = (str(bq_cfg.get("detail", "")) + " " + str(bq_cfg.get("hint", ""))).lower()
    assert "billing_project" in blob, bq_cfg


def test_diagnose_clean_when_billing_differs(seeded_app, monkeypatch):
    """Distinct billing_project → no warning surfaced."""
    _patch_instance_config(monkeypatch, {
        "data_source": {
            "type": "bigquery",
            "bigquery": {
                "project": "data-prod",
                "billing_project": "billing-dev",
            },
        },
    })
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get("/api/health/detailed", headers=_auth(token))
    assert r.status_code == 200, r.text
    body = r.json()
    bq_cfg = body.get("services", {}).get("bq_config")
    # If present, it must be ok; absence is also fine (means no warning).
    if bq_cfg is not None:
        assert bq_cfg.get("status") == "ok", bq_cfg


def test_diagnose_no_warning_on_keboola_instance(seeded_app, monkeypatch):
    """Non-BQ instance: BQ billing check shouldn't surface at all."""
    _patch_instance_config(monkeypatch, {"data_source": {"type": "keboola"}})
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.get("/api/health/detailed", headers=_auth(token))
    assert r.status_code == 200, r.text
    body = r.json()
    # Either absent or explicitly status='ok' (n/a). Definitely not 'warning'.
    bq_cfg = body.get("services", {}).get("bq_config")
    if bq_cfg is not None:
        assert bq_cfg.get("status") != "warning", bq_cfg
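For readers without the server code at hand, the three cases above pin down a check that could be sketched as follows. This is a hypothetical stand-in, not the `_check_bq_billing_project()` helper itself, whose body is not shown in this diff: the function name, the choice to return `None` for non-BQ instances, and the exact `detail`/`hint` wording are assumptions. It operates on a plain config dict instead of the resolved `BqProjects`.

```python
from typing import Optional


def check_bq_billing_project(cfg: dict) -> Optional[dict]:
    """Return a services.bq_config-style entry, or None when the check does not apply."""
    data_source = cfg.get("data_source") or {}
    if data_source.get("type") != "bigquery":
        return None  # non-BQ instances (e.g. Keboola-only) skip the check

    bq = data_source.get("bigquery") or {}
    data_project = bq.get("project")
    # billing_project defaults to `project` when unset or empty.
    billing_project = bq.get("billing_project") or data_project

    if billing_project == data_project:
        return {
            "status": "warning",
            "detail": "BigQuery billing project equals data project",
            # Assumed wording; the tests only require that the YAML field is named.
            "hint": (
                "Set data_source.bigquery.billing_project in instance.yaml "
                "or via the /admin/server-config UI"
            ),
        }
    return {"status": "ok"}


print(check_bq_billing_project({"data_source": {"type": "keboola"}}))
```

The final line prints `None`, matching the "absent" branch the Keboola test accepts.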