feat(diagnose) + docs: warn on USER_PROJECT_DENIED footgun + document all newly-exposed knobs
Diagnostic + operator-facing documentation that closes the loop on the work in this PR.

`da diagnose` (via /api/health/detailed):
- New _check_bq_billing_project() helper. When data_source.type='bigquery' and BqProjects.billing == .data, surface a yellow warning: 'BigQuery billing project equals data project'. Hint includes the YAML field path + the /admin/server-config UI shortcut. Diagnose's overall status promotes warning → degraded so the CLI echoes it.
- Non-BQ instances (Keboola-only, etc.) skip the check.
- Implementation hooks into the existing /api/health/detailed surface: no new endpoint, no CLI changes.

config/instance.yaml.example documentation:
- data_source.bigquery.billing_project: USER_PROJECT_DENIED hint, /admin/server-config UI reference
- data_source.bigquery.legacy_wrap_views: analyst-side discipline note (use `da fetch` / `da query --remote`), issue #101 history, view-heavy deployment guidance
- data_source.bigquery.max_bytes_per_materialize: cost guardrail block (NEW, wasn't documented in .example before)
- ai.base_url: provider list + UI hint
- openmetadata + desktop: 'configurable via /admin/server-config UI' headers
- corporate_memory: leading note that the schema is editable via UI

Other docs:
- CHANGELOG.md: comprehensive Unreleased section
- CLAUDE.md: schema chain → v20 + Materialized SQL connector mode + per-connector tab UI mention
- README.md: mode-first source table summary
- docs/architecture.md: per-connector tab UI mention
- cli/skills/connectors.md: bootstrap rails (parallel to #154)
- docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md: implementation plan archive (2515 lines)
- scripts/seed_dummy_tables.py: drop is_public after #150 RBAC migration (column gone)

Tests:
- test_diagnose_billing.py: 3 cases (BQ with billing==data warns, BQ with billing!=data clean, non-BQ skips)
parent df7f5b1d9a
commit b627de8344
10 changed files with 3071 additions and 21 deletions
252
CHANGELOG.md
|
|
@@ -10,6 +10,258 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
|
||||||
|
|
||||||
## [Unreleased]
|
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- **admin UI**: each row in `/admin/tables` listings now has a per-row
|
||||||
|
**Manage access** icon button (between Edit and Delete) that deep-links
|
||||||
|
to `/admin/access#table:<table_id>`. The grant editor reads the hash on
|
||||||
|
load and pre-fills the resource filter so the operator lands on the
|
||||||
|
picked table once they select a group — shortcut for the common
|
||||||
|
"I just registered table X, who should see it?" workflow without
|
||||||
|
manual navigation through the resource tree.
|
||||||
|
- **docs**: `config/instance.yaml.example` documents every field newly
|
||||||
|
exposed by `/admin/server-config` — `data_source.bigquery.billing_project`
|
||||||
|
(with the USER_PROJECT_DENIED hint), `data_source.bigquery.legacy_wrap_views`,
|
||||||
|
`data_source.bigquery.max_bytes_per_materialize`, `ai.base_url`,
|
||||||
|
`openmetadata.*`, `desktop.*`, and the full `corporate_memory.*` block.
|
||||||
|
Each cross-references the admin UI so operators discover the editor exists.
|
||||||
|
- **diagnostics**: `/api/health/detailed` (and therefore `da diagnose`) now
|
||||||
|
surfaces a `bq_config` service entry on BigQuery instances. Reports
|
||||||
|
`status="warning"` when `data_source.bigquery.billing_project` resolves
|
||||||
|
equal to `data_source.bigquery.project` — the configuration where a
|
||||||
|
service account with `roles/bigquery.dataViewer` on the data project but
without the `serviceusage.services.use` permission gets a 403
USER_PROJECT_DENIED on every BQ call. The warning includes a hint pointing at the
|
||||||
|
`instance.yaml` field and the `/admin/server-config` UI.
|
||||||
|
- **admin UI**: `/admin/server-config` exposes the full **corporate_memory
|
||||||
|
governance schema** in the editor — `distribution_mode`, `approval_mode`,
|
||||||
|
`review_period_months`, `notify_on_new_items`, the `sources` /
|
||||||
|
`extraction` / `confidence` / `contradiction_detection` /
|
||||||
|
`entity_resolution` nested objects, plus the `domain_owners` /
|
||||||
|
`domains` lists. The whole section is optional (omitted = legacy
|
||||||
|
democratic-wiki mode); admins can opt in via the UI without hand-editing
|
||||||
|
YAML. Schema mirrors `config/instance.yaml.example` lines 224-317.
|
||||||
|
`confidence.modifiers` (map<string, map<string, float>>) currently
|
||||||
|
renders as a JSON-textarea fallback with the schema explained inline —
|
||||||
|
full structured editor is a TODO.
|
||||||
|
- **admin UI**: server-config renderer learned three new shapes —
|
||||||
|
`kind="array"` with a scalar `item_kind` renders as a vertical stack
|
||||||
|
of typed inputs with +/- row controls; `kind="map"` with scalar
|
||||||
|
`value_kind` renders as key:value rows with +/- controls;
|
||||||
|
`value_kind="array"` inside a map renders the value column as a
|
||||||
|
comma-separated list (pragmatic compromise over a full nested-array
|
||||||
|
UI inside each map row). Leaf inputs now carry `data-path` (JSON-encoded
|
||||||
|
segment array) so map keys with embedded dots —
|
||||||
|
e.g. `confidence.base["user_verification.correction"]` — survive
|
||||||
|
round-trip without being mistaken for nested-path separators.
|
||||||
|
- **admin UI**: `/admin/server-config` renders registry-declared nested
|
||||||
|
fields (`kind="object"` with explicit `fields`) as a fully-editable
|
||||||
|
structured form — every leaf is its own input with a dotted-path
|
||||||
|
`data-key`, and the collector rebuilds a nested patch on save. Replaces
|
||||||
|
the previous read-only preview that forced operators to edit a parent
|
||||||
|
JSON textarea. YAML-only keys outside the registry survive via an
|
||||||
|
"Other (YAML-only) keys" expander per nested layer. Recursion handles
|
||||||
|
arbitrary depth, ready for the upcoming corporate_memory + admins
|
||||||
|
registry entries.
|
||||||
|
- **admin UI**: `/admin/server-config` now ships a known-fields registry
|
||||||
|
(`_KNOWN_FIELDS` in `app/api/admin.py`, exposed on the GET response as
|
||||||
|
`known_fields`). The renderer shows registry-declared knobs as dashed
|
||||||
|
placeholders alongside populated values, with a one-line hint per
|
||||||
|
field, so operators discover optional config (e.g.
|
||||||
|
`data_source.bigquery.billing_project`) directly in the UI instead of
|
||||||
|
having to read docs or hit a runtime error first. Subagents 2-4 will
|
||||||
|
populate the bodies; the smoke fixture covers `bigquery.billing_project`.
|
||||||
|
- **admin UI**: `/admin/server-config` now exposes three previously
|
||||||
|
YAML-only BigQuery knobs in the editor — `data_source.bigquery.billing_project`,
|
||||||
|
`legacy_wrap_views`, and `max_bytes_per_materialize`. The GET response
|
||||||
|
always includes them under `data_source.bigquery` (with documented
|
||||||
|
defaults when YAML omits them) so the JSON-textarea UI shows them as
|
||||||
|
editable keys. The section help text describes each. Operators no
|
||||||
|
longer need to SSH to the VM, edit YAML, and restart to flip these.
|
||||||
|
- **admin UI**: `/admin/tables` is now a per-connector tab interface
|
||||||
|
(BigQuery / Keboola / Jira). Each tab has its own Register modal +
|
||||||
|
listing scoped to its source_type. Active tab persists in
|
||||||
|
`window.location.hash` so refresh keeps the operator in place.
|
||||||
|
- **Keboola materialized SQL**: `query_mode='materialized'` now works
|
||||||
|
for `source_type='keboola'` — admin registers a SELECT against
|
||||||
|
`kbc."bucket"."table"` and the scheduler writes the result to
|
||||||
|
`/data/extracts/keboola/data/<id>.parquet`. Same flow as BigQuery
|
||||||
|
materialized; same `da sync` distribution; same RBAC. Cost guardrail
|
||||||
|
(BQ-style dry-run) intentionally omitted — Keboola extension has no
|
||||||
|
dry-run analog and Storage API cost is download-byte-shaped, not
|
||||||
|
scan-byte-shaped. A future PR can add a configurable byte cap if
|
||||||
|
operators ask for it.
|
||||||
|
- **Keboola Sync Schedule**: per-table cron input added to the Keboola
|
||||||
|
tab Register and Edit modals. The scheduler has always honored
|
||||||
|
per-table `sync_schedule` for every source via `is_table_due()`,
|
||||||
|
but the Keboola UI had no surface for it — operators had to use the
|
||||||
|
`/api/admin/registry/{id}` PUT endpoint or `da admin` CLI. Now they
|
||||||
|
can type `every 6h` / `daily 03:00` directly.
|
||||||
|
- **BigQuery `query_mode='materialized'`** — admin registers a SQL query
|
||||||
|
via `da admin register-table --query-mode materialized --query @file.sql
|
||||||
|
--sync-schedule "every 6h"`; the sync trigger pass runs it through the
|
||||||
|
DuckDB BigQuery extension via the `BqAccess` facade on each tick that's
|
||||||
|
due (per-table `sync_schedule` honored via `is_table_due()`) and writes
|
||||||
|
the result to `/data/extracts/bigquery/data/<name>.parquet`. The
|
||||||
|
manifest endpoint exposes the row to `da sync`, which distributes the
|
||||||
|
parquet to analysts; analysts query it through their **local** DuckDB
|
||||||
|
view. The server-side orchestrator does **not** create a master view
|
||||||
|
for materialized tables — they are intentionally local-only for
|
||||||
|
analyst distribution, mirroring the v2 fetch primitives' "queryable
|
||||||
|
via `da fetch` not via remote" contract. Per-user RBAC filtering is
|
||||||
|
unchanged: a materialized table is just another row in
|
||||||
|
`table_registry` with `resource_grants` controlling which groups see it.
|
||||||
|
- **Schema v20** adds `source_query TEXT` column to `table_registry` to
|
||||||
|
back the materialized mode. NULL for existing rows. The
|
||||||
|
`materialize_query()` function in the BigQuery extractor performs the
|
||||||
|
COPY atomically (`<id>.parquet.tmp` → `os.replace`) so a failed query
|
||||||
|
never leaves a half-written parquet.
|
||||||
|
- BigQuery cost guardrail for `query_mode='materialized'` tables: before
|
||||||
|
each COPY the scheduler runs a BQ dry-run (reusing
|
||||||
|
`app.api.v2_scan._bq_dry_run_bytes` so cost-estimate logic lives in
|
||||||
|
exactly one place) and raises `MaterializeBudgetError` (skips the row)
|
||||||
|
when the estimate exceeds `data_source.bigquery.max_bytes_per_materialize`.
|
||||||
|
Default 10 GiB; explicit `0` disables (YAML `null` falls through to
|
||||||
|
the default — documented in `config/instance.yaml.example`).
|
||||||
|
Fail-open when the dry-run itself errors (library missing, DuckDB
|
||||||
|
three-part syntax the native BQ client can't parse, transient API
|
||||||
|
failure) — logs a warning instead of blocking the COPY.
|
||||||
|
- Admin API: `POST /api/admin/register-table` and
|
||||||
|
`PUT /api/admin/registry/{id}` accept `source_query` field. Validator
|
||||||
|
enforces that `query_mode='materialized'` requires `source_query` and
|
||||||
|
`query_mode in ('local', 'remote')` forbids it. PUT also rejects
|
||||||
|
`source_query` set without `query_mode` in the same request body and
|
||||||
|
clears the stale `source_query` when switching the merged record away
|
||||||
|
from materialized mode.
|
||||||
|
- CLI: `da admin register-table --query <SQL>` accepts inline SQL or
|
||||||
|
`@path/to.sql` shorthand for reading from disk. Reuses the existing
|
||||||
|
`--sync-schedule` flag for the cron string.
|
||||||
|
- `da sync --quiet` flag suppresses Rich progress + multi-line summary,
|
||||||
|
intended for use from Claude Code SessionStart/SessionEnd hooks and
|
||||||
|
cron jobs. Errors still surface on stderr; the no-op case is silent.
|
||||||
|
The terse summary line in `--quiet` mode (`sync: N tables, M errors`)
|
||||||
|
lands on stderr so stdout stays clean for hook callers.
|
||||||
|
- `da analyst setup` now installs `SessionStart` (pull) and `SessionEnd`
|
||||||
|
(upload) hooks into `<workspace>/.claude/settings.json`, idempotently,
|
||||||
|
preserving any existing user-owned hooks. Workspace-level (not
|
||||||
|
user-home) so the hooks fire only when Claude Code is opened in the
|
||||||
|
analyst workspace, not in unrelated sessions on the same machine.
|
||||||
|
Hooks assume `da` is on `PATH`. If the CLI is not installed system-wide
|
||||||
|
(e.g. via `pipx` or `pip install -e .`), the hooks no-op silently —
|
||||||
|
expected graceful degradation, never blocks a session.
|
||||||
|
- `docs/setup/claude_settings.json` ships the same two hooks so operators
|
||||||
|
bootstrapping a fresh Claude Code workspace get auto-sync out of the box.
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- **admin UI**: Keboola Register and Edit modals adopt the same
|
||||||
|
two-question radio model as BigQuery — *What to sync?* (Whole table
|
||||||
|
/ Custom SQL). Whole-table mode synthesizes a `SELECT *` and writes
|
||||||
|
it through the materialized path; Custom mode lets the admin filter
|
||||||
|
/ aggregate / project. The legacy `query_mode='local'` extractor
|
||||||
|
path remains supported for back-compat but is no longer the default
|
||||||
|
for new Keboola registrations — Whole mode is functionally
|
||||||
|
equivalent and follows the unified materialized pipeline.
|
||||||
|
- **admin UI**: `Sync Strategy` dropdown removed from the Keboola form
|
||||||
|
(Register and Edit). Two independent agent reviews (2026-05-01) found
|
||||||
|
the field's hint claimed it controlled extraction but no extractor
|
||||||
|
reads it; only `profiler.is_partitioned()` consumes it for parquet-
|
||||||
|
layout detection. Field stays in the DB and Pydantic model for
|
||||||
|
back-compat (marked `Field(deprecated=True)`); just hidden from the
|
||||||
|
primary form.
|
||||||
|
- **admin UI**: `Primary Key` input moved under `<details>Advanced` in
|
||||||
|
both Keboola Register and Edit modals, with a clarifying hint that
|
||||||
|
it's catalog metadata only — Agnes always does full-overwrite sync;
|
||||||
|
no upsert / dedup. Auto-fill from Keboola discovery still works.
|
||||||
|
- **admin UI**: Registry listing column "Strategy" replaced with "Mode"
|
||||||
|
(showing `query_mode` instead of decorative `sync_strategy`). The
|
||||||
|
`.col-strategy` / `.strategy-badge` CSS rules removed.
|
||||||
|
- BigQuery `init_extract` no longer creates remote views for rows with
|
||||||
|
`query_mode='materialized'`; those live as parquets and surface via
|
||||||
|
the orchestrator's standard local-parquet discovery. Skipped rows do
|
||||||
|
not appear in `_meta` so cross-source view-name collisions remain
|
||||||
|
impossible.
|
||||||
|
|
||||||
|
### Deprecated
|
||||||
|
- `RegisterTableRequest.sync_strategy` — catalog/profiler metadata only;
|
||||||
|
no extractor reads it. Marked `Field(deprecated=True)`. External API
|
||||||
|
consumers see the signal in OpenAPI; back-compat preserved.
|
||||||
|
- `RegisterTableRequest.profile_after_sync` — runtime never read this
|
||||||
|
flag (Agent 1 finding 2026-05-01); profiler runs unconditionally on
|
||||||
|
every synced table. Marked `Field(deprecated=True)` and made inert
|
||||||
|
(the BQ register endpoint no longer force-sets it to `False`).
|
||||||
|
Back-compat preserved — external clients sending the field get no
|
||||||
|
error, no warning, no effect.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- **admin API**: `update_table` PUT preserves `sync_strategy` and
|
||||||
|
`primary_key` when the Edit modal omits them from the payload (this
|
||||||
|
invariant always held via `request.model_dump()` + `if v is not None`,
|
||||||
|
but Phase I now has an explicit regression-guard test).
|
||||||
|
- `docs/setup/claude_settings.json` no longer references the deleted
|
||||||
|
`server/scripts/collect_session.py` — the dead `SessionEnd` hook had
|
||||||
|
silently failed in every Claude Code session since the v1→v2 server
|
||||||
|
purge. Replaced with `da sync --upload-only --quiet`.
|
||||||
|
|
||||||
|
### Internal
|
||||||
|
- README mode-first source table; new "Local sync & auto-update" section
|
||||||
|
covering `da sync`, hooks, and admin RBAC for auto-sync membership.
|
||||||
|
- `CLAUDE.md` schema chain extended through v20 with the `source_query`
|
||||||
|
description; four source modes documented in Connector Pattern (added
|
||||||
|
Materialized SQL); new "Local sync & Claude Code hooks" subsection
|
||||||
|
under Development.
|
||||||
|
- `cli/skills/connectors.md` — "BigQuery: pick a mode" decision table
|
||||||
|
with cost / guardrail / registration example.
|
||||||
|
- `docs/architecture.md` — new "BigQuery — Materialized SQL" subsection
|
||||||
|
describing the COPY pipeline, BqAccess integration, and cost guardrail.
|
||||||
|
- BQ cost guardrail dry-run is performed via the native
|
||||||
|
`google-cloud-bigquery` client (through `BqAccess.client()`), which
|
||||||
|
does not parse DuckDB three-part identifiers (`bq."ds"."t"`). Queries
|
||||||
|
written in DuckDB syntax fall through fail-open and log a warning
|
||||||
|
instead of engaging the cap. Operators who need the cap to be
|
||||||
|
enforceable must register the materialized SQL using native BQ
|
||||||
|
identifiers (`\`project.ds.t\``).
|
||||||
|
- Hardenings landed during devil's-advocate review of PR #145:
|
||||||
|
- `materialize_query` computes the parquet MD5 inline (after COPY,
|
||||||
|
before `os.replace`) instead of re-reading the file in
|
||||||
|
`_run_materialized_pass` — saves a full sequential read on the
|
||||||
|
request thread for multi-GB parquets.
|
||||||
|
- 0-row materializations log a `WARNING` so an empty result set
|
||||||
|
can't masquerade as "the SQL is fine, today there's nothing".
|
||||||
|
- The ATTACH-tolerated `except duckdb.Error: pass` is narrowed to
|
||||||
|
the "alias already attached" case; real errors (cross-project
|
||||||
|
permission, malformed project_id) propagate so the per-row
|
||||||
|
aggregator records them correctly instead of surfacing a
|
||||||
|
confusing downstream "bq is not attached".
|
||||||
|
|
||||||
|
### Known limitations
|
||||||
|
Operators should be aware of these production-only behaviours; tests
|
||||||
|
cannot exercise them and they will be revisited in follow-up PRs:
|
||||||
|
|
||||||
|
- **GCE metadata token expiry mid-COPY (catastrophic for very long
|
||||||
|
scans).** The DuckDB BQ extension caches the token in a session
|
||||||
|
SECRET created at session-open. A `materialize_query` call that
|
||||||
|
takes longer than the token's remaining lifetime (~1h) will see
|
||||||
|
silent 401s downstream and may produce a truncated parquet. No
|
||||||
|
current mitigation; if your materialized SQL scans more than ~30
|
||||||
|
GiB on a single COPY, run it via the BQ console / Storage Read
|
||||||
|
API offline and `da fetch` the result instead until token refresh
|
||||||
|
is wired into the BQ extension's session.
|
||||||
|
- **DuckDB `bigquery` community extension is unpinned** —
|
||||||
|
`INSTALL bigquery FROM community; LOAD bigquery;` picks up the
|
||||||
|
latest published version on every cold start. A breaking change
|
||||||
|
upstream surfaces as a production failure with no test signal.
|
||||||
|
- **Schema drift after a SQL edit silently breaks analyst queries.**
|
||||||
|
Editing `source_query` to drop a column writes a new parquet with
|
||||||
|
the new shape; analysts' queries that referenced the dropped
|
||||||
|
column 500 on the next sync without warning. No diff or version
|
||||||
|
field surfaces this. Workaround: announce changes in the team
|
||||||
|
channel before editing materialized SQL.
|
||||||
|
- **`materialize_query` is not concurrency-locked.** Two concurrent
|
||||||
|
`/api/sync/trigger` calls for the same materialized row race on
|
||||||
|
`<id>.parquet.tmp`. `init_extract` has `_INIT_EXTRACT_LOCK` for
|
||||||
|
the remote-attach path, but the materialized path does not yet.
|
||||||
|
In practice: the cron scheduler is single-threaded and manual
|
||||||
|
triggers are rare, so the race window is small.
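
The `<id>.parquet.tmp` → `os.replace` write and the missing lock described above can be illustrated with a small sketch. This is not the project's `materialize_query`; `write_parquet_atomically`, the `produce` callback and the per-id lock table are hypothetical names used only to show the pattern, and the lock only serializes writers inside one process.

```python
import os
import threading
from collections import defaultdict
from pathlib import Path
from typing import Callable

# Hypothetical per-table locks; a real fix might use a file lock instead.
_LOCKS = defaultdict(threading.Lock)
_LOCKS_GUARD = threading.Lock()


def write_parquet_atomically(
    out_dir: str, table_id: str, produce: Callable[[Path], None]
) -> Path:
    """Write <id>.parquet via <id>.parquet.tmp + os.replace.

    `produce` stands in for the COPY step: it receives the temp path and
    must write the complete file there.
    """
    final = Path(out_dir) / f"{table_id}.parquet"
    tmp = final.with_name(final.name + ".tmp")
    with _LOCKS_GUARD:
        lock = _LOCKS[table_id]  # one lock per table id
    with lock:  # concurrent writers for the same id wait here
        try:
            produce(tmp)
            os.replace(tmp, final)  # atomic rename on the same filesystem
        except BaseException:
            tmp.unlink(missing_ok=True)  # never leave a half-written temp file
            raise
    return final
```

Because the rename happens only after `produce` returns, a failed query leaves the previous parquet untouched; the lock keeps two writers from sharing the same temp path.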
|
||||||
|
|
||||||
## [0.29.0] — 2026-05-01
|
|
||||||
|
|
||||||
### Fixed
|
|
||||||
|
|
|
||||||
22
CLAUDE.md
|
|
@@ -117,9 +117,10 @@ The SyncOrchestrator scans `/data/extracts/*/extract.duckdb`, ATTACHes each into
|
||||||
(serve) (da sync)
|
(serve) (da sync)
|
||||||
```
|
```
|
||||||
|
|
||||||
Three source types:
|
Source modes:
|
||||||
- **Batch pull** (Keboola): DuckDB extension downloads to parquet, scheduled
|
- **Batch pull** (Keboola, `query_mode='local'`): DuckDB extension downloads to parquet, scheduled
|
||||||
- **Remote attach** (BigQuery): DuckDB BQ extension, no download, queries go to BQ
|
- **Remote attach** (BigQuery, `query_mode='remote'`): DuckDB BQ extension, no download, queries go to BQ
|
||||||
|
- **Materialized SQL** (BigQuery, `query_mode='materialized'`): scheduler runs admin-registered SQL through DuckDB BQ extension (via `BqAccess` from `connectors/bigquery/access.py`) and writes the result to `/data/extracts/bigquery/data/<id>.parquet`. Distributed via the same manifest + `da sync` flow as Keboola tables. Cost guardrail via `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; set `0` to disable — YAML `null` falls through to the default).
|
||||||
- **Real-time push** (Jira): Webhooks update parquets incrementally
|
|
||||||
|
|
||||||
## Configuration
|
|
||||||
|
|
@@ -148,6 +149,19 @@ curl -X POST http://localhost:8000/api/sync/trigger
|
||||||
docker compose up
|
docker compose up
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Local sync & Claude Code hooks
|
||||||
|
|
||||||
|
`da sync` is the canonical analyst-side distribution path: it pulls the RBAC-filtered manifest from the server, downloads parquets whose MD5 changed (skipping `query_mode='remote'` rows), and rebuilds local DuckDB views over them.
|
||||||
|
|
||||||
|
`da analyst setup` writes two hooks into `<workspace>/.claude/settings.json`:
|
||||||
|
|
||||||
|
- `SessionStart` → `da sync --quiet` — pulls fresh parquets at the start of every Claude Code session
|
||||||
|
- `SessionEnd` → `da sync --upload-only --quiet` — uploads session jsonl + `CLAUDE.local.md` to the server
|
||||||
|
|
||||||
|
Both pass `--quiet` so they don't pollute Claude Code stdout, and trail with `|| true` so a server outage never blocks a session. Workspace-level (not user-home) so the hooks fire only when Claude Code opens this analyst workspace, not in unrelated sessions on the same machine.
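
A minimal sketch of how such an idempotent hook install could look, assuming the Claude Code `settings.json` layout of `hooks.<Event>[].hooks[]` entries with `type: "command"`. `install_hooks` and `HOOKS` are hypothetical names for illustration, not the actual `da analyst setup` implementation.

```python
import json
from pathlib import Path

# Commands as described above; the `|| true` suffix keeps a server outage
# from blocking the session.
HOOKS = {
    "SessionStart": "da sync --quiet || true",
    "SessionEnd": "da sync --upload-only --quiet || true",
}


def install_hooks(settings_path: Path) -> bool:
    """Add the two hooks if missing; return True when the file was changed."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    hooks = settings.setdefault("hooks", {})
    changed = False
    for event, command in HOOKS.items():
        entries = hooks.setdefault(event, [])
        already = any(
            h.get("command") == command
            for entry in entries
            for h in entry.get("hooks", [])
        )
        if not already:
            # Append, never overwrite: existing user-owned hooks stay in place.
            entries.append({"hooks": [{"type": "command", "command": command}]})
            changed = True
    if changed:
        settings_path.parent.mkdir(parents=True, exist_ok=True)
        settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return changed
```

Running it a second time finds both commands already present and leaves the file untouched, which is what makes the setup safe to repeat.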
|
||||||
|
|
||||||
|
Admin RBAC for auto-sync: `query_mode IN ('local', 'materialized')` plus a `resource_grants` row for one of the analyst's groups → table appears in their manifest → `da sync` downloads it. No per-user sync config; the admin layer is the single source of truth.
|
||||||
|
|
||||||
## Business Metrics
|
|
||||||
|
|
||||||
Standardized metric definitions live in DuckDB (`metric_definitions` table). Import starter pack:
|
|
||||||
|
|
@@ -416,7 +430,7 @@ Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `googl
|
||||||
## Key Implementation Details
|
|
||||||
|
|
||||||
### DuckDB Schema (src/db.py)
|
|
||||||
- Schema v19 with auto-migration v1→…→v19 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`** — see CHANGELOG and docs/RBAC.md)
|
- Schema v20 with auto-migration v1→…→v20 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)** — see CHANGELOG and docs/RBAC.md)
|
||||||
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
|
|
||||||
- `sync_state`, `sync_history`: track extraction progress
|
|
||||||
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
|
|
||||||
|
|
|
||||||
51
README.md
|
|
@@ -40,11 +40,14 @@ The orchestrator scans `/data/extracts/*/extract.duckdb`, attaches each into `an
|
||||||
|
|
||||||
## Supported Data Sources
|
|
||||||
|
|
||||||
| Source | Mode | Description |
|
| Mode | Distribution | Sources | Use when |
|
||||||
|--------|------|-------------|
|
|------|--------------|---------|----------|
|
||||||
| **Keboola** | Batch pull | DuckDB Keboola extension downloads tables to Parquet on a schedule |
|
| **Batch pull** (`local`) | Parquet on disk, scheduled | Keboola | Source has a native bulk-export and the table fits on disk |
|
||||||
| **BigQuery** | Remote attach | DuckDB BQ extension; queries execute in BigQuery, no local download |
|
| **Materialized SQL** (`materialized`) | Parquet on disk, scheduled query | BigQuery | Source table is too large to mirror; you want a curated subset on disk |
|
||||||
| **Jira** | Real-time push | Webhook receiver updates Parquet files incrementally |
|
| **Remote attach** (`remote`) | View only, no download | BigQuery | Table is too large to materialize; latency cost of remote query is acceptable |
|
||||||
|
| **Real-time push** | Incremental parquet | Jira | Source is event-driven and you need sub-minute freshness |
|
||||||
|
|
||||||
|
The first three modes are what `da sync` distributes to analysts. The fourth is server-side only — analysts query Jira data through the same `da sync`-distributed parquets.
|
||||||
|
|
||||||
Adding a new source means creating `connectors/<name>/extractor.py` that produces `extract.duckdb` with a `_meta` table (`table_name`, `description`, `rows`, `size_bytes`, `extracted_at`, `query_mode`). The orchestrator attaches it automatically.
|
|
||||||
|
|
||||||
|
|
@@ -77,6 +80,44 @@ Once running, the FastAPI app is available at `http://localhost:8000` (or `https
|
||||||
curl -X POST http://localhost:8000/api/sync/trigger
|
curl -X POST http://localhost:8000/api/sync/trigger
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Local sync & auto-update
|
||||||
|
|
||||||
|
Analysts run Claude Code against a local DuckDB built from RBAC-filtered parquets pulled from the server. `da sync` is the distribution path:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
da sync # delta-pull: manifest → MD5 compare → download changed → rebuild views
|
||||||
|
da sync --quiet # same, no progress output (for hooks/cron)
|
||||||
|
da sync --upload-only # push session jsonl + CLAUDE.local.md back to the server
|
||||||
|
```
|
||||||
|
|
||||||
|
`da analyst setup` writes Claude Code lifecycle hooks into `<workspace>/.claude/settings.json`:
|
||||||
|
|
||||||
|
- `SessionStart` → `da sync --quiet` — fresh data on every session
|
||||||
|
- `SessionEnd` → `da sync --upload-only --quiet` — uploads notes and session log
|
||||||
|
|
||||||
|
Hooks live at workspace level so they only fire in this analyst workspace, not in unrelated Claude Code sessions on the same machine.
|
||||||
|
|
||||||
|
### Admin: which tables auto-sync to whom
|
||||||
|
|
||||||
|
The auto-sync set per analyst is the intersection of:
|
||||||
|
|
||||||
|
1. Tables with `query_mode IN ('local', 'materialized')` — these have parquets on disk and end up in the manifest
|
||||||
|
2. Tables granted to one of the analyst's groups via `resource_grants(group, ResourceType.TABLE, table_id)` (see [`docs/RBAC.md`](docs/RBAC.md))
|
||||||
|
|
||||||
|
To enroll a new table for auto-sync, register it (or update its `query_mode`) and grant it to the relevant groups in `/admin/access`. New analysts get the same set on their next `da sync`.
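
The intersection above can be sketched in a few lines. This is an illustration only: `Table`, `auto_sync_set` and the `(group, table_id)` grant pairs are hypothetical stand-ins for `table_registry` and `resource_grants` rows, not the server's actual query.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

# Modes that produce a parquet on disk and therefore appear in the manifest.
SYNCED_MODES = {"local", "materialized"}


@dataclass(frozen=True)
class Table:
    id: str
    query_mode: str


def auto_sync_set(
    tables: Iterable[Table],
    grants: Iterable[Tuple[str, str]],  # (group, table_id) pairs
    analyst_groups: Set[str],
) -> Set[str]:
    """Return ids of tables that are both synced-mode and granted to the analyst."""
    granted = {table_id for group, table_id in grants if group in analyst_groups}
    return {t.id for t in tables if t.query_mode in SYNCED_MODES and t.id in granted}


tables = [
    Table("orders", "local"),
    Table("events", "remote"),
    Table("orders_90d", "materialized"),
]
grants = [("analysts", "orders"), ("analysts", "events"), ("analysts", "orders_90d")]
print(sorted(auto_sync_set(tables, grants, {"analysts"})))
```

Here `events` is granted but stays out of the set because `remote` rows have no parquet to download.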
|
||||||
|
|
||||||
|
For BigQuery, register a `query_mode='materialized'` table with a SQL body:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
da admin register-table orders_90d \
|
||||||
|
--source-type bigquery \
|
||||||
|
--query-mode materialized \
|
||||||
|
--query @docs/queries/orders_90d.sql \
|
||||||
|
--sync-schedule "every 6h"
|
||||||
|
```
|
||||||
|
|
||||||
|
The scheduler runs the query through the DuckDB BigQuery extension on each tick that's due, writes the result as a parquet, and the analyst picks it up on the next `da sync`. Cost guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB); a query whose BQ dry-run estimate exceeds the cap is skipped.
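
The guardrail's decision logic, as the CHANGELOG describes it (default 10 GiB, explicit `0` disables, YAML `null` falls back to the default, fail-open when the dry-run errors), can be sketched as follows. `resolve_cap`, `check_budget` and the `dry_run_bytes` callback are hypothetical names for illustration; the `MaterializeBudgetError` class here is a local stand-in for the project's exception, not its real definition.

```python
from typing import Callable, Optional

DEFAULT_MAX_BYTES = 10 * 1024**3  # 10 GiB, the documented default


class MaterializeBudgetError(Exception):
    """Stand-in: raised when the dry-run estimate exceeds the cap."""


def resolve_cap(configured: Optional[int]) -> Optional[int]:
    """Map the YAML value to an effective cap; None means the cap is disabled."""
    if configured is None:  # key omitted or YAML null -> default
        return DEFAULT_MAX_BYTES
    if configured == 0:  # explicit 0 disables the guardrail
        return None
    return configured


def check_budget(
    sql: str,
    configured: Optional[int],
    dry_run_bytes: Callable[[str], int],
    warn: Callable[[str], None] = print,
) -> bool:
    """Return True when the COPY may proceed; raise when over budget."""
    cap = resolve_cap(configured)
    if cap is None:
        return True
    try:
        estimate = dry_run_bytes(sql)
    except Exception as exc:
        # Fail-open: a broken dry-run logs a warning instead of blocking the COPY.
        warn(f"dry-run failed, proceeding without cap: {exc}")
        return True
    if estimate > cap:
        raise MaterializeBudgetError(f"estimate {estimate} B exceeds cap {cap} B")
    return True
```

A scheduler loop would catch `MaterializeBudgetError` per row and skip that table, so one expensive query does not stop the rest of the pass.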
|
||||||
|
|
||||||
## Development Setup
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
|
|
||||||
|
|
@@ -18,6 +18,65 @@ router = APIRouter(tags=["health"])
|
||||||
_DEPLOYED_AT = datetime.now(timezone.utc).isoformat()
|
_DEPLOYED_AT = datetime.now(timezone.utc).isoformat()
|
||||||
|
|
||||||
|
|
||||||
|
def _check_bq_billing_project() -> dict | None:
|
||||||
|
"""Surface the USER_PROJECT_DENIED footgun when a BQ instance has
|
||||||
|
`billing_project` falling back to (or explicitly equal to) `project`.
|
||||||
|
|
||||||
|
Background: connectors/bigquery/access.py:339-342 lets `billing` default
|
||||||
|
to `data` when `billing_project` is unset. A service account with
|
||||||
|
`roles/bigquery.dataViewer` on the data project but no
|
||||||
|
`serviceusage.services.use` on it then 403s on every BQ call with
|
||||||
|
USER_PROJECT_DENIED. The config is technically valid, so we warn rather
|
||||||
|
than error — the operator's billable project must be set distinctly.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
None when the check doesn't apply (non-BQ instance, or BQ deps missing).
|
||||||
|
A service-entry dict otherwise: {"status": "ok"} or
|
||||||
|
{"status": "warning", "detail": ..., "hint": ..., "billing_project": ...,
|
||||||
|
"data_project": ...}.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
from app.instance_config import get_data_source_type
|
||||||
|
except Exception:
|
||||||
|
return None
|
||||||
|
if (get_data_source_type() or "").lower() != "bigquery":
|
||||||
|
return None
|
||||||
|
|
||||||
|
try:
|
||||||
|
from connectors.bigquery.access import get_bq_access
|
||||||
|
bq = get_bq_access()
|
||||||
|
billing = bq.projects.billing
|
||||||
|
data = bq.projects.data
|
||||||
|
except Exception as e:
|
||||||
|
return {"status": "ok", "detail": f"could not resolve BQ projects: {e}"}
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
# not_configured sentinel — surfaced elsewhere; nothing to warn about here.
|
||||||
|
return {"status": "ok", "detail": "BigQuery project not configured"}
|
||||||
|
|
||||||
|
if billing == data:
|
||||||
|
return {
|
||||||
|
"status": "warning",
|
||||||
|
"detail": "BigQuery billing project equals data project",
|
||||||
|
"hint": (
|
||||||
|
"Set data_source.bigquery.billing_project in instance.yaml to a "
|
||||||
|
"project the SA can bill against (typically your dev/billable "
|
||||||
|
"project, distinct from a shared read-only data project). "
|
||||||
|
"Otherwise BQ calls 403 USER_PROJECT_DENIED whenever the SA "
|
||||||
|
"lacks serviceusage.services.use on the data project. "
|
||||||
|
"Configurable via /admin/server-config UI."
|
||||||
|
),
|
||||||
|
"billing_project": billing,
|
||||||
|
"data_project": data,
|
||||||
|
}
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"billing_project": billing,
|
||||||
|
"data_project": data,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
def _check_db_schema() -> dict:
|
def _check_db_schema() -> dict:
|
||||||
"""Check DB schema version against expected SCHEMA_VERSION.
|
"""Check DB schema version against expected SCHEMA_VERSION.
|
||||||
|
|
||||||
|
|
@ -103,6 +162,11 @@ async def health_check_detailed(
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
checks["users"] = {"status": "error", "detail": str(e)}
|
checks["users"] = {"status": "error", "detail": str(e)}
|
||||||
|
|
||||||
|
# BigQuery billing-project sanity check (USER_PROJECT_DENIED footgun).
|
||||||
|
bq_cfg = _check_bq_billing_project()
|
||||||
|
if bq_cfg is not None:
|
||||||
|
checks["bq_config"] = bq_cfg
|
||||||
|
|
||||||
overall = "healthy"
|
overall = "healthy"
|
||||||
for check in checks.values():
|
for check in checks.values():
|
||||||
if check.get("status") == "error":
|
if check.get("status") == "error":
|
||||||
|
|
|
||||||
|
|
@ -51,3 +51,28 @@ The `_meta` table must have columns:
|
||||||
- Instance-level config: `config/instance.yaml` (connection details)
|
- Instance-level config: `config/instance.yaml` (connection details)
|
||||||
- Table definitions: DuckDB `table_registry` table
|
- Table definitions: DuckDB `table_registry` table
|
||||||
- Credentials: environment variables
|
- Credentials: environment variables
|
||||||
|
|
||||||
|
## BigQuery: pick a mode
|
||||||
|
|
||||||
|
| Need | Mode | Why |
|
||||||
|
|------|------|-----|
|
||||||
|
| Latency under 100 ms, table fits on disk | `materialized` | Local parquet, no BQ roundtrip |
|
||||||
|
| Table too large for analyst's disk, occasional ad-hoc query | `remote` | DuckDB BQ extension, no download |
|
||||||
|
| Table too large for disk AND analyst hits it constantly | `materialized` with aggregation/filter | Scheduled COPY of a slice |
|
||||||
|
| One-off subquery joined with local data | (no registry row) | Use `da query --register-bq …` for ad-hoc |
|
||||||
|
|
||||||
|
Cost: `materialized` runs once per `sync_schedule` regardless of how many analysts query it; `remote` runs once per analyst-query. The break-even is roughly query frequency × bytes scanned vs. one COPY × bytes scanned.
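
A back-of-the-envelope comparison may help when picking between the two modes. The sketch below only compares bytes scanned per day; the run counts and scan sizes are made-up illustrative numbers, and the helper names are hypothetical.

```python
GIB = 1024 ** 3


def daily_bytes_materialized(runs_per_day: int, bytes_per_copy: int) -> int:
    """Bytes scanned per day by the scheduled COPY, independent of analyst count."""
    return runs_per_day * bytes_per_copy


def daily_bytes_remote(queries_per_day: int, bytes_per_query: int) -> int:
    """Bytes scanned per day when every analyst query goes to BigQuery."""
    return queries_per_day * bytes_per_query


# Assumed workload: a 6-hourly schedule (4 runs/day) scanning 2 GiB per COPY,
# versus 30 analyst queries per day scanning 0.5 GiB each.
mat = daily_bytes_materialized(4, 2 * GIB)
rem = daily_bytes_remote(30, GIB // 2)

cheaper = "materialized" if mat < rem else "remote"
print(f"materialized={mat / GIB:.1f} GiB/day, remote={rem / GIB:.1f} GiB/day -> {cheaper}")
```

With these assumed numbers the scheduled COPY scans 8 GiB per day against 15 GiB for remote queries, so `materialized` comes out ahead; a rarely queried table flips the result.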
|
||||||
|
|
||||||
|
Guardrail: `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB) blocks the COPY when BQ's dry-run estimate exceeds the cap. Set it explicitly per environment in `instance.yaml`.
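
The decision the guardrail makes can be sketched roughly as follows. This is not the project's implementation: `should_materialize` is a hypothetical helper, and the `0` (disabled), `null` (fall through to default), and fail-open-on-dry-run-error behaviours are taken from the config comments and architecture notes in this same change.

```python
from typing import Optional

DEFAULT_MAX_BYTES = 10 * 1024 ** 3  # 10 GiB = 10737418240


def should_materialize(estimated_bytes: Optional[int],
                       max_bytes: Optional[int] = None) -> bool:
    """Return True when the COPY may proceed.

    estimated_bytes: BQ dry-run estimate, or None if the dry-run itself failed.
    max_bytes: configured cap; None means "use the default", 0 means "disabled".
    """
    cap = DEFAULT_MAX_BYTES if max_bytes is None else max_bytes
    if cap == 0:
        return True  # guardrail disabled
    if estimated_bytes is None:
        return True  # fail open: no estimate available, proceed (and log)
    return estimated_bytes <= cap  # block only when the estimate exceeds the cap


print(should_materialize(5 * 1024 ** 3))        # True
print(should_materialize(20 * 1024 ** 3))       # False
print(should_materialize(20 * 1024 ** 3, 0))    # True
```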
|
||||||
|
|
||||||
|
Register a materialized table:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
da admin register-table orders_90d \
|
||||||
|
--source-type bigquery \
|
||||||
|
--query-mode materialized \
|
||||||
|
--query @docs/queries/orders_90d.sql \
|
||||||
|
--schedule "every 6h"
|
||||||
|
```
|
||||||
|
|
||||||
|
`--query` also accepts inline SQL.
|
||||||
|
|
|
||||||
|
|
@ -115,17 +115,30 @@ data_source:
|
||||||
location: "${BIGQUERY_LOCATION}" # BigQuery location (e.g., "us-central1", "US")
|
location: "${BIGQUERY_LOCATION}" # BigQuery location (e.g., "us-central1", "US")
|
||||||
# Uses ADC (Application Default Credentials) - VM service account on GCP
|
# Uses ADC (Application Default Credentials) - VM service account on GCP
|
||||||
# Data can live in a different project -- use fully-qualified table IDs in data_description.md
|
# Data can live in a different project -- use fully-qualified table IDs in data_description.md
|
||||||
# billing_project: "" # Optional: GCP project to bill BQ jobs to / submit jobs from.
|
# billing_project: "prj-billing" # GCP project to bill BQ jobs to / submit jobs from.
|
||||||
# # Defaults to `project`. Set this when the SA has bigquery.data.* on
|
# # Defaults to `project`. Set when the SA has bigquery.data.* on
|
||||||
# # the data project but lacks serviceusage.services.use there (i.e.,
|
# # the data project but lacks serviceusage.services.use there.
|
||||||
# # cross-project read pattern). Submission/billing target must be a
|
# # Mismatch -> every BQ call 403 USER_PROJECT_DENIED.
|
||||||
# # project the SA can use; data project just needs read.
|
# # `da diagnose` warns when this falls back to `project`.
|
||||||
# legacy_wrap_views: false # Set true to restore pre-v2 wrap views for BQ VIEW/MATERIALIZED_VIEW
|
# # Configurable via /admin/server-config UI.
|
||||||
# # tables in analytics.duckdb (migration escape hatch; default: false)
|
# legacy_wrap_views: false # When true, registered VIEWs and MATERIALIZED_VIEWs get a DuckDB
|
||||||
|
# # master view via bigquery_query() (jobs API) so analysts can
|
||||||
|
# # `SELECT * FROM viewname` directly. When false (default), views
|
||||||
|
# # are catalog-only -- analysts use `da fetch viewname` or
|
||||||
|
# # `da query --remote`. ON can cause "Response too large" on big
|
||||||
|
# # views; OFF requires analyst-side discipline (CLAUDE.md rails).
|
||||||
|
# # Toggle ON for view-heavy deployments where most views are small.
|
||||||
|
# # Configurable via /admin/server-config UI.
|
||||||
|
# max_bytes_per_materialize: 10737418240
|
||||||
|
# # Cost guardrail (bytes) for query_mode='materialized' BQ scans.
|
||||||
|
# # Dry-run check before running; exceeding -> registration / sync
|
||||||
|
# # rejected. Default 10 GiB (10737418240). Set 0 to disable.
|
||||||
|
# # null falls through to default. Configurable via /admin/server-config UI.
|
||||||
|
|
||||||
# --- OpenMetadata catalog (optional) ---
|
# --- OpenMetadata catalog (optional) ---
|
||||||
# Enriches table and column metadata from OpenMetadata REST API.
|
# Enriches table and column metadata from OpenMetadata REST API.
|
||||||
# If not configured, app works normally without catalog enrichment.
|
# If not configured, app works normally without catalog enrichment.
|
||||||
|
# All openmetadata.* fields configurable via /admin/server-config UI.
|
||||||
# openmetadata:
|
# openmetadata:
|
||||||
# url: "https://your-catalog.example.com"
|
# url: "https://your-catalog.example.com"
|
||||||
# token: "${OPENMETADATA_TOKEN}" # JWT bearer token
|
# token: "${OPENMETADATA_TOKEN}" # JWT bearer token
|
||||||
|
|
@ -147,6 +160,7 @@ email:
|
||||||
smtp_password: "${SMTP_PASSWORD}"
|
smtp_password: "${SMTP_PASSWORD}"
|
||||||
|
|
||||||
# --- Desktop app (optional) ---
|
# --- Desktop app (optional) ---
|
||||||
|
# All desktop.* fields configurable via /admin/server-config UI (rarely changed once set).
|
||||||
desktop:
|
desktop:
|
||||||
jwt_issuer: "data-analyst"
|
jwt_issuer: "data-analyst"
|
||||||
jwt_secret: "${DESKTOP_JWT_SECRET}"
|
jwt_secret: "${DESKTOP_JWT_SECRET}"
|
||||||
|
|
@ -174,7 +188,9 @@ jira:
|
||||||
ai:
|
ai:
|
||||||
provider: "anthropic" # or "openai_compat"
|
provider: "anthropic" # or "openai_compat"
|
||||||
api_key: "${ANTHROPIC_API_KEY}" # or "${LLM_API_KEY}" for proxy
|
api_key: "${ANTHROPIC_API_KEY}" # or "${LLM_API_KEY}" for proxy
|
||||||
# base_url: "https://litellm.example.com" # required for openai_compat
|
# base_url: "https://litellm.example.com" # Required for provider='openai_compat' (LiteLLM,
|
||||||
|
# OpenRouter, vLLM). Ignored when provider='anthropic'.
|
||||||
|
# Configurable via /admin/server-config UI.
|
||||||
model: "claude-haiku-4-5-20251001" # any model available on your provider
|
model: "claude-haiku-4-5-20251001" # any model available on your provider
|
||||||
# --- Structured output quality control ---
|
# --- Structured output quality control ---
|
||||||
# AI models can return JSON in three ways, each with different reliability:
|
# AI models can return JSON in three ways, each with different reliability:
|
||||||
|
|
@ -225,6 +241,10 @@ ai:
|
||||||
# Controls how AI-extracted knowledge is reviewed and distributed.
|
# Controls how AI-extracted knowledge is reviewed and distributed.
|
||||||
# If not present, system operates in legacy mode (democratic wiki, no admin review).
|
# If not present, system operates in legacy mode (democratic wiki, no admin review).
|
||||||
#
|
#
|
||||||
|
# The corporate_memory.* schema is editable via /admin/server-config UI; you can
|
||||||
|
# also continue to manage it via this YAML file. The UI surfaces every leaf with
|
||||||
|
# a hint, so use it to discover the schema if this comment block has aged.
|
||||||
|
#
|
||||||
# corporate_memory:
|
# corporate_memory:
|
||||||
# # How knowledge reaches users:
|
# # How knowledge reaches users:
|
||||||
# # "mandatory_only" — admin controls everything, no user voting
|
# # "mandatory_only" — admin controls everything, no user voting
|
||||||
|
|
|
||||||
|
|
@ -173,11 +173,20 @@ POST /api/sync/trigger (admin)
|
||||||
|
|
||||||
`connectors/bigquery/extractor.py`
|
`connectors/bigquery/extractor.py`
|
||||||
|
|
||||||
- Uses the DuckDB BigQuery community extension.
|
- Uses the DuckDB BigQuery community extension via the `BqAccess` facade in `connectors/bigquery/access.py`.
|
||||||
- No data download — views proxy all queries directly to BigQuery.
|
- No data download — views proxy all queries directly to BigQuery.
|
||||||
- Auth via `GOOGLE_APPLICATION_CREDENTIALS` (service account JSON) or ADC.
|
- Auth via `GOOGLE_APPLICATION_CREDENTIALS` (service account JSON) or ADC.
|
||||||
- Populates `_remote_attach` with `extension='bigquery'` and no `token_env` (env-based auth).
|
- Populates `_remote_attach` with `extension='bigquery'` and no `token_env` (env-based auth).
|
||||||
|
|
||||||
|
### BigQuery — Materialized SQL
|
||||||
|
|
||||||
|
`connectors/bigquery/extractor.py::materialize_query` (added in v0.25.0)
|
||||||
|
|
||||||
|
- Runs admin-registered SQL through the DuckDB BigQuery extension via `BqAccess.duckdb_session()` and writes the result to `/data/extracts/bigquery/data/<id>.parquet` atomically (`<id>.parquet.tmp` → `os.replace`).
|
||||||
|
- Triggered by `_run_materialized_pass` in `app/api/sync.py` between custom-connectors and orchestrator rebuild on every `/api/sync/trigger`. Per-table `sync_schedule` honored via `is_table_due()`.
|
||||||
|
- Cost guardrail: BQ dry-run via `app.api.v2_scan._bq_dry_run_bytes` (single source of truth for cost-estimate logic). `data_source.bigquery.max_bytes_per_materialize` (default 10 GiB; `0` disables). Fail-open when dry-run errors (DuckDB three-part syntax the native BQ client can't parse) — log warning + proceed.
|
||||||
|
- Distribution: result parquet rides the same manifest + `da sync` flow as Keboola tables. Per-user RBAC unchanged (`resource_grants(group, ResourceType.TABLE, table_id)`).
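
The atomic-write step (`<id>.parquet.tmp` → `os.replace`) follows a common pattern, sketched below with the standard library only. `write_atomically` is a hypothetical helper for illustration; the real `materialize_query` writes parquet through DuckDB rather than raw bytes.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(final_path: Path, payload: bytes) -> None:
    """Write to '<name>.tmp' next to the target, then swap it into place.

    os.replace is atomic when source and destination are on the same
    filesystem, so readers see either the old file or the new one,
    never a partially written file.
    """
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename
    os.replace(tmp_path, final_path)


with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "orders_90d.parquet"
    write_atomically(target, b"placeholder bytes, not a real parquet")
    print(target.exists(), (Path(d) / "orders_90d.parquet.tmp").exists())
```

Keeping the temporary file in the same directory as the target is what makes the final rename atomic; a temp file on another filesystem would turn the swap into a copy.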
|
||||||
|
|
||||||
### Jira — Real-Time Push
|
### Jira — Real-Time Push
|
||||||
|
|
||||||
`connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py`
|
`connectors/jira/webhook.py` → `incremental_transform.py` → `extract_init.py`
|
||||||
|
|
|
||||||
2515
docs/superpowers/plans/2026-05-01-admin-tables-form-cleanup.md
Normal file
File diff suppressed because it is too large
|
|
@ -2,8 +2,8 @@
|
||||||
|
|
||||||
Used to exercise the /admin/access UI with the new ResourceType.TABLE
|
Used to exercise the /admin/access UI with the new ResourceType.TABLE
|
||||||
without depending on a real data source. Each entry is registered with
|
without depending on a real data source. Each entry is registered with
|
||||||
``is_public=False`` so per-group grants are meaningful (a public table
|
default RBAC (no `is_public` bypass — that column was dropped in v19),
|
||||||
would bypass any future enforcement).
|
so per-group grants are required for analyst visibility.
|
||||||
|
|
||||||
Idempotent — TableRegistryRepository.register() does an UPSERT via
|
Idempotent — TableRegistryRepository.register() does an UPSERT via
|
||||||
ON CONFLICT, so re-running this script just refreshes the rows.
|
ON CONFLICT, so re-running this script just refreshes the rows.
|
||||||
|
|
@ -65,7 +65,6 @@ def main() -> None:
|
||||||
query_mode="local",
|
query_mode="local",
|
||||||
description=description,
|
description=description,
|
||||||
registered_by="seed_dummy_tables",
|
registered_by="seed_dummy_tables",
|
||||||
is_public=False,
|
|
||||||
profile_after_sync=False,
|
profile_after_sync=False,
|
||||||
)
|
)
|
||||||
after = len(repo.list_all())
|
after = len(repo.list_all())
|
||||||
|
|
|
||||||
111
tests/test_diagnose_billing.py
Normal file
|
|
@ -0,0 +1,111 @@
|
||||||
|
"""Phase K — `da diagnose` warning when BQ billing_project == project.
|
||||||
|
|
||||||
|
Surfaces via /api/health/detailed (which `da diagnose` already consumes):
|
||||||
|
when data_source.type == 'bigquery' and the resolved BqProjects.billing equals
|
||||||
|
BqProjects.data, the response includes a `services.bq_config` entry with
|
||||||
|
status='warning' and a hint about the 403 USER_PROJECT_DENIED footgun.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
def _auth(token: str) -> dict:
|
||||||
|
return {"Authorization": f"Bearer {token}"}
|
||||||
|
|
||||||
|
|
||||||
|
def _patch_instance_config(monkeypatch, cfg: dict) -> None:
|
||||||
|
"""Replace app.instance_config.load_instance_config + reset caches.
|
||||||
|
|
||||||
|
Also clears connectors.bigquery.access.get_bq_access's @functools.cache
|
||||||
|
so each test sees fresh BqProjects.
|
||||||
|
"""
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.load_instance_config",
|
||||||
|
lambda: cfg,
|
||||||
|
raising=False,
|
||||||
|
)
|
||||||
|
# DATA_SOURCE env var, if set in the user shell, would override
|
||||||
|
# get_data_source_type — strip it for deterministic tests.
|
||||||
|
monkeypatch.delenv("DATA_SOURCE", raising=False)
|
||||||
|
monkeypatch.delenv("BIGQUERY_PROJECT", raising=False)
|
||||||
|
|
||||||
|
from app.instance_config import reset_cache
|
||||||
|
reset_cache()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(autouse=True)
|
||||||
|
def _reset_after(monkeypatch):
|
||||||
|
yield
|
||||||
|
# Always reset the cache after each test so the next test (or an
|
||||||
|
# unrelated suite running afterwards) sees fresh config.
|
||||||
|
try:
|
||||||
|
from app.instance_config import reset_cache
|
||||||
|
reset_cache()
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def test_diagnose_warns_when_billing_equals_project(seeded_app, monkeypatch):
|
||||||
|
"""BQ instance with billing_project missing (or equal to project) → warning."""
|
||||||
|
_patch_instance_config(monkeypatch, {
|
||||||
|
"data_source": {
|
||||||
|
"type": "bigquery",
|
||||||
|
"bigquery": {
|
||||||
|
"project": "shared-data-prod",
|
||||||
|
"billing_project": "shared-data-prod",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
})
|
||||||
|
|
||||||
|
c = seeded_app["client"]
|
||||||
|
token = seeded_app["admin_token"]
|
||||||
|
r = c.get("/api/health/detailed", headers=_auth(token))
|
||||||
|
assert r.status_code == 200, r.text
|
||||||
|
body = r.json()
|
||||||
|
|
||||||
|
bq_cfg = body.get("services", {}).get("bq_config")
|
||||||
|
assert bq_cfg is not None, body
|
||||||
|
assert bq_cfg.get("status") == "warning", bq_cfg
|
||||||
|
# Hint mentions the YAML field path so operators know what to fix.
|
||||||
|
blob = (str(bq_cfg.get("detail", "")) + " " + str(bq_cfg.get("hint", ""))).lower()
|
||||||
|
assert "billing_project" in blob, bq_cfg
|
||||||
|
|
||||||
|
|
||||||
|
def test_diagnose_clean_when_billing_differs(seeded_app, monkeypatch):
|
||||||
|
"""Distinct billing_project → no warning surfaced."""
|
||||||
|
_patch_instance_config(monkeypatch, {
|
||||||
|
"data_source": {
|
||||||
|
"type": "bigquery",
|
||||||
|
"bigquery": {
|
||||||
|
"project": "data-prod",
|
||||||
|
"billing_project": "billing-dev",
|
||||||
|
},
|
||||||
|
},
|
||||||
|
})
|
||||||
|
|
||||||
|
c = seeded_app["client"]
|
||||||
|
token = seeded_app["admin_token"]
|
||||||
|
r = c.get("/api/health/detailed", headers=_auth(token))
|
||||||
|
assert r.status_code == 200, r.text
|
||||||
|
body = r.json()
|
||||||
|
|
||||||
|
bq_cfg = body.get("services", {}).get("bq_config")
|
||||||
|
# If present, it must be ok; absence is also fine (means no warning).
|
||||||
|
if bq_cfg is not None:
|
||||||
|
assert bq_cfg.get("status") == "ok", bq_cfg
|
||||||
|
|
||||||
|
|
||||||
|
def test_diagnose_no_warning_on_keboola_instance(seeded_app, monkeypatch):
|
||||||
|
"""Non-BQ instance: BQ billing check shouldn't surface at all."""
|
||||||
|
_patch_instance_config(monkeypatch, {"data_source": {"type": "keboola"}})
|
||||||
|
|
||||||
|
c = seeded_app["client"]
|
||||||
|
token = seeded_app["admin_token"]
|
||||||
|
r = c.get("/api/health/detailed", headers=_auth(token))
|
||||||
|
assert r.status_code == 200, r.text
|
||||||
|
body = r.json()
|
||||||
|
|
||||||
|
# Either absent or explicitly status='ok' (n/a). Definitely not 'warning'.
|
||||||
|
bq_cfg = body.get("services", {}).get("bq_config")
|
||||||
|
if bq_cfg is not None:
|
||||||
|
assert bq_cfg.get("status") != "warning", bq_cfg
|
||||||