merge: pull #174 (BQ materialize view fix + concurrency, 0.33.0) into bootstrap branch

Brings in zs/materialize-sync-fix (PR #174):
- BigQuery view materialize works (wrap admin SQL in bigquery_query())
- Per-table mutex + fcntl.flock for concurrent COPY corruption
- Cost guardrail dry-run engages on materialized rows
- Schema v23 -> v24 migration: rewrite source_query to BQ-native
- Server-generated trivial source_query from bucket+source_table
- Validator backtick relaxation for materialized rows
- 0.33.0 release cut

Conflict resolution:
- CHANGELOG.md: keep our [Unreleased] (bootstrap rewrite content) ABOVE
  the new [0.33.0] section from #174. The bootstrap rewrite remains
  unreleased; it'll cut 0.34.0 (or later) when this PR merges to main.
- tests/conftest.py: union; keep our analyst-bootstrap fixture re-export
  AND #174's bq_instance / stub_bq_extractor fixtures.
- pyproject.toml auto-merged to 0.33.0 (matches the cut), correct.
- src/db.py auto-merged: SCHEMA_VERSION = 24, _v23_to_v24_finalize added;
  no overlap with our work, which left schema at v23.
- CLAUDE.md auto-merged: schema-history paragraph extended with v24.

Verified: 79/79 across CLI bootstrap suite + materialize suite + schema
v24 migration tests pass locally on Python 3.13/macOS.
commit e438170ade
23 changed files with 1607 additions and 259 deletions

83 CHANGELOG.md
@ -54,6 +54,89 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup?role=analyst` page n
- `tests/test_clean_install_integration.py` — end-to-end happy-path tests (minimal grants, zero grants, force preserves CLAUDE.local.md, readers in pre-init dir).
- `docs/RELEASE_CHECKLIST.md` — manual clean-install protocol mandated for any PR touching the bootstrap path.

## [0.33.0] — 2026-05-04

Closes #162. Headline fix: `query_mode='materialized'` BigQuery rows now
materialize correctly for views and materialized views, with per-table
concurrency control preventing parquet corruption on overlapping scheduler
ticks. Plus a source_query server-generation convenience, a
`materialize.lock_ttl_seconds` config knob, and a schema v24 migration that
converts existing DuckDB-flavor source_query values to BQ-native SQL.

### Fixed

- BigQuery materialize now works for views and materialized views. Pre-fix,
  `materialize_query` ran admin's `source_query` as `COPY (sql) TO parquet`
  through the DuckDB BigQuery extension session, which routed through the BQ
  Storage Read API for `bq."<ds>"."<tbl>"` references. Storage Read API
  rejects non-base entities (`Binder Error: Error while creating read session:
  ... non-table entities cannot be read with the storage API`). Fixed by
  always wrapping admin SQL into `bigquery_query('<billing-project>',
  '<inner-sql>')` so COPY uses the BQ jobs API uniformly for tables, views,
  and materialized views.
- `materialize_query` no longer corrupts its parquet under concurrent
  invocations for the same `table_id`. Pre-fix, two overlapping
  `_run_materialized_pass` calls (e.g. a long-running COPY + the next
  scheduler tick) both hit the unconditional `if tmp_path.exists():
  tmp_path.unlink()` at function entry and started parallel COPYs against the
  same path, interleaving bytes and producing a parquet file with no valid
  footer. Now each call acquires a per-table_id `threading.Lock` plus an
  advisory `fcntl.flock` on `<id>.parquet.lock`; the second caller raises
  `MaterializeInFlightError` and the scheduler treats it as
  `skipped, in_flight`, never as an error.
- Cost guardrail dry-run now engages for materialized rows. Pre-fix, the
  BigQuery Python client returned 400 (`Table-valued function not found:
  bigquery_query`) on the wrapped SQL and the dry-run silently failed open.
  The dry-run now operates on the inner BQ-native SQL (admin's `source_query`
  directly), which the client parses cleanly.
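
The wrapping described in the first fix can be sketched as below. This is a minimal illustration, not the project's code: the helper names are hypothetical stand-ins for the underscore-prefixed helpers the release exports, and the sketch assumes standard SQL quote-doubling is the right escape for a DuckDB string literal. Only the `bigquery_query('<billing-project>', '<inner-sql>')` call shape comes from the entry above.

```python
def escape_sql_string_literal(value: str) -> str:
    """Escape a value for embedding inside a single-quoted SQL literal.

    Assumption: quote-doubling (standard SQL) is sufficient here.
    """
    return value.replace("'", "''")


def wrap_admin_sql_for_jobs_api(billing_project: str, inner_sql: str) -> str:
    """Wrap BigQuery-native SQL in a bigquery_query() table-function call.

    Hypothetical helper: the outer SELECT is what a COPY statement would
    consume, so tables, views, and materialized views all take the same path.
    """
    project = escape_sql_string_literal(billing_project)
    sql = escape_sql_string_literal(inner_sql)
    return f"SELECT * FROM bigquery_query('{project}', '{sql}')"


def build_copy_statement(billing_project: str, inner_sql: str, parquet_path: str) -> str:
    """Build the COPY statement that writes the wrapped query to parquet."""
    wrapped = wrap_admin_sql_for_jobs_api(billing_project, inner_sql)
    target = escape_sql_string_literal(parquet_path)
    return f"COPY ({wrapped}) TO '{target}' (FORMAT PARQUET)"
```

Because the inner SQL travels as a string literal, any single quote in the admin's query has to be escaped before wrapping. A dry-run cost check would use `inner_sql` unwrapped, as the third fix describes.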

### Changed

- **BREAKING** `query_mode='materialized'` rows MUST register `source_query`
  as BigQuery-native SQL (backticks for dashed identifiers, native
  joins/CTEs). DuckDB-flavor (`bq."<ds>"."<tbl>"`) is no longer accepted on
  register/PUT. The schema v24 migration converts existing rows automatically;
  operators with custom-written `source_query` should review the migrated form
  on first deploy. The validator's prior backtick-rejection rule is now scoped
  to `query_mode IN ('remote', 'local')` only.
- `_run_materialized_pass` summary `skipped` field changes from `list[str]`
  to `list[dict]` with shape
  `{"table": str, "reason": Literal["due_check", "in_flight"]}`. Downstream
  consumers that asserted the old string form must update.
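
For downstream consumers, the new `skipped` shape could be handled along these lines. This is a sketch under assumptions: the `SkippedEntry` type and both helper functions are hypothetical names introduced for illustration. Only the dict shape and the two reason values come from the entry above.

```python
from typing import Literal, TypedDict


class SkippedEntry(TypedDict):
    """One element of the summary's `skipped` list (shape from the changelog)."""
    table: str
    reason: Literal["due_check", "in_flight"]


def count_skipped_by_reason(skipped: list[SkippedEntry]) -> dict[str, int]:
    """Tally skipped tables per reason, e.g. for a one-line log summary."""
    counts = {"due_check": 0, "in_flight": 0}
    for entry in skipped:
        counts[entry["reason"]] += 1
    return counts


def skipped_table_names(skipped: list[SkippedEntry]) -> list[str]:
    """Recover the old list[str] view for callers that only need names."""
    return [entry["table"] for entry in skipped]
```

A consumer that previously asserted `"orders" in summary["skipped"]` could assert against `skipped_table_names(summary["skipped"])` instead.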

### Added

- `POST /api/admin/register-table` for `query_mode='materialized'` rows with
  `bucket`+`source_table` but no `source_query` now server-generates
  `` SELECT * FROM `<project>.<bucket>.<source_table>` `` from the configured
  BigQuery project. The same fallback fires on `PUT /api/admin/registry/{id}`
  when flipping to materialized. Operators only need to know
  `bigquery_query()` semantics for non-trivial queries.
- New top-level `materialize` config section in `instance.yaml`. Its single
  field, `materialize.lock_ttl_seconds` (default `86400`, 24 h), controls how
  long a stale `<id>.parquet.lock` file lives before a sibling materialize
  attempt reclaims it. Editable via `/admin/server-config` API and UI.
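
The locking scheme behind `materialize.lock_ttl_seconds` can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `connectors/bigquery/extractor.py`: `get_table_lock`, `is_lock_stale`, and `materialize_lock` are hypothetical names, the local `MaterializeInFlightError` class only mirrors the real exception's name, and how the real code combines the mtime check with the flock is not shown in this changelog. `fcntl` is Unix-only.

```python
import fcntl  # Unix-only advisory file locking
import os
import threading
import time
from contextlib import contextmanager
from pathlib import Path


class MaterializeInFlightError(RuntimeError):
    """Raised when another materialize run already holds the table's lock."""


_table_locks: dict[str, threading.Lock] = {}
_table_locks_guard = threading.Lock()


def get_table_lock(table_id: str) -> threading.Lock:
    """Return the process-wide lock for one table_id, creating it on first use."""
    with _table_locks_guard:
        return _table_locks.setdefault(table_id, threading.Lock())


def is_lock_stale(lock_path: Path, ttl_seconds: int = 86400) -> bool:
    """True when the lock file's mtime is older than the TTL."""
    try:
        age = time.time() - lock_path.stat().st_mtime
    except FileNotFoundError:
        return False
    return age > ttl_seconds


@contextmanager
def materialize_lock(table_id: str, output_dir: Path):
    """Hold the in-process lock and a non-blocking flock for one table."""
    thread_lock = get_table_lock(table_id)
    if not thread_lock.acquire(blocking=False):
        raise MaterializeInFlightError(table_id)
    try:
        lock_path = output_dir / f"{table_id}.parquet.lock"
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError as exc:
                raise MaterializeInFlightError(table_id) from exc
            os.utime(lock_path)  # refresh mtime so the TTL clock restarts
            yield lock_path
        finally:
            os.close(fd)  # closing the descriptor releases the flock
    finally:
        thread_lock.release()
```

The in-process lock covers threads of one worker, while the flock covers sibling processes sharing the output directory. A caller would catch `MaterializeInFlightError` and record the table as skipped rather than as an error.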

### Internal

- Schema v24 migration: rewrites `table_registry.source_query` for
  materialized BigQuery rows from DuckDB-flavor (`bq."<ds>"."<tbl>"`) to
  BQ-native (`` `<project>.<ds>.<tbl>` ``) using the configured BQ project.
  Idempotent on already-converted rows; logs a warning and skips when the
  project isn't configured (operator can configure + restart for retry).
  Wrapped in `BEGIN TRANSACTION` / `COMMIT` to match the project's
  transactional-finalizer pattern.
- `connectors/bigquery/extractor.py` exports `MaterializeInFlightError` and
  the `_get_table_lock` / `_get_lock_ttl_seconds` /
  `_wrap_admin_sql_for_jobs_api` / `_escape_sql_string_literal` helpers as
  test seams. Underscore-prefixed; not part of the public API.
- `tests/conftest.py` lifts `bq_instance` and `stub_bq_extractor` fixtures
  from `tests/test_api_admin_materialized.py` so subsequent test modules in
  this PR can resolve them via pytest's auto-discovery.
- `app/api/sync.py:is_table_due` hoisted to module-level import (was deferred
  inside `_run_materialized_pass`) so monkeypatching `app.api.sync.is_table_due`
  actually intercepts the call; the deferred form made test patches a no-op.
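
The identifier rewrite performed by the v24 migration might look roughly like this. It is a sketch only: the function name and the regular expression are assumptions, and the real migration's matching rules are not shown in this changelog. In particular, this pattern does not try to skip matches inside SQL string literals.

```python
import re

# Matches DuckDB-flavor references of the form bq."<dataset>"."<table>".
_DUCKDB_BQ_REF = re.compile(r'\bbq\."([^"]+)"\."([^"]+)"')


def rewrite_source_query_to_bq_native(source_query: str, project: str) -> str:
    """Rewrite bq."ds"."tbl" references to backticked `project.ds.tbl`.

    Already-converted SQL contains no bq."..." references, so running the
    function a second time leaves it unchanged (idempotent).
    """
    return _DUCKDB_BQ_REF.sub(
        lambda m: f"`{project}.{m.group(1)}.{m.group(2)}`",
        source_query,
    )
```

A migration step built on this would skip rows when no project is configured, matching the warn-and-skip behavior described above.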

## [0.32.0] — 2026-05-04

Closes #160. Headline fix: `da query --remote` now resolves

@ -443,7 +443,7 @@ Module sets `lifecycle { ignore_changes = [metadata_startup_script] }` on `googl
|
||||||
## Key Implementation Details
|
## Key Implementation Details
|
||||||
|
|
||||||
### DuckDB Schema (src/db.py)
|
### DuckDB Schema (src/db.py)
|
||||||
- Schema v23 with auto-migration v1→…→v23 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)**, **v21 adds `welcome_template` singleton table backing the Agent Setup Prompt admin override (`/admin/agent-prompt`)**, **v22 reserves the `setup_banner` table — feature dropped mid-development; table retained for forward compatibility with already-migrated instances**, **v23 adds `claude_md_template` singleton table backing the Agent Workspace Prompt admin override (`/admin/workspace-prompt`)** — see CHANGELOG and docs/RBAC.md)
|
- Schema v24 with auto-migration v1→…→v24 (v5 adds `users.active`, v6 adds `personal_access_tokens`, v7 adds `personal_access_tokens.last_used_ip`, v8/v9 added the legacy internal_roles/role-grants tables, v10 added `view_ownership` for cross-connector view-name collision detection (issue #81 Group C), v11 added marketplace_registry + marketplace_plugins + user_groups + plugin_access, v12 added users.groups JSON + user_groups.is_system, **v13 replaces internal_roles/group_mappings/user_role_grants/plugin_access with user_group_members + resource_grants and drops users.groups JSON**, v14 adds FK constraints on user_group_members + resource_grants after orphan cleanup, v15 adds knowledge_items context-engineering columns + contradictions + session_extraction_state, v16 adds verification_evidence, v17 adds knowledge_item_relations, v18 drops stranded non-google memberships from google-managed groups, **v19 drops legacy `dataset_permissions`, `access_requests` tables and `users.role`, `table_registry.is_public` columns — table access is now exclusively per-group via `resource_grants(resource_type='table')`**, **v20 adds `source_query` TEXT to `table_registry` to back `query_mode='materialized'` (BigQuery scheduled-query parquet path)**, **v21 adds `welcome_template` singleton table backing the Agent Setup Prompt admin override (`/admin/agent-prompt`)**, **v22 reserves the `setup_banner` table — feature dropped mid-development; table retained for forward compatibility with already-migrated instances**, **v23 adds `claude_md_template` singleton table backing the Agent Workspace Prompt admin override (`/admin/workspace-prompt`)**, **v24 rewrites materialized BQ `source_query` from DuckDB-flavor `bq."ds"."t"` to BQ-native `` `<project>.ds.t` `` so the new wrapping path accepts them; idempotent + warns when project unconfigured** — see CHANGELOG and docs/RBAC.md)
|
||||||
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
|
- `table_registry`: id, name, source_type, bucket, source_table, query_mode, sync_schedule, etc.
|
||||||
- `sync_state`, `sync_history`: track extraction progress
|
- `sync_state`, `sync_history`: track extraction progress
|
||||||
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
|
- `users`, `audit_log`: account state + audit trail. RBAC lives in `user_groups` + `user_group_members` + `resource_grants`.
|
||||||
|
|
|
||||||
189
app/api/admin.py
189
app/api/admin.py
|
|
@ -146,6 +146,38 @@ def _validate_urls_in_patch(sections: Dict[str, Dict[str, Any]]) -> None:
|
||||||
_validate_url_not_private(value, field_name=".".join(path))
|
_validate_url_not_private(value, field_name=".".join(path))
|


_LOCK_TTL_MIN = 60
_LOCK_TTL_MAX = 7 * 24 * 3600  # 604800 seconds (one week)


def _validate_materialize_section(sections: Dict[str, Dict[str, Any]]) -> None:
    """Validate the materialize section patch when present.

    Checks field-level constraints that the Pydantic envelope can't enforce
    (it only validates the outer shape, not nested leaf values).
    """
    mat = sections.get("materialize")
    if not isinstance(mat, dict):
        return
    ttl = mat.get("lock_ttl_seconds")
    if ttl is None:
        return
    if not isinstance(ttl, int) or isinstance(ttl, bool):
        raise HTTPException(
            status_code=422,
            detail="materialize.lock_ttl_seconds must be an integer",
        )
    if ttl < _LOCK_TTL_MIN or ttl > _LOCK_TTL_MAX:
        raise HTTPException(
            status_code=422,
            detail=(
                f"materialize.lock_ttl_seconds must be between "
                f"{_LOCK_TTL_MIN} and {_LOCK_TTL_MAX} "
                f"(got {ttl})"
            ),
        )


||||||
# --- Server-config (instance.yaml) editor -----------------------------------
|
# --- Server-config (instance.yaml) editor -----------------------------------
|
||||||
#
|
#
|
||||||
# The /admin/server-config UI POSTs a partial dict here keyed by section
|
# The /admin/server-config UI POSTs a partial dict here keyed by section
|
||||||
|
|
@ -175,6 +207,7 @@ _EDITABLE_SECTIONS: tuple[str, ...] = (
|
||||||
"openmetadata",
|
"openmetadata",
|
||||||
"desktop",
|
"desktop",
|
||||||
"corporate_memory",
|
"corporate_memory",
|
||||||
|
"materialize",
|
||||||
)
|
)
|
||||||
|
|
||||||
# "Danger-zone" sections — flipping these can lock operators out (auth.*) or
|
# "Danger-zone" sections — flipping these can lock operators out (auth.*) or
|
||||||
|
|
@ -585,6 +618,23 @@ _KNOWN_FIELDS: dict[str, dict[str, dict]] = {
|
||||||
),
|
),
|
||||||
},
|
},
|
||||||
},
|
},
|
    # materialize: file-lock TTL for the concurrent-materialize safety net.
    # A single field; more knobs may follow as the feature matures.
    "materialize": {
        "lock_ttl_seconds": {
            "kind": "int",
            "default": 86400,
            "hint": (
                "How long (seconds) before a stale materialize lock file is "
                "reclaimed. The lock is a .parquet.lock sibling file; if the "
                "holder process is hard-killed, the next attempt reclaims the "
                "lock once the file's mtime is older than this TTL. "
                "Default 86400 (24 h). Min 60, max 604800 (7 days). "
                "Lower only if you know materializes never exceed the new value "
                "and your host regularly hard-kills processes."
            ),
        },
    },
||||||
}
|
}
|
||||||
|
|
||||||
# Keys whose values must be redacted from the audit diff. We match
|
# Keys whose values must be redacted from the audit diff. We match
|
||||||
|
|
@ -913,6 +963,9 @@ async def update_server_config(
|
||||||
# the per-section patch (e.g. data_source.keboola.stack_url).
|
# the per-section patch (e.g. data_source.keboola.stack_url).
|
||||||
_validate_urls_in_patch(request.sections)
|
_validate_urls_in_patch(request.sections)
|
||||||
|
|
||||||
|
# Field-level constraints for sections whose values have documented ranges.
|
||||||
|
_validate_materialize_section(request.sections)
|
||||||
|
|
||||||
# Defense-in-depth: scrub redaction sentinels (`***` / `<empty>`) out of
|
# Defense-in-depth: scrub redaction sentinels (`***` / `<empty>`) out of
|
||||||
# secret-keyed leaves in the patch before they reach the deep-merge.
|
# secret-keyed leaves in the patch before they reach the deep-merge.
|
||||||
# The client form does the same scrub, but an API caller round-tripping
|
# The client form does the same scrub, but an API caller round-tripping
|
||||||
|
|
@ -1169,27 +1222,28 @@ class RegisterTableRequest(BaseModel):
|
||||||
@model_validator(mode="after")
|
@model_validator(mode="after")
|
||||||
def _check_mode_query_coherence(self):
|
def _check_mode_query_coherence(self):
|
||||||
"""Enforce query_mode ↔ source_query invariants up front so an admin
|
"""Enforce query_mode ↔ source_query invariants up front so an admin
|
||||||
can't persist a remote/local row carrying an orphan source_query, and
|
can't persist a remote/local row carrying an orphan source_query.
|
||||||
materialized rows can't be registered without a SQL body."""
|
|
||||||
|
For BigQuery materialized rows, an empty source_query is allowed here
|
||||||
|
because _validate_bigquery_register_payload generates it from
|
||||||
|
bucket+source_table after this validator runs. For all other source
|
||||||
|
types (e.g. Keboola), source_query is still required for materialized.
|
||||||
|
"""
|
||||||
sq = (self.source_query or "").strip() or None
|
sq = (self.source_query or "").strip() or None
|
||||||
if self.query_mode == "materialized" and not sq:
|
|
||||||
raise ValueError(
|
|
||||||
"query_mode='materialized' requires a non-empty source_query"
|
|
||||||
)
|
|
||||||
if self.query_mode != "materialized" and sq:
|
if self.query_mode != "materialized" and sq:
|
||||||
raise ValueError(
|
raise ValueError(
|
||||||
"source_query is only valid when query_mode='materialized'"
|
"source_query is only valid when query_mode='materialized'"
|
||||||
)
|
)
|
||||||
# The materialize path runs the SQL through DuckDB's parser (BigQuery
|
# Non-BQ materialized rows must supply source_query explicitly — there
|
||||||
# extension's COPY pushes it through DuckDB first, and the Keboola
|
# is no server-generate fallback for Keboola materialized.
|
||||||
# path COPYs the raw SQL through a DuckDB session too). DuckDB does
|
if self.query_mode == "materialized" and not sq and self.source_type != "bigquery":
|
||||||
# NOT understand BigQuery-native backtick identifiers — those parse-
|
raise ValueError(
|
||||||
# error or silently match no rows, leaving no parquet at the
|
"query_mode='materialized' requires a non-empty source_query"
|
||||||
# canonical path and no operator-visible failure. Reject at register
|
)
|
||||||
# time with an actionable message so the bad SQL never lands in
|
# Backtick guard stays for non-materialized rows (DuckDB-flavor SQL
|
||||||
# `table_registry.source_query`. See `_run_materialized_pass` for
|
# contract); materialized SQL is BigQuery-native and MUST allow
|
||||||
# the runtime path that would otherwise eat the error.
|
# backticks for dashed identifiers (e.g. `prj-org.dataset.table`).
|
||||||
if sq and "`" in sq:
|
if self.query_mode != "materialized" and sq and "`" in sq:
|
||||||
raise ValueError(_BACKTICK_REJECTION_MESSAGE)
|
raise ValueError(_BACKTICK_REJECTION_MESSAGE)
|
||||||
# Normalise: stash the trimmed-or-None form so the persisted column
|
# Normalise: stash the trimmed-or-None form so the persisted column
|
||||||
# never carries surrounding whitespace or empty-string sentinels.
|
# never carries surrounding whitespace or empty-string sentinels.
|
||||||
|
|
@ -1232,6 +1286,31 @@ class RegisterTableRequest(BaseModel):
|
||||||
return v
|
return v
|


def _generate_materialized_source_query(
    bucket: str, source_table: str, project_id: str,
) -> str:
    """Build the canonical full-table-dump source_query for a materialized
    BQ row when admin only supplies dataset + table. The result is
    BigQuery-native SQL, wrapped at materialize time into
    bigquery_query(...) by connectors.bigquery.extractor.materialize_query."""
    if not _is_safe_quoted_identifier(bucket):
        raise HTTPException(
            status_code=400,
            detail=f"bigquery: dataset {bucket!r} is unsafe",
        )
    if not _is_safe_quoted_identifier(source_table):
        raise HTTPException(
            status_code=400,
            detail=f"bigquery: source_table {source_table!r} is unsafe",
        )
    if not _is_safe_project_id(project_id):
        raise HTTPException(
            status_code=400,
            detail=f"bigquery: data_source.bigquery.project {project_id!r} is malformed",
        )
    return f"SELECT * FROM `{project_id}.{bucket}.{source_table}`"


||||||
def _validate_bigquery_register_payload(req: "RegisterTableRequest") -> None:
|
def _validate_bigquery_register_payload(req: "RegisterTableRequest") -> None:
|
||||||
"""Enforce BQ-specific shape on a register/precheck request.
|
"""Enforce BQ-specific shape on a register/precheck request.
|
||||||
|
|
||||||
|
|
@ -1253,13 +1332,8 @@ def _validate_bigquery_register_payload(req: "RegisterTableRequest") -> None:
|
||||||
"""
|
"""
|
||||||
if req.query_mode == "materialized":
|
if req.query_mode == "materialized":
|
||||||
# Materialized BQ rows: the SQL body replaces dataset+table refs.
|
# Materialized BQ rows: the SQL body replaces dataset+table refs.
|
||||||
# Pydantic model_validator already verified source_query is non-empty;
|
# source_query may be empty if admin supplied bucket+source_table —
|
||||||
# all we still need is a valid project_id and a safe view name.
|
# in that case the server generates a full-table-dump SQL below.
|
||||||
if not req.source_query or not req.source_query.strip():
|
|
||||||
raise HTTPException(
|
|
||||||
status_code=422,
|
|
||||||
detail="bigquery materialized: 'source_query' is required",
|
|
||||||
)
|
|
||||||
raw_name = req.name or ""
|
raw_name = req.name or ""
|
||||||
if raw_name.strip() != raw_name or not _is_safe_identifier(raw_name):
|
if raw_name.strip() != raw_name or not _is_safe_identifier(raw_name):
|
||||||
raise HTTPException(
|
raise HTTPException(
|
||||||
|
|
@ -1271,7 +1345,7 @@ def _validate_bigquery_register_payload(req: "RegisterTableRequest") -> None:
|
||||||
),
|
),
|
||||||
)
|
)
|
||||||
from app.instance_config import get_value
|
from app.instance_config import get_value
|
||||||
project_id = get_value("data_source", "bigquery", "project", default="")
|
project_id = get_value("data_source", "bigquery", "project", default="") or ""
|
||||||
if not project_id:
|
if not project_id:
|
||||||
raise HTTPException(
|
raise HTTPException(
|
||||||
status_code=400,
|
status_code=400,
|
||||||
|
|
@ -1290,6 +1364,24 @@ def _validate_bigquery_register_payload(req: "RegisterTableRequest") -> None:
|
||||||
"^[a-z][a-z0-9-]{4,28}[a-z0-9]$"
|
"^[a-z][a-z0-9-]{4,28}[a-z0-9]$"
|
||||||
),
|
),
|
||||||
)
|
)
|

        if not (req.source_query and req.source_query.strip()):
            # Server-generate from bucket+source_table. Trivial full-table
            # dump path; admin only sets dataset+table and the server
            # builds BQ-native SQL from instance.yaml's configured project.
            if not (req.bucket and req.source_table):
                raise HTTPException(
                    status_code=422,
                    detail=(
                        "bigquery materialized requires either source_query "
                        "(custom SQL) or bucket+source_table (server-generates "
                        "the full-table-dump SQL)"
                    ),
                )
            req.source_query = _generate_materialized_source_query(
                req.bucket, req.source_table, project_id,
            )

||||||
# Phase C: profile_after_sync is now inert (Pydantic field marked
|
# Phase C: profile_after_sync is now inert (Pydantic field marked
|
||||||
# deprecated; not read by app/api/sync.py:410-438). The runtime
|
# deprecated; not read by app/api/sync.py:410-438). The runtime
|
||||||
# profiles every synced table unconditionally, so we no longer
|
# profiles every synced table unconditionally, so we no longer
|
||||||
|
|
@ -2283,35 +2375,32 @@ async def update_table(
|
||||||
|
|
||||||
# Cross-source coherence: query_mode='materialized' requires a
|
# Cross-source coherence: query_mode='materialized' requires a
|
||||||
# non-empty source_query for ALL source types, not just BigQuery.
|
# non-empty source_query for ALL source types, not just BigQuery.
|
||||||
# Pre-fix, only the BQ-specific synthetic-RegisterTableRequest below
|
# BQ rows without source_query can be server-generated from
|
||||||
# caught this — Keboola materialized rows could be PUT without
|
# bucket+source_table (handled by _validate_bigquery_register_payload
|
||||||
# source_query and persisted with source_query=None, then crash at
|
# via the synthetic RegisterTableRequest below). Non-BQ rows (e.g.
|
||||||
# the next sync tick when kb_materialize_query received `sql=None`
|
# Keboola) still require an explicit source_query at PUT time.
|
||||||
# and DuckDB rejected `COPY (None) TO ...`. Devin finding 2026-05-01:
|
|
||||||
# BUG_pr-review-job-58ae3148_0001.
|
|
||||||
if merged.get("query_mode") == "materialized":
|
if merged.get("query_mode") == "materialized":
|
||||||
sq = merged.get("source_query")
|
sq = merged.get("source_query")
|
||||||
if not sq or not str(sq).strip():
|
if not sq or not str(sq).strip():
|
||||||
raise HTTPException(
|
# BQ rows: let _validate_bigquery_register_payload generate
|
||||||
status_code=422,
|
# source_query from bucket+source_table (falls through below).
|
||||||
detail=(
|
# Non-BQ rows: no server-generate fallback; raise 422.
|
||||||
"query_mode='materialized' requires a non-empty "
|
if merged.get("source_type") != "bigquery":
|
||||||
"source_query. To revert to a non-materialized mode, "
|
raise HTTPException(
|
||||||
"PATCH query_mode='local' (Keboola) or 'remote' "
|
status_code=422,
|
||||||
"(BigQuery) and the stale source_query is cleared "
|
detail=(
|
||||||
"automatically."
|
"query_mode='materialized' requires a non-empty "
|
||||||
),
|
"source_query. To revert to a non-materialized mode, "
|
||||||
)
|
"PATCH query_mode='local' (Keboola) or 'remote' "
|
||||||
# Backtick rejection on the merged record — see
|
"(BigQuery) and the stale source_query is cleared "
|
||||||
# `_BACKTICK_REJECTION_MESSAGE` for the rationale. Catches PATCHes
|
"automatically."
|
||||||
# that flip `source_query` to a backtick form on an already-
|
),
|
||||||
# materialized row, which the synthetic-RegisterTableRequest below
|
)
|
||||||
# only re-validates for BQ rows. Apply uniformly so Keboola
|
# Backtick guard removed for materialized rows: the Task 2 wrapping
|
||||||
# materialized rows can't carry one either.
|
# path (connectors.bigquery.extractor.materialize_query) now runs
|
||||||
if "`" in str(sq):
|
# admin SQL through the BQ jobs API using BQ-native syntax, which
|
||||||
raise HTTPException(
|
# requires backticks for dashed project/dataset identifiers.
|
||||||
status_code=422, detail=_BACKTICK_REJECTION_MESSAGE,
|
# Non-materialized rows still reject backticks in the model validator.
|
||||||
)
|
|
||||||
|
|
||||||
if merged.get("source_type") == "bigquery":
|
if merged.get("source_type") == "bigquery":
|
||||||
# Reuse the register-time validator. It mutates the request to
|
# Reuse the register-time validator. It mutates the request to
|
||||||
|
|
|
||||||
|
|
@ -20,7 +20,7 @@ from src.repositories.sync_state import SyncStateRepository
|
||||||
from src.repositories.sync_settings import SyncSettingsRepository
|
from src.repositories.sync_settings import SyncSettingsRepository
|
||||||
from src.repositories.table_registry import TableRegistryRepository
|
from src.repositories.table_registry import TableRegistryRepository
|
||||||
from src.rbac import can_access_table
|
from src.rbac import can_access_table
|
||||||
from src.scheduler import filter_due_tables
|
from src.scheduler import filter_due_tables, is_table_due
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
router = APIRouter(prefix="/api/sync", tags=["sync"])
|
router = APIRouter(prefix="/api/sync", tags=["sync"])
|
||||||
|
|
@ -74,9 +74,8 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
|
||||||
its structured fields so operator alerting can pick out the cap-vs-actual
|
its structured fields so operator alerting can pick out the cap-vs-actual
|
||||||
bytes from the log line.
|
bytes from the log line.
|
||||||
"""
|
"""
|
||||||
from src.scheduler import is_table_due
|
|
||||||
from app.instance_config import get_value
|
from app.instance_config import get_value
|
||||||
from connectors.bigquery.extractor import MaterializeBudgetError
|
from connectors.bigquery.extractor import MaterializeBudgetError, MaterializeInFlightError
|
||||||
|
|
||||||
bq_output_dir = str(Path(_get_data_dir()) / "extracts" / "bigquery")
|
bq_output_dir = str(Path(_get_data_dir()) / "extracts" / "bigquery")
|
||||||
kb_output_dir = Path(_get_data_dir()) / "extracts" / "keboola" / "data"
|
kb_output_dir = Path(_get_data_dir()) / "extracts" / "keboola" / "data"
|
||||||
|
|
@ -125,7 +124,7 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
|
||||||
last_iso = last.isoformat() if last else None
|
last_iso = last.isoformat() if last else None
|
||||||
schedule = row.get("sync_schedule") or "every 1h"
|
schedule = row.get("sync_schedule") or "every 1h"
|
||||||
if not is_table_due(schedule, last_iso):
|
if not is_table_due(schedule, last_iso):
|
||||||
summary["skipped"].append(ref_name)
|
summary["skipped"].append({"table": ref_name, "reason": "due_check"})
|
||||||
continue
|
continue
|
||||||
|
|
||||||
source_type = row.get("source_type") or "bigquery" # legacy default
|
source_type = row.get("source_type") or "bigquery" # legacy default
|
||||||
|
|
@ -195,6 +194,13 @@ def _run_materialized_pass(conn: duckdb.DuckDBPyConnection, bq) -> dict:
|
||||||
),
|
),
|
||||||
})
|
})
|
||||||
continue
|
continue
|
||||||
|
except MaterializeInFlightError:
|
||||||
|
# In-flight on a sibling worker / scheduler tick — treat as
|
||||||
|
# 'skipped, in-flight'. Do NOT call state.set_error: that
|
||||||
|
# would flip status='error' on a healthy concurrent run and
|
||||||
|
# the registry UI would surface a false-positive failure.
|
||||||
|
summary["skipped"].append({"table": ref_name, "reason": "in_flight"})
|
||||||
|
continue
|
||||||
except MaterializeBudgetError as e:
|
except MaterializeBudgetError as e:
|
||||||
logger.warning(
|
logger.warning(
|
||||||
"Materialize cap exceeded for %s: %s bytes > %s bytes",
|
"Materialize cap exceeded for %s: %s bytes > %s bytes",
|
||||||
|
|
@ -466,9 +472,13 @@ sys.exit(compute_exit_code(result, len(configs)))
|
||||||
mat_summary = _run_materialized_pass(mat_conn, bq_access)
|
mat_summary = _run_materialized_pass(mat_conn, bq_access)
|
||||||
finally:
|
finally:
|
||||||
mat_conn.close()
|
mat_conn.close()
|
||||||
|
skipped_count = len(mat_summary["skipped"])
|
||||||
|
in_flight_count = sum(
|
||||||
|
1 for s in mat_summary["skipped"] if s.get("reason") == "in_flight"
|
||||||
|
)
|
||||||
print(
|
print(
|
||||||
f"[SYNC] Materialized SQL: {len(mat_summary['materialized'])} ok, "
|
f"[SYNC] Materialized SQL: {len(mat_summary['materialized'])} ok, "
|
||||||
f"{len(mat_summary['skipped'])} skipped, "
|
f"{skipped_count} skipped (in_flight={in_flight_count}), "
|
||||||
f"{len(mat_summary['errors'])} errors",
|
f"{len(mat_summary['errors'])} errors",
|
||||||
file=_sys.stderr, flush=True,
|
file=_sys.stderr, flush=True,
|
||||||
)
|
)
|
||||||
|
|
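The scheduler hunk above changes `summary["skipped"]` from a list of table names to a list of `{"table", "reason"}` dicts and reports in-flight skips separately. A minimal sketch of that bookkeeping, assuming the dict shape shown in the diff; `summarize_skips` is a hypothetical helper name, not part of the commit:

```python
def summarize_skips(skipped: list[dict]) -> tuple[int, int]:
    """Return (total skipped, skipped because another run held the lock)."""
    in_flight = sum(1 for s in skipped if s.get("reason") == "in_flight")
    return len(skipped), in_flight


# Illustrative summary in the shape the diff builds; table names are made up.
summary = {
    "materialized": ["orders"],
    "skipped": [
        {"table": "daily_rollup", "reason": "due_check"},
        {"table": "big_view", "reason": "in_flight"},
    ],
    "errors": [],
}

total, in_flight = summarize_skips(summary["skipped"])
line = (
    f"[SYNC] Materialized SQL: {len(summary['materialized'])} ok, "
    f"{total} skipped (in_flight={in_flight}), "
    f"{len(summary['errors'])} errors"
)
print(line)
```

Keeping the reason on each entry lets the log line distinguish a healthy overlap from a table that simply was not due, without touching the error list.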
|
||||||
|
|
@ -218,6 +218,10 @@ const SECTION_META = {
|
||||||
title: "Corporate Memory",
|
title: "Corporate Memory",
|
||||||
help: "Optional governance for AI-extracted knowledge. When the section is unset, the system runs in legacy democratic-wiki mode with no admin review.",
|
help: "Optional governance for AI-extracted knowledge. When the section is unset, the system runs in legacy democratic-wiki mode with no admin review.",
|
||||||
},
|
},
|
||||||
|
materialize: {
|
||||||
|
title: "Materialize",
|
||||||
|
help: "Concurrency safety net for the materialize path. Controls the file-lock TTL used to detect and reclaim stale locks from hard-killed processes.",
|
||||||
|
},
|
||||||
};
|
};
|
||||||
const DANGER_SECTIONS = new Set(["auth", "server"]);
|
const DANGER_SECTIONS = new Set(["auth", "server"]);
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -403,3 +403,19 @@ catalog:
|
||||||
# schema_cache_ttl_seconds: 3600 # /api/v2/schema/{table_id} cache lifetime (default: 1 h)
|
# schema_cache_ttl_seconds: 3600 # /api/v2/schema/{table_id} cache lifetime (default: 1 h)
|
||||||
# sample_cache_ttl_seconds: 3600 # /api/v2/sample/{table_id} cache lifetime (default: 1 h)
|
# sample_cache_ttl_seconds: 3600 # /api/v2/sample/{table_id} cache lifetime (default: 1 h)
|
||||||
# # Admins can force-refresh via POST /api/v2/sample/{id}?refresh=true
|
# # Admins can force-refresh via POST /api/v2/sample/{id}?refresh=true
|
||||||
|
|
||||||
|
# --- Materialize concurrency safety (optional) ---
|
||||||
|
# Concurrency safety net for the materialize path (BQ + Keboola). When
|
||||||
|
# two materialize attempts race for the same table_id, the second one
|
||||||
|
# raises MaterializeInFlightError and skips. The lock is held in a
|
||||||
|
# .parquet.lock sibling file; if a holder process is hard-killed before
|
||||||
|
# kernel-level flock release, the next attempt reclaims the lock once
|
||||||
|
# the file's mtime is older than this TTL.
|
||||||
|
#
|
||||||
|
# Default 86400 (24h) is generous on purpose — anything shorter risks
|
||||||
|
# a long-running COPY being interrupted by its own scheduler successor.
|
||||||
|
# Lower it only if you know your materialize never exceeds the new
|
||||||
|
# value AND your host has a habit of hard-killing processes.
|
||||||
|
# Min 60 (1 minute), max 604800 (7 days). Configurable via /admin/server-config UI.
|
||||||
|
materialize:
|
||||||
|
lock_ttl_seconds: 86400
|
||||||
|
|
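The config comment above documents a default of 86400 seconds with a minimum of 60 and a maximum of 604800, and says a lock is reclaimed once its mtime is older than the TTL. The sketch below shows one way those bounds and the staleness test could be expressed. `validate_lock_ttl` and `is_stale` are hypothetical names, and rejecting out-of-range values with `ValueError` is an assumption; the diff does not show where the bounds are actually enforced.

```python
LOCK_TTL_DEFAULT = 86400   # 24 h, the documented default
LOCK_TTL_MIN = 60          # 1 minute, the documented minimum
LOCK_TTL_MAX = 604800      # 7 days, the documented maximum


def validate_lock_ttl(raw) -> int:
    """Hypothetical validator: missing or unparsable values fall back to
    the default; parsable values outside the documented range are rejected."""
    try:
        n = int(raw)
    except (TypeError, ValueError):
        return LOCK_TTL_DEFAULT
    if not LOCK_TTL_MIN <= n <= LOCK_TTL_MAX:
        raise ValueError(
            f"lock_ttl_seconds must be in [{LOCK_TTL_MIN}, {LOCK_TTL_MAX}], got {n}"
        )
    return n


def is_stale(lock_mtime: float, now: float, ttl_seconds: int) -> bool:
    """A lock file counts as stale only when its age strictly exceeds the TTL."""
    return (now - lock_mtime) > ttl_seconds


print(validate_lock_ttl("3600"))
print(is_stale(lock_mtime=0.0, now=86401.0, ttl_seconds=LOCK_TTL_DEFAULT))
```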
|
||||||
|
|
@ -3,16 +3,29 @@
|
||||||
No data is downloaded. All queries go directly to BigQuery via DuckDB extension ATTACH.
|
No data is downloaded. All queries go directly to BigQuery via DuckDB extension ATTACH.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
|
import fcntl
|
||||||
|
import hashlib
|
||||||
import logging
|
import logging
|
||||||
import os
|
import os
|
||||||
|
import re
|
||||||
import shutil
|
import shutil
|
||||||
import threading
|
import threading
|
||||||
|
import time
|
||||||
from datetime import datetime, timezone
|
from datetime import datetime, timezone
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from typing import List, Dict, Any, Optional
|
from typing import List, Dict, Any, Optional
|
||||||
|
|
||||||
import duckdb
|
import duckdb
|
||||||
|
|
||||||
|
from connectors.bigquery.auth import get_metadata_token, BQMetadataAuthError
|
||||||
|
from src.sql_safe import (
|
||||||
|
validate_identifier as _validate_identifier,
|
||||||
|
validate_project_id as _validate_project_id,
|
||||||
|
)
|
||||||
|
from src.identifier_validation import validate_identifier, validate_quoted_identifier
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
# Serializes the body of `init_extract` across threads so two concurrent
|
# Serializes the body of `init_extract` across threads so two concurrent
|
||||||
# materialize calls (e.g. the synchronous timeout-fallback BackgroundTask
|
# materialize calls (e.g. the synchronous timeout-fallback BackgroundTask
|
||||||
# kicking in while the original daemon thread is still running) can't both
|
# kicking in while the original daemon thread is still running) can't both
|
||||||
|
|
@ -21,15 +34,127 @@ import duckdb
|
||||||
# not the per-source extract-file write, so we need a dedicated lock here.
|
# not the per-source extract-file write, so we need a dedicated lock here.
|
||||||
_INIT_EXTRACT_LOCK = threading.Lock()
|
_INIT_EXTRACT_LOCK = threading.Lock()
|
||||||
|
|
||||||
from connectors.bigquery.auth import get_metadata_token, BQMetadataAuthError
|
_LOCK_TTL_DEFAULT_SECONDS: int = 86400 # 24h — overridable via materialize.lock_ttl_seconds
|
||||||
from app.instance_config import get_value
|
|
||||||
from src.sql_safe import (
|
|
||||||
validate_identifier as _validate_identifier,
|
|
||||||
validate_project_id as _validate_project_id,
|
|
||||||
)
|
|
||||||
from src.identifier_validation import validate_identifier, validate_quoted_identifier
|
|
||||||
|
|
||||||
logger = logging.getLogger(__name__)
|
|
||||||
|
class MaterializeInFlightError(Exception):
|
||||||
|
"""Raised when a per-table_id materialize is already running.
|
||||||
|
|
||||||
|
Caller (`_run_materialized_pass`) should treat this as a 'skipped,
|
||||||
|
in-flight' outcome — the in-flight worker will finish and write
|
||||||
|
sync_state on its own. Critically, this is NOT an error condition;
|
||||||
|
`state.set_error` MUST NOT be called for this exception or the
|
||||||
|
registry would surface a false-positive failure to the operator
|
||||||
|
on every overlap."""
|
||||||
|
|
||||||
|
def __init__(self, table_id: str, layer: str = "process"):
|
||||||
|
self.table_id = table_id
|
||||||
|
self.layer = layer
|
||||||
|
super().__init__(
|
||||||
|
f"materialize for {table_id!r} already in flight ({layer} lock held)"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
# Unbounded by design — each registered table_id gets one Lock for the
|
||||||
|
# process lifetime. Per-Lock cost is ~56 bytes; a deployment with even
|
||||||
|
# 10k registered tables holds <1 MB. No cleanup logic: cleanup would
|
||||||
|
# need ref-counting and risks freeing a Lock currently held by a worker.
|
||||||
|
_table_locks: dict[str, threading.Lock] = {}
|
||||||
|
_table_locks_registry: threading.Lock = threading.Lock()
|
||||||
|
|
||||||
|
|
||||||
|
def _get_table_lock(table_id: str) -> threading.Lock:
|
||||||
|
"""Return the process-wide mutex for a given table_id, creating it
|
||||||
|
on first reference. The registry mutex serializes the dict mutation
|
||||||
|
only — once the per-id Lock is returned, contention between callers
|
||||||
|
happens on that lock alone."""
|
||||||
|
with _table_locks_registry:
|
||||||
|
lock = _table_locks.get(table_id)
|
||||||
|
if lock is None:
|
||||||
|
lock = threading.Lock()
|
||||||
|
_table_locks[table_id] = lock
|
||||||
|
return lock
|
||||||
|
|
||||||
|
|
||||||
|
def _get_lock_ttl_seconds() -> int:
|
||||||
|
"""Read the configured stale-lock TTL with fallback to the default.
|
||||||
|
|
||||||
|
Operator override lives at instance.yaml `materialize.lock_ttl_seconds`
|
||||||
|
(also editable via /admin/server-config). Default 86400 s = 24 h
|
||||||
|
matches the upper bound of any healthy BQ COPY in practice — anything
|
||||||
|
longer is a stuck process or a hung BQ session, both of which warrant
|
||||||
|
reclaim on next attempt."""
|
||||||
|
try:
|
||||||
|
# Deferred import: keeps the connectors module importable in
|
||||||
|
# contexts where the app layer isn't bootstrapped (e.g. unit tests
|
||||||
|
# that exercise extractor helpers without the FastAPI app).
|
||||||
|
from app.instance_config import get_value
|
||||||
|
v = get_value(
|
||||||
|
"materialize", "lock_ttl_seconds",
|
||||||
|
default=_LOCK_TTL_DEFAULT_SECONDS,
|
||||||
|
)
|
||||||
|
n = int(v) if v is not None else _LOCK_TTL_DEFAULT_SECONDS
|
||||||
|
return n if n > 0 else _LOCK_TTL_DEFAULT_SECONDS
|
||||||
|
except Exception:
|
||||||
|
return _LOCK_TTL_DEFAULT_SECONDS
|
||||||
|
|
||||||
|
|
||||||
|
def _try_acquire_file_lock(lock_path: Path):
|
||||||
|
"""Try to acquire an advisory exclusive flock on `lock_path`. Returns
|
||||||
|
the open file object on success (caller must close to release); None
|
||||||
|
on conflict.
|
||||||
|
|
||||||
|
Stale-lock reclaim: if the lock_path exists and its mtime is older
|
||||||
|
than the configured TTL, log a warning and unlink before retrying.
|
||||||
|
A live holder still wins the second flock attempt (kernel-level
|
||||||
|
flock isn't tied to mtime), so the reclaim doesn't break correctness
|
||||||
|
— it just unblocks the case where a holder process was hard-killed
|
||||||
|
before the kernel released the lock."""
|
||||||
|
lock_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
def _try_open_and_flock():
|
||||||
|
# Open in 'w' mode so the file's mtime updates on every successful
|
||||||
|
# acquisition — the mtime is the TTL signal for the next caller.
|
||||||
|
# Content is intentionally empty; the fd exists only to anchor flock.
|
||||||
|
f = open(lock_path, "w")
|
||||||
|
try:
|
||||||
|
fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
|
||||||
|
return f
|
||||||
|
except BlockingIOError:
|
||||||
|
# Another holder owns the lock — return None so the caller can
|
||||||
|
# decide between TTL-reclaim and propagating MaterializeInFlightError.
|
||||||
|
f.close()
|
||||||
|
return None
|
||||||
|
except OSError:
|
||||||
|
# Anything else (read-only fs, unsupported, fd exhaustion) is a
|
||||||
|
# platform / config error, not a contention signal. Close the fd
|
||||||
|
# and re-raise so the caller (and operator) sees the real failure
|
||||||
|
# instead of a silent leak.
|
||||||
|
f.close()
|
||||||
|
raise
|
||||||
|
|
||||||
|
holder = _try_open_and_flock()
|
||||||
|
if holder is not None:
|
||||||
|
return holder
|
||||||
|
|
||||||
|
# Conflict. If the file is older than TTL, reclaim and retry once.
|
||||||
|
try:
|
||||||
|
age = time.time() - lock_path.stat().st_mtime
|
||||||
|
except FileNotFoundError:
|
||||||
|
return _try_open_and_flock()
|
||||||
|
|
||||||
|
if age > _get_lock_ttl_seconds():
|
||||||
|
logger.warning(
|
||||||
|
"Reclaiming stale materialize lock at %s (age %.1fs > TTL)",
|
||||||
|
lock_path, age,
|
||||||
|
)
|
||||||
|
try:
|
||||||
|
lock_path.unlink()
|
||||||
|
except FileNotFoundError:
|
||||||
|
pass
|
||||||
|
return _try_open_and_flock()
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
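The helper above relies on a non-blocking exclusive `flock`: a second open of the same lock file cannot take the lock while the first holder keeps its file object open. A minimal standalone sketch of that behaviour, assuming a Unix host (`fcntl` is not available on Windows); `try_flock` and the file name are illustrative, and the sketch leaves out the TTL reclaim step:

```python
import fcntl
import os
import tempfile


def try_flock(path: str):
    """Return an open file holding an exclusive advisory lock, or None
    when another open file description already holds it."""
    f = open(path, "a")  # append mode: creates the file, never truncates it
    try:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()        # contention: somebody else owns the lock
        return None
    except OSError:
        f.close()        # platform or filesystem problem: surface it
        raise
    return f


with tempfile.TemporaryDirectory() as d:
    lock_path = os.path.join(d, "orders.parquet.lock")

    first = try_flock(lock_path)
    second = try_flock(lock_path)   # separate open(), so it conflicts
    first_ok = first is not None
    second_blocked = second is None

    first.close()                   # closing the file releases the flock
    third = try_flock(lock_path)
    third_ok = third is not None
    third.close()

print(first_ok, second_blocked, third_ok)
```

Because the lock belongs to the open file description rather than to the path, closing the file object is what releases it; the lock file itself can stay on disk.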
||||||
def _detect_table_type(
|
def _detect_table_type(
|
||||||
|
|
@ -59,6 +184,56 @@ def _detect_table_type(
|
||||||
return row[0] if row else None
|
return row[0] if row else None
|
||||||
|
|
||||||
|
|
||||||
|
_BILLING_PROJECT_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")
|
||||||
|
|
||||||
|
|
||||||
|
def _escape_sql_string_literal(s: str) -> str:
|
||||||
|
"""Double every single quote so the result is safe to embed inside a
|
||||||
|
single-quoted SQL string literal. DuckDB and BigQuery both honor the
|
||||||
|
SQL standard `''` escape inside `'...'`. Used to wrap admin
|
||||||
|
source_query into bigquery_query()'s second arg without breaking
|
||||||
|
the literal envelope."""
|
||||||
|
return s.replace("'", "''")
|
||||||
|
|
||||||
|
|
||||||
|
def _wrap_admin_sql_for_jobs_api(billing_project: str, inner_sql: str) -> str:
|
||||||
|
"""Build the COPY-source SQL that runs admin's `inner_sql` through
|
||||||
|
the BigQuery jobs API via the DuckDB BQ extension's
|
||||||
|
``bigquery_query()`` table function.
|
||||||
|
|
||||||
|
Why: the default `bq."ds"."t"` reference path uses the BQ Storage
|
||||||
|
Read API which rejects non-base entities (views, materialized views).
|
||||||
|
Routing through `bigquery_query()` uses the jobs API which accepts
|
||||||
|
every entity type uniformly.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
billing_project: GCP project ID that bills the BQ job. Must
|
||||||
|
match the GCP project_id grammar — anything else is rejected
|
||||||
|
as a defense-in-depth check (admin is trusted, but a typo
|
||||||
|
should fail closed not silently lose budget to the wrong
|
||||||
|
project).
|
||||||
|
inner_sql: BigQuery-flavor SQL the admin registered as
|
||||||
|
``source_query``. Should be BigQuery-native; DuckDB-flavor
|
||||||
|
`bq."ds"."t"` references are not rejected here but will fail at
|
||||||
|
COPY time inside the BQ jobs API. Existing rows are converted by
|
||||||
|
the v24 schema migration; new rows are validated upstream at
|
||||||
|
register/PUT.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
A DuckDB-parseable SQL fragment suitable as the operand of
|
||||||
|
``COPY (...) TO 'path' (FORMAT PARQUET)``.
|
||||||
|
"""
|
||||||
|
if not _BILLING_PROJECT_RE.match(billing_project):
|
||||||
|
raise ValueError(
|
||||||
|
f"billing_project {billing_project!r} is not a valid GCP project_id "
|
||||||
|
"(grammar: ^[a-z][a-z0-9-]{4,28}[a-z0-9]$)"
|
||||||
|
)
|
||||||
|
return (
|
||||||
|
f"SELECT * FROM bigquery_query('{billing_project}', "
|
||||||
|
f"'{_escape_sql_string_literal(inner_sql)}')"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
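To make the wrapping concrete, here is a standalone sketch of the two helpers above applied to a query that contains a quoted string. It mirrors the regex and the `''` escaping shown in the diff; the function names `escape_literal` and `wrap_for_jobs_api`, the project ID, and the table name are made up for illustration.

```python
import re

# Same grammar as the diff's _BILLING_PROJECT_RE: 6 to 30 characters,
# lowercase letter first, no trailing hyphen.
PROJECT_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def escape_literal(s: str) -> str:
    """Double single quotes so the text can sit inside a '...' SQL literal."""
    return s.replace("'", "''")


def wrap_for_jobs_api(billing_project: str, inner_sql: str) -> str:
    """Wrap BigQuery-native SQL in a bigquery_query() table-function call."""
    if not PROJECT_RE.match(billing_project):
        raise ValueError(f"invalid GCP project_id: {billing_project!r}")
    return (
        f"SELECT * FROM bigquery_query('{billing_project}', "
        f"'{escape_literal(inner_sql)}')"
    )


inner = "SELECT id FROM `my-project.sales.orders` WHERE region = 'EU'"
wrapped = wrap_for_jobs_api("my-project", inner)
print(wrapped)
```

The quotes around `EU` come out doubled, so the inner statement stays a single string argument and cannot close the literal early.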
||||||
def _create_meta_table(conn: duckdb.DuckDBPyConnection) -> None:
|
def _create_meta_table(conn: duckdb.DuckDBPyConnection) -> None:
|
||||||
"""Create the _meta table required by the extract.duckdb contract."""
|
"""Create the _meta table required by the extract.duckdb contract."""
|
||||||
conn.execute("DROP TABLE IF EXISTS _meta")
|
conn.execute("DROP TABLE IF EXISTS _meta")
|
||||||
|
|
@ -321,33 +496,42 @@ def materialize_query(
|
||||||
to `<output_dir>/data/<table_id>.parquet` atomically.
|
to `<output_dir>/data/<table_id>.parquet` atomically.
|
||||||
|
|
||||||
Designed for `query_mode='materialized'` table_registry rows. The SQL
|
Designed for `query_mode='materialized'` table_registry rows. The SQL
|
||||||
is admin-registered (validated upstream) and may reference DuckDB
|
is admin-registered BQ-native SQL (DuckDB-flavor `bq."ds"."t"` refs are
|
||||||
three-part identifiers (`bq."dataset"."table"`) resolved by the
|
validated upstream). The SQL is wrapped in `bigquery_query('<billing>',
|
||||||
in-session ATTACH, OR native BQ identifiers via the `bigquery_query()`
|
'<inner>')` before the COPY so the BQ extension routes through the BQ
|
||||||
table function — both work because the session has the bigquery
|
jobs API — the default Storage Read API path rejects non-base entities
|
||||||
extension loaded with a SECRET token.
|
(views, materialized views) with "non-table entities cannot be read with
|
||||||
|
the storage API". Routing through `bigquery_query()` works uniformly for
|
||||||
|
base tables and views alike.
|
||||||
|
|
||||||
Cost guardrail: when `max_bytes` is a positive int, run a BQ dry-run
|
Cost guardrail: when `max_bytes` is a positive int, run a BQ dry-run
|
||||||
via `bq.client()` first; raise `MaterializeBudgetError` if the
|
via `bq.client()` first; raise `MaterializeBudgetError` if the
|
||||||
estimate exceeds the cap. `max_bytes=None` or `max_bytes <= 0`
|
estimate exceeds the cap. `max_bytes=None` or `max_bytes <= 0`
|
||||||
disables the guardrail (config sentinel, see
|
disables the guardrail (config sentinel, see
|
||||||
`data_source.bigquery.max_bytes_per_materialize`).
|
`data_source.bigquery.max_bytes_per_materialize`). The dry-run operates
|
||||||
|
on the inner `sql` (BQ-native), not the wrapped form.
|
||||||
|
|
||||||
Dry-run is best-effort and fail-open: if the SQL uses DuckDB syntax
|
Dry-run is best-effort and fail-open: if the dry-run errors (transient
|
||||||
that the native BQ client can't parse (e.g. `bq."ds"."t"`), the
|
upstream failure, missing google lib), we log a warning and proceed
|
||||||
dry-run raises and we log a warning; the COPY still runs. This
|
with the wrapped COPY.
|
||||||
matches the BqAccess facade's "client is for native BQ SQL only"
|
|
||||||
contract — operators who need the cap to engage write the registered
|
|
||||||
SQL using native BQ identifiers (`\\`project.ds.t\\``).
|
|
||||||
|
|
||||||
Atomic write: result lands in `<id>.parquet.tmp` first, then
|
Atomic write: result lands in `<id>.parquet.tmp` first, then
|
||||||
`os.replace` swaps it in. A failed COPY leaves no partial file behind.
|
`os.replace` swaps it in. A failed COPY leaves no partial file behind.
|
||||||
|
|
||||||
|
Concurrency: per-``table_id`` in-process mutex + advisory file lock
|
||||||
|
on ``<table_id>.parquet.lock``. Overlapping calls for the same id
|
||||||
|
raise ``MaterializeInFlightError`` immediately so the caller can
|
||||||
|
skip cleanly without consuming the COPY budget twice. Stale file
|
||||||
|
locks (mtime > ``materialize.lock_ttl_seconds``, default 24 h) are
|
||||||
|
reclaimed automatically.
|
||||||
|
|
||||||
Args:
|
Args:
|
||||||
table_id: Logical id from table_registry; becomes the parquet
|
table_id: Logical id from table_registry; becomes the parquet
|
||||||
filename. Must pass `validate_identifier()` so it can't
|
filename. Must pass `validate_identifier()` so it can't
|
||||||
inject path traversal.
|
inject path traversal.
|
||||||
sql: SELECT statement, no trailing semicolon.
|
sql: BQ-native SELECT statement, no trailing semicolon. Wrapped
|
||||||
|
in `bigquery_query()` before the COPY — must not itself
|
||||||
|
contain a `bigquery_query()` call.
|
||||||
bq: A `BqAccess` instance — provides `duckdb_session()` for the
|
bq: A `BqAccess` instance — provides `duckdb_session()` for the
|
||||||
COPY and `client()` for the dry-run.
|
COPY and `client()` for the dry-run.
|
||||||
output_dir: Connector root, e.g. `/data/extracts/bigquery`.
|
output_dir: Connector root, e.g. `/data/extracts/bigquery`.
|
||||||
|
|
@ -358,7 +542,10 @@ def materialize_query(
|
||||||
{"rows": int, "size_bytes": int, "query_mode": "materialized"}
|
{"rows": int, "size_bytes": int, "query_mode": "materialized"}
|
||||||
|
|
||||||
Raises:
|
Raises:
|
||||||
ValueError: if `table_id` is unsafe.
|
ValueError: if `table_id` is unsafe or `bq.projects.billing` fails
|
||||||
|
the GCP project_id grammar check.
|
||||||
|
MaterializeInFlightError: if a concurrent call for the same table_id
|
||||||
|
is already in progress (in-process or cross-process).
|
||||||
MaterializeBudgetError: if `max_bytes > 0` and dry-run estimate exceeds it.
|
MaterializeBudgetError: if `max_bytes > 0` and dry-run estimate exceeds it.
|
||||||
BqAccessError: from `bq.duckdb_session()` (auth_failed / bq_lib_missing /
|
BqAccessError: from `bq.duckdb_session()` (auth_failed / bq_lib_missing /
|
||||||
not_configured) — caller catches and aggregates into the trigger
|
not_configured) — caller catches and aggregates into the trigger
|
||||||
|
|
@ -374,99 +561,114 @@ def materialize_query(
|
||||||
|
|
||||||
parquet_path = data_dir / f"{table_id}.parquet"
|
parquet_path = data_dir / f"{table_id}.parquet"
|
||||||
tmp_path = data_dir / f"{table_id}.parquet.tmp"
|
tmp_path = data_dir / f"{table_id}.parquet.tmp"
|
||||||
if tmp_path.exists():
|
lock_path = data_dir / f"{table_id}.parquet.lock"
|
||||||
tmp_path.unlink()
|
|
||||||
|
|
||||||
# Cost guardrail (best-effort — fail-open if dry-run can't parse the SQL).
|
proc_lock = _get_table_lock(table_id)
|
||||||
if max_bytes is not None and max_bytes > 0:
|
if not proc_lock.acquire(blocking=False):
|
||||||
|
raise MaterializeInFlightError(table_id, layer="process")
|
||||||
|
try:
|
||||||
|
file_lock = _try_acquire_file_lock(lock_path)
|
||||||
|
if file_lock is None:
|
||||||
|
raise MaterializeInFlightError(table_id, layer="file")
|
||||||
try:
|
try:
|
||||||
from app.api.v2_scan import _bq_dry_run_bytes # reuse main's impl
|
|
||||||
estimated = _bq_dry_run_bytes(bq, sql)
|
|
||||||
except Exception as e:
|
|
||||||
logger.warning(
|
|
||||||
"BQ dry-run failed for materialize cost guardrail (fail-open): %s. "
|
|
||||||
"If the SQL uses DuckDB three-part names like bq.\"ds\".\"t\", "
|
|
||||||
"rewrite to native BQ identifiers (`project.ds.t`) for the "
|
|
||||||
"guardrail to engage. Proceeding with COPY.",
|
|
||||||
e,
|
|
||||||
)
|
|
||||||
estimated = 0
|
|
||||||
if estimated > max_bytes:
|
|
||||||
raise MaterializeBudgetError(
|
|
||||||
f"dry-run estimate {estimated:,} bytes exceeds cap "
|
|
||||||
f"{max_bytes:,} for table {table_id!r}",
|
|
||||||
table_id=table_id,
|
|
||||||
current=estimated,
|
|
||||||
limit=max_bytes,
|
|
||||||
)
|
|
||||||
|
|
||||||
# COPY through a BqAccess-managed session.
|
|
||||||
with bq.duckdb_session() as conn:
|
|
||||||
# ATTACH the data project — but only when no `bq` catalog is
|
|
||||||
# already attached. Production sessions (real BqAccess) come with
|
|
||||||
# only `:memory:` and need the ATTACH; test sessions pre-populate
|
|
||||||
# `bq` as a fixture catalog and would error on a redundant ATTACH
|
|
||||||
# (alias already in use) AND on the bigquery extension load when
|
|
||||||
# the test runner has no cached extension. Detecting via
|
|
||||||
# `duckdb_databases()` keeps the ATTACH path idempotent without
|
|
||||||
# swallowing real errors (auth, cross-project permission,
|
|
||||||
# malformed project_id) — those still propagate from the actual
|
|
||||||
# ATTACH call.
|
|
||||||
attached = {
|
|
||||||
r[0] for r in conn.execute(
|
|
||||||
"SELECT database_name FROM duckdb_databases()"
|
|
||||||
).fetchall()
|
|
||||||
}
|
|
||||||
if "bq" not in attached:
|
|
||||||
conn.execute(
|
|
||||||
f"ATTACH 'project={bq.projects.data}' AS bq (TYPE bigquery, READ_ONLY)"
|
|
||||||
)
|
|
||||||
|
|
||||||
try:
|
|
||||||
safe_path = str(tmp_path).replace("'", "''")
|
|
||||||
conn.execute(f"COPY ({sql}) TO '{safe_path}' (FORMAT PARQUET)")
|
|
||||||
rows = conn.execute(
|
|
||||||
f"SELECT count(*) FROM read_parquet('{safe_path}')"
|
|
||||||
).fetchone()[0]
|
|
||||||
except Exception:
|
|
||||||
if tmp_path.exists():
|
if tmp_path.exists():
|
||||||
tmp_path.unlink()
|
tmp_path.unlink()
|
||||||
raise
|
|
||||||
|
|
||||||
# Compute the parquet hash inline before the atomic swap. The caller used
|
# Build the wrapped SQL once — both the cost guardrail dry-run and
|
||||||
# to re-read the file in `_run_materialized_pass` to hash it via
|
# the COPY operate on `sql` (the inner BQ SQL); only the COPY needs
|
||||||
# `_file_hash`, but that's a synchronous full-read on the FastAPI worker
|
# the DuckDB-side bigquery_query() envelope.
|
||||||
# thread — a 10 GiB parquet means 50+ seconds of disk I/O blocking other
|
billing_project = bq.projects.billing
|
||||||
# requests. Hashing here keeps the open-file handle hot from the COPY
|
wrapped_sql = _wrap_admin_sql_for_jobs_api(billing_project, sql)
|
||||||
# round and removes the second read. Devil's-advocate review item.
|
|
||||||
import hashlib
|
|
||||||
h = hashlib.md5()
|
|
||||||
with open(tmp_path, "rb") as f:
|
|
||||||
for chunk in iter(lambda: f.read(8192), b""):
|
|
||||||
h.update(chunk)
|
|
||||||
parquet_hash = h.hexdigest()
|
|
||||||
|
|
||||||
size_bytes = tmp_path.stat().st_size
|
if max_bytes is not None and max_bytes > 0:
|
||||||
os.replace(tmp_path, parquet_path)
|
try:
|
||||||
|
from app.api.v2_scan import _bq_dry_run_bytes # reuse main's impl
|
||||||
|
estimated = _bq_dry_run_bytes(bq, sql) # NB: pass inner SQL (BQ-native)
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(
|
||||||
|
"BQ dry-run failed for materialize cost guardrail (fail-open): %s. "
|
||||||
|
"Proceeding with COPY against `bigquery_query()` wrapping.",
|
||||||
|
e,
|
||||||
|
)
|
||||||
|
estimated = 0
|
||||||
|
if estimated > max_bytes:
|
||||||
|
raise MaterializeBudgetError(
|
||||||
|
f"dry-run estimate {estimated:,} bytes exceeds cap "
|
||||||
|
f"{max_bytes:,} for table {table_id!r}",
|
||||||
|
table_id=table_id,
|
||||||
|
current=estimated,
|
||||||
|
limit=max_bytes,
|
||||||
|
)
|
||||||
|
|
||||||
rows = int(rows)
|
# COPY through a BqAccess-managed session. The session has the BQ
|
||||||
if rows == 0:
|
# extension loaded with a SECRET token; bigquery_query() reuses that
|
||||||
# 0 rows is indistinguishable from "the SQL is wrong and nobody
|
# auth path against the billing_project for the jobs API call.
|
||||||
# noticed" — surface it loudly so operators see it in the scheduler
|
with bq.duckdb_session() as conn:
|
||||||
# log line and the per-row error aggregation. Caller decides whether
|
attached = {
|
||||||
# to alert.
|
r[0] for r in conn.execute(
|
||||||
logger.warning(
|
"SELECT database_name FROM duckdb_databases()"
|
||||||
"Materialized %s produced 0 rows — verify the SQL filter is "
|
).fetchall()
|
||||||
"intentional. Parquet written: %s",
|
}
|
||||||
table_id, parquet_path,
|
if "bq" not in attached:
|
||||||
)
|
conn.execute(
|
||||||
|
f"ATTACH 'project={bq.projects.data}' AS bq (TYPE bigquery, READ_ONLY)"
|
||||||
|
)
|
||||||
|
|
||||||
return {
|
try:
|
||||||
"rows": rows,
|
safe_path = _escape_sql_string_literal(str(tmp_path))
|
||||||
"size_bytes": size_bytes,
|
conn.execute(
|
||||||
"query_mode": "materialized",
|
f"COPY ({wrapped_sql}) TO '{safe_path}' (FORMAT PARQUET)"
|
||||||
"hash": parquet_hash,
|
)
|
||||||
}
|
rows = conn.execute(
|
||||||
|
f"SELECT count(*) FROM read_parquet('{safe_path}')"
|
||||||
|
).fetchone()[0]
|
||||||
|
except Exception:
|
||||||
|
if tmp_path.exists():
|
||||||
|
tmp_path.unlink()
|
||||||
|
raise
|
||||||
|
|
||||||
|
# Compute the parquet hash inline before the atomic swap. The caller used
|
||||||
|
# to re-read the file in `_run_materialized_pass` to hash it via
|
||||||
|
# `_file_hash`, but that's a synchronous full-read on the FastAPI worker
|
||||||
|
# thread — a 10 GiB parquet means 50+ seconds of disk I/O blocking other
|
||||||
|
# requests. Hashing here keeps the open-file handle hot from the COPY
|
||||||
|
# round and removes the second read. Devil's-advocate review item.
|
||||||
|
h = hashlib.md5()
|
||||||
|
with open(tmp_path, "rb") as f:
|
||||||
|
for chunk in iter(lambda: f.read(8192), b""):
|
||||||
|
h.update(chunk)
|
||||||
|
parquet_hash = h.hexdigest()
|
||||||
|
|
||||||
|
size_bytes = tmp_path.stat().st_size
|
||||||
|
os.replace(tmp_path, parquet_path)
|
||||||
|
|
||||||
|
rows = int(rows)
|
||||||
|
if rows == 0:
|
||||||
|
# 0 rows is indistinguishable from "the SQL is wrong and nobody
|
||||||
|
# noticed" — surface it loudly so operators see it in the scheduler
|
||||||
|
# log line and the per-row error aggregation. Caller decides whether
|
||||||
|
# to alert.
|
||||||
|
logger.warning(
|
||||||
|
"Materialized %s produced 0 rows — verify the SQL filter is "
|
||||||
|
"intentional. Parquet written: %s",
|
||||||
|
table_id, parquet_path,
|
||||||
|
)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"rows": rows,
|
||||||
|
"size_bytes": size_bytes,
|
||||||
|
"query_mode": "materialized",
|
||||||
|
"hash": parquet_hash,
|
||||||
|
}
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
file_lock.close() # releases flock
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
# Don't unlink lock_path — its mtime is the TTL signal for
|
||||||
|
# the next reclaim. Leaving it in place is intentional.
|
||||||
|
finally:
|
||||||
|
proc_lock.release()
|
||||||
|
|
||||||
|
|
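The tail of `materialize_query` hashes the temporary parquet file, records its size, and only then swaps it into place with `os.replace`. The sketch below isolates that sequence under simplified assumptions: `publish_atomically` is a hypothetical helper, and the payload is plain bytes rather than a real parquet file.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def publish_atomically(tmp_path: Path, final_path: Path) -> dict:
    """Hash and size the finished temp file, then rename it over the target."""
    h = hashlib.md5()
    with open(tmp_path, "rb") as f:
        # Stream in 8 KiB chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    size_bytes = tmp_path.stat().st_size
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return {"size_bytes": size_bytes, "hash": h.hexdigest()}


with tempfile.TemporaryDirectory() as d:
    tmp = Path(d) / "orders.parquet.tmp"
    final = Path(d) / "orders.parquet"
    tmp.write_bytes(b"not really parquet")

    result = publish_atomically(tmp, final)
    tmp_gone = not tmp.exists()
    final_bytes = final.read_bytes()

print(result["size_bytes"], tmp_gone)
```

Readers of the final path see either the previous file or the complete new one, because the rename is the only step that touches it.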
||||||
def _resolve_bq_project_id() -> str:
|
def _resolve_bq_project_id() -> str:
|
||||||
|
|
|
||||||
|
|
@ -1,6 +1,6 @@
|
||||||
[project]
|
[project]
|
||||||
name = "agnes-the-ai-analyst"
|
name = "agnes-the-ai-analyst"
|
||||||
version = "0.32.0"
|
version = "0.33.0"
|
||||||
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
description = "Agnes — AI Data Analyst platform for AI analytical systems"
|
||||||
requires-python = ">=3.11,<3.14"
|
requires-python = ">=3.11,<3.14"
|
||||||
license = "MIT"
|
license = "MIT"
|
||||||
|
|
|
||||||
79
src/db.py
|
|
@ -39,7 +39,7 @@ def _maybe_instrument(con, db_tag: str):
|
||||||
|
|
||||||
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
|
_SAFE_IDENTIFIER = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
|
||||||
|
|
||||||
SCHEMA_VERSION = 23
|
SCHEMA_VERSION = 24
|
||||||
|
|
||||||
_SYSTEM_SCHEMA = """
|
_SYSTEM_SCHEMA = """
|
||||||
CREATE TABLE IF NOT EXISTS schema_version (
|
CREATE TABLE IF NOT EXISTS schema_version (
|
||||||
|
|
@ -1682,6 +1682,81 @@ _V22_TO_V23_MIGRATIONS = [
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
|
# v24: rewrite materialized BQ source_query from DuckDB-flavor
|
||||||
|
# (bq."<dataset>"."<table>") to BigQuery-native (`<project>.<dataset>.<table>`)
|
||||||
|
# so the new connectors.bigquery.extractor.materialize_query wrapping
|
||||||
|
# path (which routes through bigquery_query() / BQ jobs API) accepts
|
||||||
|
# them. Pre-v24, materialize used Storage Read API for the bq.<ds>.<tbl>
|
||||||
|
# form, which fails for views — see PR for full motivation.
|
||||||
|
#
|
||||||
|
# This migration is implemented in Python (not pure SQL) because the
|
||||||
|
# rewrite is a regex-and-replace per row: the project_id comes from
|
||||||
|
# instance_config (file/env), not the DB. SQL alone can't pull the
|
||||||
|
# project_id and substitute it. If the project isn't configured at
|
||||||
|
# migration time, log a warning per affected row and leave them — the
|
||||||
|
# operator must configure data_source.bigquery.project, restart, and
|
||||||
|
# the migration will fire on next start (idempotent).
|
||||||
|
def _replace_for_v24(project_id: str):
|
||||||
|
"""Build a re.sub replacement function (not a string) so backslash
|
||||||
|
sequences in `project_id` aren't interpreted as group references.
|
||||||
|
GCP project IDs can't actually contain backslashes, but using a
|
||||||
|
function-form replacement is the defensive idiom — it makes the
|
||||||
|
intent explicit and removes the dependency on re.sub's replacement-
|
||||||
|
string escaping rules."""
|
||||||
|
def _repl(m):
|
||||||
|
return f"`{project_id}.{m.group(1)}.{m.group(2)}`"
|
||||||
|
return _repl
|
||||||
|
|
||||||
|
|
||||||
|
def _v23_to_v24_finalize(conn: duckdb.DuckDBPyConnection) -> None:
|
||||||
|
import re as _re
|
||||||
|
|
||||||
|
try:
|
||||||
|
from app.instance_config import get_value
|
||||||
|
project_id = get_value("data_source", "bigquery", "project", default="") or ""
|
||||||
|
except Exception:
|
||||||
|
project_id = ""
|
||||||
|
|
||||||
|
pattern = _re.compile(r'bq\."([^"]+)"\."([^"]+)"')
|
||||||
|
|
||||||
|
rows = conn.execute(
|
||||||
|
"SELECT id, source_query FROM table_registry "
|
||||||
|
"WHERE query_mode = 'materialized' "
|
||||||
|
"AND source_query LIKE '%bq.\"%' "
|
||||||
|
"AND source_type = 'bigquery'"
|
||||||
|
).fetchall()
|
||||||
|
|
||||||
|
if not rows:
|
||||||
|
return # Nothing to migrate; skip the transaction.
|
||||||
|
|
||||||
|
conn.execute("BEGIN TRANSACTION")
|
||||||
|
try:
|
||||||
|
for row_id, sq in rows:
|
||||||
|
if sq is None:
|
||||||
|
continue
|
||||||
|
if not project_id:
|
||||||
|
logger.warning(
|
||||||
|
"v24 migration: skipping rewrite of source_query for row %r — "
|
||||||
|
"data_source.bigquery.project is not configured. Set it via "
|
||||||
|
"/admin/server-config and restart the app to retry the "
|
||||||
|
"migration.", row_id,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
new_sq = pattern.sub(_replace_for_v24(project_id), sq)
|
||||||
|
if new_sq != sq:
|
||||||
|
conn.execute(
|
||||||
|
"UPDATE table_registry SET source_query = ? WHERE id = ?",
|
||||||
|
[new_sq, row_id],
|
||||||
|
)
|
||||||
|
logger.info(
|
||||||
|
"v24 migration: rewrote source_query for row %r", row_id,
|
||||||
|
)
|
||||||
|
conn.execute("COMMIT")
|
||||||
|
except Exception:
|
||||||
|
conn.execute("ROLLBACK")
|
||||||
|
raise
|
||||||
|
|
||||||
|
|
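The v24 rewrite can be exercised on its own, without a database. This sketch reuses the regex and the function-form replacement from the migration above; the `rewrite_source_query` wrapper, the project ID, and the sample query are illustrative assumptions.

```python
import re

# Same pattern as the migration: bq."<dataset>"."<table>"
PATTERN = re.compile(r'bq\."([^"]+)"\."([^"]+)"')


def make_replacement(project_id: str):
    """Return a replacement callable so project_id is inserted verbatim."""
    def _repl(m: re.Match) -> str:
        return f"`{project_id}.{m.group(1)}.{m.group(2)}`"
    return _repl


def rewrite_source_query(source_query: str, project_id: str) -> str:
    """Convert DuckDB-flavor references to BigQuery-native backtick form."""
    return PATTERN.sub(make_replacement(project_id), source_query)


old = (
    'SELECT o.id FROM bq."sales"."orders" o '
    'JOIN bq."crm"."accounts" a ON o.acct = a.id'
)
new = rewrite_source_query(old, "my-project")
again = rewrite_source_query(new, "my-project")
print(new)
```

Running the rewrite a second time leaves the query unchanged, since the converted form no longer matches the pattern. That matches the idempotence the migration comment relies on.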
||||||
def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
    """Create tables if they don't exist. Apply migrations if schema version changed.

@ -1837,6 +1912,8 @@ def _ensure_schema(conn: duckdb.DuckDBPyConnection) -> None:
    if current < 23:
        for sql in _V22_TO_V23_MIGRATIONS:
            conn.execute(sql)
    if current < 24:
        _v23_to_v24_finalize(conn)
    conn.execute(
        "UPDATE schema_version SET version = ?, applied_at = current_timestamp",
        [SCHEMA_VERSION],
|
||||||
|
|
|
||||||
|
|
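The `_v23_to_v24_finalize` step in the `src/db.py` hunk rewrites DuckDB-flavor `bq."dataset"."table"` references in stored `source_query` values. The body of `_replace_for_v24` is not visible in this diff, so the following is only a minimal sketch of that rewrite: `make_replacer` and `rewrite_source_query` are hypothetical names, and the backtick-quoted `project.dataset.table` output format is an assumption inferred from the tests later in this PR.

```python
import re

# Same pattern the migration compiles: bq."<dataset>"."<table>"
PATTERN = re.compile(r'bq\."([^"]+)"\."([^"]+)"')


def make_replacer(project_id: str):
    """Hypothetical stand-in for _replace_for_v24 (its body is not in this diff)."""
    def _repl(match: re.Match) -> str:
        dataset, table = match.group(1), match.group(2)
        # Assumed output format: BigQuery-native backtick-quoted path.
        return f"`{project_id}.{dataset}.{table}`"
    return _repl


def rewrite_source_query(sql: str, project_id: str) -> str:
    """Rewrite every bq."ds"."tbl" reference and leave all other SQL untouched."""
    return PATTERN.sub(make_replacer(project_id), sql)


old = 'SELECT date FROM bq."ds"."orders" WHERE x = 1'
new = rewrite_source_query(old, "my-test-project")
print(new)
```

Passing a callable to `pattern.sub` keeps the project id out of the regex itself, which is why the migration can build the replacer once per configured project and reuse it for every row.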
@ -2,6 +2,7 @@
|
||||||
|
|
||||||
import os
|
import os
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
|
from unittest.mock import MagicMock
|
||||||
|
|
||||||
import duckdb
|
import duckdb
|
||||||
import pytest
|
import pytest
|
||||||
|
|
@ -319,3 +320,68 @@ from tests.fixtures.analyst_bootstrap import ( # noqa: E402,F401
|
||||||
web_session,
|
web_session,
|
||||||
zero_grants_workspace,
|
zero_grants_workspace,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
def bq_instance(monkeypatch):
    """Force instance.yaml to look like a BigQuery deployment for the
    duration of one test. Patches the cached load_instance_config so
    /admin/server-config reads and
    get_value("data_source", "bigquery", "project") return what we want,
    without touching the on-disk instance.yaml.

    Tests that need BigQuery-specific admin API behaviour (project_id
    validation, materialized source_query checks, etc.) depend on this
    fixture. Yields the fake config dict so callers can inspect it.

    Note: several test files (test_admin_bq_register.py,
    test_admin_tables_ui_materialized.py, …) define their own local
    ``bq_instance`` fixture. Those local definitions shadow this one
    inside those files; the conftest copy is the canonical provider for
    any new test file, which picks it up through pytest's conftest
    auto-discovery (no import needed)."""
    fake_cfg = {
        "data_source": {
            "type": "bigquery",
            "bigquery": {"project": "my-test-project", "location": "us"},
        },
    }
    monkeypatch.setattr(
        "app.instance_config.load_instance_config",
        lambda: fake_cfg,
        raising=False,
    )
    from app.instance_config import reset_cache
    reset_cache()
    yield fake_cfg
    reset_cache()
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def stub_bq_extractor(monkeypatch):
|
||||||
|
"""Mirror tests/test_admin_bq_register.py — bypasses real-BQ traffic
|
||||||
|
in the post-register rebuild path so the test stays offline. Required
|
||||||
|
whenever the test seeds a remote-mode BQ row via the HTTP API.
|
||||||
|
|
||||||
|
Patches:
|
||||||
|
- ``connectors.bigquery.extractor.rebuild_from_registry`` — returns a
|
||||||
|
minimal success dict so the admin register endpoint's 200/201 path
|
||||||
|
completes without touching a real BQ project.
|
||||||
|
- ``src.orchestrator.SyncOrchestrator`` — replaced with a no-op mock so
|
||||||
|
the post-register orchestrator.rebuild() call doesn't scan the
|
||||||
|
(empty) extracts directory during tests.
|
||||||
|
|
||||||
|
Returns the ``rebuild_from_registry`` MagicMock directly so callers
|
||||||
|
that only need the side-effect patcher can ignore the return value,
|
||||||
|
and callers that want to assert call args can inspect it."""
|
||||||
|
rebuild_mock = MagicMock(return_value={
|
||||||
|
"project_id": "my-test-project",
|
||||||
|
"tables_registered": 1, "errors": [], "skipped": False,
|
||||||
|
})
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"connectors.bigquery.extractor.rebuild_from_registry",
|
||||||
|
rebuild_mock,
|
||||||
|
)
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"src.orchestrator.SyncOrchestrator",
|
||||||
|
lambda *a, **kw: MagicMock(),
|
||||||
|
)
|
||||||
|
return rebuild_mock
|
||||||
|
|
|
||||||
|
|
@ -0,0 +1,90 @@
|
||||||
"""When admin registers a materialized BQ row with bucket+source_table
but NO source_query, the server generates the source_query from the
configured BQ project + the supplied bucket/source_table. Admin never
has to know about bigquery_query() syntax for the trivial full-table
dump case.

Fixtures `seeded_app`, `bq_instance`, `stub_bq_extractor` are auto-
discovered from `tests/conftest.py`; DO NOT import. `seeded_app`
is a dict: `{"client": TestClient, "admin_token": str, ...}`.
"""
from __future__ import annotations

import pytest


def _auth(token: str) -> dict:
    """Mirror the project's local _auth helper used in every materialized
    test file (e.g. test_api_admin_materialized.py)."""
    return {"Authorization": f"Bearer {token}"}


def test_register_materialized_with_bucket_only_generates_source_query(
    seeded_app, bq_instance, stub_bq_extractor,
):
    client = seeded_app["client"]
    headers = _auth(seeded_app["admin_token"])
    payload = {
        "name": "trivial_full_dump",
        "source_type": "bigquery",
        "query_mode": "materialized",
        "bucket": "analytics",
        "source_table": "orders_v2",
    }
    resp = client.post("/api/admin/register-table", json=payload, headers=headers)
    assert resp.status_code in (200, 201, 202), resp.text

    reg = client.get("/api/admin/registry", headers=headers).json()
    row = next(t for t in reg["tables"] if t["id"] == "trivial_full_dump")
    expected_project = bq_instance["data_source"]["bigquery"]["project"]
    assert row["source_query"] == (
        f"SELECT * FROM `{expected_project}.analytics.orders_v2`"
    )


def test_register_materialized_with_explicit_source_query_persists_verbatim(
    seeded_app, bq_instance, stub_bq_extractor,
):
    client = seeded_app["client"]
    headers = _auth(seeded_app["admin_token"])
    custom = "SELECT col1, col2 FROM `analytics.orders_v2` WHERE col3 = 'x'"
    payload = {
        "name": "explicit_sql",
        "source_type": "bigquery",
        "query_mode": "materialized",
        "source_query": custom,
    }
    resp = client.post("/api/admin/register-table", json=payload, headers=headers)
    assert resp.status_code in (200, 201, 202), resp.text

    reg = client.get("/api/admin/registry", headers=headers).json()
    row = next(t for t in reg["tables"] if t["id"] == "explicit_sql")
    assert row["source_query"] == custom


def test_put_flip_to_materialized_with_bucket_generates_source_query(
    seeded_app, bq_instance, stub_bq_extractor,
):
    client = seeded_app["client"]
    headers = _auth(seeded_app["admin_token"])
    # First register as remote.
    client.post(
        "/api/admin/register-table",
        json={"name": "flip_t", "source_type": "bigquery",
              "bucket": "analytics", "source_table": "orders_v2"},
        headers=headers,
    )
    # PUT to flip to materialized without supplying source_query.
    resp = client.put(
        "/api/admin/registry/flip_t",
        json={"query_mode": "materialized"},
        headers=headers,
    )
    assert resp.status_code == 200, resp.text

    reg = client.get("/api/admin/registry", headers=headers).json()
    row = next(t for t in reg["tables"] if t["id"] == "flip_t")
    expected_project = bq_instance["data_source"]["bigquery"]["project"]
    assert row["source_query"] == (
        f"SELECT * FROM `{expected_project}.analytics.orders_v2`"
    )
|
||||||
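The tests above pin down the observable contract: an explicit `source_query` is stored verbatim, and a missing one is generated from the configured project plus `bucket` and `source_table`. The server-side code is not part of this excerpt, so the following is a rough sketch under assumptions: `build_trivial_source_query` and `resolve_source_query` are hypothetical names, and the rejection of empty or backtick-containing parts is an illustrative safeguard rather than confirmed behaviour.

```python
def build_trivial_source_query(project: str, bucket: str, source_table: str) -> str:
    """Hypothetical helper: full-table dump in BigQuery-native syntax."""
    for part in (project, bucket, source_table):
        # Assumed safeguard: an empty part or a backtick would break the quoting.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"SELECT * FROM `{project}.{bucket}.{source_table}`"


def resolve_source_query(payload: dict, project: str) -> str:
    """Keep an explicit source_query verbatim; otherwise generate one."""
    explicit = payload.get("source_query")
    if explicit:
        return explicit
    return build_trivial_source_query(
        project, payload["bucket"], payload["source_table"]
    )


generated = resolve_source_query(
    {"bucket": "analytics", "source_table": "orders_v2"}, "my-test-project"
)
verbatim = resolve_source_query(
    {"source_query": "SELECT col1 FROM `analytics.orders_v2`"}, "my-test-project"
)
```

Generating the query at write time (register or PUT) rather than at sync time means the registry row always shows the SQL that will actually run.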
179
tests/test_admin_server_config_materialize_section.py
Normal file
|
|
@ -0,0 +1,179 @@
|
||||||
|
"""/api/admin/server-config exposes materialize.lock_ttl_seconds and
|
||||||
|
accepts updates. Default is 86400 (24h).
|
||||||
|
|
||||||
|
Fixture `seeded_app` is auto-discovered from `tests/conftest.py` —
|
||||||
|
DO NOT import. It returns a dict: `{"client": TestClient,
|
||||||
|
"admin_token": str, ...}`. Auth helper `_auth(token)` mirrors the
|
||||||
|
project's local pattern (also used in test_api_admin_materialized.py).
|
||||||
|
|
||||||
|
Behaviour contract:
|
||||||
|
- GET returns `materialize` section in `sections` (empty dict when no
|
||||||
|
override is set, since the endpoint surfaces every editable section).
|
||||||
|
- GET also exposes the known_fields registry entry for `materialize`
|
||||||
|
with `lock_ttl_seconds` spec (kind=int, default=86400).
|
||||||
|
- POST with a valid value persists it and GET returns the new value.
|
||||||
|
- POST with lock_ttl_seconds < 60 or > 604800 is rejected with 422.
|
||||||
|
"""
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
import yaml
|
||||||
|
|
||||||
|
|
||||||
|
def _auth(token: str) -> dict:
|
||||||
|
return {"Authorization": f"Bearer {token}"}
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# GET — default state
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_returns_materialize_in_editable_sections(seeded_app):
|
||||||
|
"""materialize must appear in editable_sections."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.get("/api/admin/server-config", headers=headers)
|
||||||
|
assert resp.status_code == 200
|
||||||
|
body = resp.json()
|
||||||
|
assert "materialize" in body["editable_sections"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_returns_materialize_section_key(seeded_app):
|
||||||
|
"""materialize key appears in sections (empty dict when no override set)."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.get("/api/admin/server-config", headers=headers)
|
||||||
|
assert resp.status_code == 200
|
||||||
|
body = resp.json()
|
||||||
|
# The endpoint surfaces every editable section so the UI can render it.
|
||||||
|
assert "materialize" in body["sections"]
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_returns_materialize_known_fields(seeded_app):
|
||||||
|
"""known_fields must have a materialize.lock_ttl_seconds entry."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.get("/api/admin/server-config", headers=headers)
|
||||||
|
assert resp.status_code == 200
|
||||||
|
body = resp.json()
|
||||||
|
mat_fields = body.get("known_fields", {}).get("materialize", {})
|
||||||
|
assert "lock_ttl_seconds" in mat_fields, body.get("known_fields", {})
|
||||||
|
spec = mat_fields["lock_ttl_seconds"]
|
||||||
|
assert spec["kind"] == "int"
|
||||||
|
assert spec["default"] == 86400
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# POST — update and read back
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_post_updates_materialize_lock_ttl(seeded_app, tmp_path, monkeypatch):
|
||||||
|
"""POST with a valid value persists; GET reflects the new value."""
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
state = tmp_path / "state"
|
||||||
|
state.mkdir(parents=True, exist_ok=True)
|
||||||
|
import app.instance_config as ic
|
||||||
|
ic._instance_config = None
|
||||||
|
try:
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": 3600}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 200, resp.text
|
||||||
|
|
||||||
|
# Verify on disk.
|
||||||
|
loaded = yaml.safe_load((state / "instance.yaml").read_text())
|
||||||
|
assert loaded["materialize"]["lock_ttl_seconds"] == 3600
|
||||||
|
|
||||||
|
# Verify GET reflects the new value.
|
||||||
|
ic._instance_config = None
|
||||||
|
resp2 = client.get("/api/admin/server-config", headers=headers)
|
||||||
|
assert resp2.json()["sections"]["materialize"]["lock_ttl_seconds"] == 3600
|
||||||
|
finally:
|
||||||
|
ic._instance_config = None
|
||||||
|
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# POST — validation
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def test_invalid_lock_ttl_below_min_rejected(seeded_app):
|
||||||
|
"""lock_ttl_seconds < 60 is rejected with 422."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": -5}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 422, resp.text
|
||||||
|
|
||||||
|
|
||||||
|
def test_invalid_lock_ttl_zero_rejected(seeded_app):
|
||||||
|
"""lock_ttl_seconds=0 is rejected with 422 (below the 60s floor)."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": 0}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 422, resp.text
|
||||||
|
|
||||||
|
|
||||||
|
def test_invalid_lock_ttl_above_max_rejected(seeded_app):
|
||||||
|
"""lock_ttl_seconds > 604800 (1 week) is rejected with 422."""
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": 604801}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 422, resp.text
|
||||||
|
|
||||||
|
|
||||||
|
def test_valid_lock_ttl_boundary_min_accepted(seeded_app, tmp_path, monkeypatch):
|
||||||
|
"""lock_ttl_seconds=60 (minimum) is accepted."""
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
state = tmp_path / "state"
|
||||||
|
state.mkdir(parents=True, exist_ok=True)
|
||||||
|
import app.instance_config as ic
|
||||||
|
ic._instance_config = None
|
||||||
|
try:
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": 60}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 200, resp.text
|
||||||
|
finally:
|
||||||
|
ic._instance_config = None
|
||||||
|
|
||||||
|
|
||||||
|
def test_valid_lock_ttl_boundary_max_accepted(seeded_app, tmp_path, monkeypatch):
|
||||||
|
"""lock_ttl_seconds=604800 (maximum) is accepted."""
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
state = tmp_path / "state"
|
||||||
|
state.mkdir(parents=True, exist_ok=True)
|
||||||
|
import app.instance_config as ic
|
||||||
|
ic._instance_config = None
|
||||||
|
try:
|
||||||
|
client = seeded_app["client"]
|
||||||
|
headers = _auth(seeded_app["admin_token"])
|
||||||
|
resp = client.post(
|
||||||
|
"/api/admin/server-config",
|
||||||
|
json={"sections": {"materialize": {"lock_ttl_seconds": 604800}}},
|
||||||
|
headers=headers,
|
||||||
|
)
|
||||||
|
assert resp.status_code == 200, resp.text
|
||||||
|
finally:
|
||||||
|
ic._instance_config = None
|
||||||
|
|
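The behaviour contract in the test module above fixes three numbers for `materialize.lock_ttl_seconds`: a default of 86400, a floor of 60, and a ceiling of 604800. A minimal sketch of a range check matching those tests follows. The function and constant names are hypothetical, and how the real endpoint turns a failed check into a 422 response is not shown in this diff.

```python
LOCK_TTL_MIN_SECONDS = 60          # floor exercised by the tests
LOCK_TTL_MAX_SECONDS = 604800      # one week
LOCK_TTL_DEFAULT_SECONDS = 86400   # 24h default from the known_fields spec


def validate_lock_ttl(value: object) -> int:
    """Return the value if it is an int within [60, 604800], else raise."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("lock_ttl_seconds must be an integer")
    if not LOCK_TTL_MIN_SECONDS <= value <= LOCK_TTL_MAX_SECONDS:
        raise ValueError(
            f"lock_ttl_seconds must be between {LOCK_TTL_MIN_SECONDS} "
            f"and {LOCK_TTL_MAX_SECONDS}, got {value}"
        )
    return value


def effective_lock_ttl(materialize_section: dict) -> int:
    """Fall back to the default when the section carries no override."""
    return validate_lock_ttl(
        materialize_section.get("lock_ttl_seconds", LOCK_TTL_DEFAULT_SECONDS)
    )
```

Both boundaries are inclusive here, which is what the `boundary_min_accepted` and `boundary_max_accepted` tests require.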
@ -0,0 +1,39 @@
|
||||||
"""Backtick-quoted identifiers are required for materialized BQ source_query
(when the dataset/table/project name contains a dash). The validator must
allow them on materialized rows but still reject on remote/local."""
from __future__ import annotations
import pytest
from pydantic import ValidationError

from app.api.admin import RegisterTableRequest


def test_materialized_accepts_backticks():
    req = RegisterTableRequest(
        name="b1",
        source_type="bigquery",
        query_mode="materialized",
        source_query="SELECT * FROM `my-project.ds.tbl`",
    )
    assert req.source_query == "SELECT * FROM `my-project.ds.tbl`"


def test_remote_rejects_backticks():
    with pytest.raises(ValidationError):
        RegisterTableRequest(
            name="r1",
            source_type="bigquery",
            query_mode="remote",
            bucket="ds", source_table="tbl",
            source_query="SELECT * FROM `prj.ds.tbl`",
        )


def test_local_rejects_backticks():
    with pytest.raises(ValidationError):
        RegisterTableRequest(
            name="l1",
            source_type="keboola",
            query_mode="local",
            source_query="SELECT * FROM `kbc.ds.tbl`",
        )
|
||||||
|
|
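The three model-layer tests above describe a rule that depends only on `query_mode`: backticks pass on materialized rows and fail on remote and local rows. The real check lives in the pydantic model `RegisterTableRequest`, whose validator is not shown in this diff. The sketch below restates the rule in plain Python; `check_source_query` and `is_accepted` are hypothetical names and the error text is illustrative.

```python
from typing import Optional

BACKTICK = "`"


def check_source_query(query_mode: str, source_query: Optional[str]) -> Optional[str]:
    """Allow backticks only when query_mode is 'materialized'."""
    if source_query is None:
        return None
    if BACKTICK in source_query and query_mode != "materialized":
        raise ValueError(
            "backtick-quoted identifiers are only accepted on materialized "
            "rows; use double-quoted identifiers for remote/local rows"
        )
    return source_query


def is_accepted(query_mode: str, source_query: str) -> bool:
    """Convenience wrapper so callers can branch instead of catching."""
    try:
        check_source_query(query_mode, source_query)
    except ValueError:
        return False
    return True
```

Note that the rule says nothing about `source_type`, which is consistent with the Keboola materialized test elsewhere in this PR: invalid SQL on such rows surfaces later as a sync error, not at registration.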
@ -18,8 +18,6 @@ Covers PR #145 (re-implementation against 0.24.0 base):
|
||||||
Shares the seeded_app + bq_instance fixtures from conftest /
|
Shares the seeded_app + bq_instance fixtures from conftest /
|
||||||
test_admin_bq_register.py for parity with the existing BQ test surface.
|
test_admin_bq_register.py for parity with the existing BQ test surface.
|
||||||
"""
|
"""
|
||||||
from unittest.mock import MagicMock
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -27,59 +25,15 @@ def _auth(token):
|
||||||
return {"Authorization": f"Bearer {token}"}
|
return {"Authorization": f"Bearer {token}"}
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def stub_bq_extractor(monkeypatch):
|
|
||||||
"""Mirror tests/test_admin_bq_register.py — bypasses real-BQ traffic
|
|
||||||
in the post-register rebuild path so the test stays offline. Required
|
|
||||||
whenever the test seeds a remote-mode BQ row via the HTTP API."""
|
|
||||||
rebuild_mock = MagicMock(return_value={
|
|
||||||
"project_id": "my-test-project",
|
|
||||||
"tables_registered": 1, "errors": [], "skipped": False,
|
|
||||||
})
|
|
||||||
monkeypatch.setattr(
|
|
||||||
"connectors.bigquery.extractor.rebuild_from_registry",
|
|
||||||
rebuild_mock,
|
|
||||||
)
|
|
||||||
monkeypatch.setattr(
|
|
||||||
"src.orchestrator.SyncOrchestrator",
|
|
||||||
lambda *a, **kw: MagicMock(),
|
|
||||||
)
|
|
||||||
return rebuild_mock
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def bq_instance(monkeypatch):
|
|
||||||
"""Force instance.yaml to look like a BigQuery deployment.
|
|
||||||
|
|
||||||
Mirrors tests/test_admin_bq_register.py::bq_instance so the
|
|
||||||
project_id read inside _validate_bigquery_register_payload succeeds.
|
|
||||||
"""
|
|
||||||
fake_cfg = {
|
|
||||||
"data_source": {
|
|
||||||
"type": "bigquery",
|
|
||||||
"bigquery": {"project": "my-test-project", "location": "us"},
|
|
||||||
},
|
|
||||||
}
|
|
||||||
monkeypatch.setattr(
|
|
||||||
"app.instance_config.load_instance_config",
|
|
||||||
lambda: fake_cfg,
|
|
||||||
raising=False,
|
|
||||||
)
|
|
||||||
from app.instance_config import reset_cache
|
|
||||||
reset_cache()
|
|
||||||
yield fake_cfg
|
|
||||||
reset_cache()
|
|
||||||
|
|
||||||
|
|
||||||
def _materialized_payload(**overrides):
|
def _materialized_payload(**overrides):
|
||||||
p = {
|
p = {
|
||||||
"name": "orders_90d",
|
"name": "orders_90d",
|
||||||
"source_type": "bigquery",
|
"source_type": "bigquery",
|
||||||
"query_mode": "materialized",
|
"query_mode": "materialized",
|
||||||
# DuckDB-flavor SQL (not BQ-native backticks) — the materialize path
|
# BQ-native or DuckDB-flavor SQL — both accepted since Task 2 wraps
|
||||||
# runs the SQL through the DuckDB BQ extension's COPY which uses
|
# materialized SQL in bigquery_query() (BQ jobs API path). Backtick
|
||||||
# double-quoted identifiers. Backticks are now rejected at register
|
# identifiers are now allowed for materialized rows; remote/local rows
|
||||||
# time. See `_BACKTICK_REJECTION_MESSAGE` in app/api/admin.py.
|
# still require DuckDB-flavor (double-quoted) identifiers.
|
||||||
"source_query": 'SELECT date FROM bq."ds"."orders"',
|
"source_query": 'SELECT date FROM bq."ds"."orders"',
|
||||||
"sync_schedule": "every 6h",
|
"sync_schedule": "every 6h",
|
||||||
}
|
}
|
||||||
|
|
@ -326,36 +280,44 @@ def test_register_materialized_persists_source_query_in_registry(seeded_app, bq_
|
||||||
assert "WHERE x = 1" in row["source_query"]
|
assert "WHERE x = 1" in row["source_query"]
|
||||||
|
|
||||||
|
|
||||||
# --- Backtick (BigQuery-native) source_query rejection -----------------------
|
# --- Backtick (BigQuery-native) source_query handling ------------------------
|
||||||
#
|
#
|
||||||
# DuckDB BQ extension's COPY path interprets the SQL through DuckDB's parser,
|
# Task 2 (materialize-sync-fix) changed the BQ materialization path to run
|
||||||
# which does NOT understand backtick-quoted identifiers (it uses double quotes
|
# admin SQL through the BQ jobs API (bigquery_query() wrapper) rather than
|
||||||
# for quoted identifiers). A registered backtick-style source_query like
|
# through DuckDB's BQ extension COPY path. BQ-native SQL requires backticks
|
||||||
# `SELECT * FROM \`prj.ds.t\`` either parse-errors or returns 0 rows at next
|
# for dashed project/dataset/table identifiers. The backtick guard has been
|
||||||
# materialize tick — silently — and no parquet ends up at the canonical path.
|
# relaxed for ALL materialized rows: the validator now only rejects backticks
|
||||||
# Reject at registration time with an actionable message.
|
# for remote/local rows (DuckDB-flavor SQL contract). Materialized rows must
|
||||||
|
# be allowed to carry backticks so operators can reference dashed identifiers.
|
||||||
|
# See test_admin_validator_backtick_relaxed_for_materialized.py for the
|
||||||
|
# model-layer unit tests.
|
||||||
|
|
||||||
|
|
||||||
def test_register_materialized_rejects_backtick_source_query(seeded_app, bq_instance):
|
def test_register_materialized_accepts_backtick_source_query(seeded_app, bq_instance, stub_bq_extractor):
|
||||||
|
"""BQ materialized rows now accept BQ-native backtick syntax; the
|
||||||
|
materialize path (Task 2) wraps them in bigquery_query() which uses
|
||||||
|
the BQ jobs API — not DuckDB's COPY — so backticks are valid."""
|
||||||
c = seeded_app["client"]
|
c = seeded_app["client"]
|
||||||
token = seeded_app["admin_token"]
|
token = seeded_app["admin_token"]
|
||||||
r = c.post(
|
r = c.post(
|
||||||
"/api/admin/register-table",
|
"/api/admin/register-table",
|
||||||
json=_materialized_payload(
|
json=_materialized_payload(
|
||||||
name="bt_native",
|
name="bt_native",
|
||||||
source_query="SELECT * FROM `prj-grp.ds.product_inventory`",
|
source_query="SELECT * FROM `my-project.ds.product_inventory`",
|
||||||
),
|
),
|
||||||
headers=_auth(token),
|
headers=_auth(token),
|
||||||
)
|
)
|
||||||
assert r.status_code == 422, r.json()
|
assert r.status_code in (200, 201, 202), r.json()
|
||||||
detail = str(r.json().get("detail", "")).lower()
|
reg = c.get("/api/admin/registry", headers=_auth(token)).json()
|
||||||
assert "backtick" in detail
|
row = next(t for t in reg["tables"] if t["id"] == "bt_native")
|
||||||
assert 'bq."' in detail or "duckdb" in detail
|
assert row["source_query"] == "SELECT * FROM `my-project.ds.product_inventory`"
|
||||||
|
|
||||||
|
|
||||||
def test_update_materialized_rejects_backtick_source_query(
|
def test_update_materialized_accepts_backtick_source_query(
|
||||||
seeded_app, bq_instance, stub_bq_extractor,
|
seeded_app, bq_instance, stub_bq_extractor,
|
||||||
):
|
):
|
||||||
|
"""PUT to a materialized BQ row may switch source_query to BQ-native
|
||||||
|
backtick form — accepted now that Task 2 wraps via jobs API."""
|
||||||
c = seeded_app["client"]
|
c = seeded_app["client"]
|
||||||
token = seeded_app["admin_token"]
|
token = seeded_app["admin_token"]
|
||||||
|
|
||||||
|
|
@ -370,7 +332,7 @@ def test_update_materialized_rejects_backtick_source_query(
|
||||||
assert r.status_code == 201, r.json()
|
assert r.status_code == 201, r.json()
|
||||||
table_id = r.json()["id"]
|
table_id = r.json()["id"]
|
||||||
|
|
||||||
# PATCH the source_query to a backtick form — must be rejected.
|
# PUT the source_query in BQ-native backtick form; this is now accepted.
|
||||||
r2 = c.put(
|
r2 = c.put(
|
||||||
f"/api/admin/registry/{table_id}",
|
f"/api/admin/registry/{table_id}",
|
||||||
json={
|
json={
|
||||||
|
|
@ -379,14 +341,17 @@ def test_update_materialized_rejects_backtick_source_query(
|
||||||
},
|
},
|
||||||
headers=_auth(token),
|
headers=_auth(token),
|
||||||
)
|
)
|
||||||
assert r2.status_code == 422, r2.json()
|
assert r2.status_code == 200, r2.json()
|
||||||
detail = str(r2.json().get("detail", "")).lower()
|
reg = c.get("/api/admin/registry", headers=_auth(token)).json()
|
||||||
assert "backtick" in detail
|
row = next(t for t in reg["tables"] if t["id"] == table_id)
|
||||||
|
assert row["source_query"] == "SELECT * FROM `prj.ds.t`"
|
||||||
|
|
||||||
|
|
||||||
def test_register_materialized_keboola_rejects_backtick_source_query(seeded_app):
|
def test_register_materialized_keboola_accepts_backtick_source_query(seeded_app):
|
||||||
"""The check is generic, not BQ-only — Keboola materialized rows that
|
"""Keboola materialized rows also accept backtick source_query at register
|
||||||
include backticks would also be silently skipped at materialize time."""
|
time — the backtick guard now only applies to remote/local rows. If the
|
||||||
|
SQL is invalid at runtime (DuckDB parse error), that surfaces as a sync
|
||||||
|
error, not a registration error."""
|
||||||
c = seeded_app["client"]
|
c = seeded_app["client"]
|
||||||
token = seeded_app["admin_token"]
|
token = seeded_app["admin_token"]
|
||||||
r = c.post(
|
r = c.post(
|
||||||
|
|
@ -399,9 +364,7 @@ def test_register_materialized_keboola_rejects_backtick_source_query(seeded_app)
|
||||||
},
|
},
|
||||||
headers=_auth(token),
|
headers=_auth(token),
|
||||||
)
|
)
|
||||||
assert r.status_code == 422, r.json()
|
assert r.status_code == 201, r.json()
|
||||||
detail = str(r.json().get("detail", "")).lower()
|
|
||||||
assert "backtick" in detail
|
|
||||||
|
|
||||||
|
|
||||||
# --- Surface materialize errors per-row ---------------------------------------
|
# --- Surface materialize errors per-row ---------------------------------------
|
||||||
|
|
|
||||||
|
|
@ -18,7 +18,13 @@ from connectors.bigquery.extractor import materialize_query, MaterializeBudgetEr
|
||||||
|
|
||||||
def _bq_with_seed(tables: dict[str, str] | None = None) -> BqAccess:
|
def _bq_with_seed(tables: dict[str, str] | None = None) -> BqAccess:
|
||||||
"""Stub BqAccess seeded with in-memory tables (same recipe as
|
"""Stub BqAccess seeded with in-memory tables (same recipe as
|
||||||
test_bq_materialize)."""
|
test_bq_materialize).
|
||||||
|
|
||||||
|
A `bigquery_query(project, sql_text)` table macro is registered so the
|
||||||
|
wrapping added by `_wrap_admin_sql_for_jobs_api` (Task 2 — routes COPY
|
||||||
|
through the BQ jobs API for views) resolves against the in-memory tables
|
||||||
|
without needing the real BQ extension.
|
||||||
|
"""
|
||||||
tables = tables or {}
|
tables = tables or {}
|
||||||
|
|
||||||
@contextmanager
|
@contextmanager
|
||||||
|
|
@ -30,6 +36,12 @@ def _bq_with_seed(tables: dict[str, str] | None = None) -> BqAccess:
|
||||||
conn.execute(f"CREATE SCHEMA IF NOT EXISTS {s}")
|
conn.execute(f"CREATE SCHEMA IF NOT EXISTS {s}")
|
||||||
for ref, body in tables.items():
|
for ref, body in tables.items():
|
||||||
conn.execute(f"CREATE OR REPLACE TABLE {ref} AS {body}")
|
conn.execute(f"CREATE OR REPLACE TABLE {ref} AS {body}")
|
||||||
|
# Stub bigquery_query() so materialize_query's wrapped COPY works
|
||||||
|
# against the in-memory bq catalog without the real BQ extension.
|
||||||
|
conn.execute(
|
||||||
|
"CREATE OR REPLACE MACRO bigquery_query(project, sql_text) "
|
||||||
|
"AS TABLE SELECT * FROM query(sql_text)"
|
||||||
|
)
|
||||||
yield conn
|
yield conn
|
||||||
finally:
|
finally:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
@ -116,22 +128,26 @@ def test_zero_max_bytes_skips_dry_run(tmp_path):
|
||||||
assert stats["rows"] == 1
|
assert stats["rows"] == 1
|
||||||
|
|
||||||
|
|
||||||
def test_dry_run_failure_is_fail_open(tmp_path):
|
def test_dry_run_failure_is_fail_open(tmp_path, caplog):
|
||||||
"""If the dry-run errors (DuckDB syntax, missing google lib, transient
|
"""If the dry-run errors (DuckDB syntax, missing google lib, transient
|
||||||
upstream failure) we don't block — log + proceed with COPY. Operators
|
upstream failure) we don't block — log + proceed with COPY. Operators
|
||||||
who need hard-fail watch logs for the warning."""
|
who need hard-fail watch logs for the warning."""
|
||||||
|
import logging
|
||||||
|
|
||||||
out = tmp_path / "extracts" / "bigquery"
|
out = tmp_path / "extracts" / "bigquery"
|
||||||
out.mkdir(parents=True)
|
out.mkdir(parents=True)
|
||||||
bq = _bq_with_seed({"bq.test.tiny": "SELECT 1 AS n"})
|
bq = _bq_with_seed({"bq.test.tiny": "SELECT 1 AS n"})
|
||||||
|
|
||||||
with patch(
|
with caplog.at_level(logging.WARNING, logger="connectors.bigquery.extractor"):
|
||||||
"app.api.v2_scan._bq_dry_run_bytes", side_effect=RuntimeError("boom")
|
with patch(
|
||||||
):
|
"app.api.v2_scan._bq_dry_run_bytes", side_effect=RuntimeError("boom")
|
||||||
stats = materialize_query(
|
):
|
||||||
table_id="t1",
|
stats = materialize_query(
|
||||||
sql="SELECT * FROM bq.test.tiny",
|
table_id="t1",
|
||||||
bq=bq,
|
sql="SELECT * FROM bq.test.tiny",
|
||||||
output_dir=str(out),
|
bq=bq,
|
||||||
max_bytes=10 * 2**30,
|
output_dir=str(out),
|
||||||
)
|
max_bytes=10 * 2**30,
|
||||||
|
)
|
||||||
assert stats["rows"] == 1
|
assert stats["rows"] == 1
|
||||||
|
assert "fail-open" in caplog.text
|
||||||
|
|
|
||||||
|
|
@ -21,6 +21,11 @@ def _make_stub_bq(tables: dict[str, str] | None = None) -> BqAccess:
|
||||||
with a pretend `bq` catalog containing test tables. `tables` maps
|
with a pretend `bq` catalog containing test tables. `tables` maps
|
||||||
DuckDB-three-part references like `'bq.test.orders'` to a SELECT
|
DuckDB-three-part references like `'bq.test.orders'` to a SELECT
|
||||||
expression to seed them with.
|
expression to seed them with.
|
||||||
|
|
||||||
|
A `bigquery_query(project, sql_text)` table macro is registered so the
|
||||||
|
wrapping added by `_wrap_admin_sql_for_jobs_api` (Task 2 — routes COPY
|
||||||
|
through the BQ jobs API for views) resolves against the in-memory tables
|
||||||
|
without needing the real BQ extension.
|
||||||
"""
|
"""
|
||||||
tables = tables or {}
|
tables = tables or {}
|
||||||
|
|
||||||
|
|
@ -34,6 +39,12 @@ def _make_stub_bq(tables: dict[str, str] | None = None) -> BqAccess:
|
||||||
conn.execute(f"CREATE SCHEMA IF NOT EXISTS {s}")
|
conn.execute(f"CREATE SCHEMA IF NOT EXISTS {s}")
|
||||||
for ref, body in tables.items():
|
for ref, body in tables.items():
|
||||||
conn.execute(f"CREATE OR REPLACE TABLE {ref} AS {body}")
|
conn.execute(f"CREATE OR REPLACE TABLE {ref} AS {body}")
|
||||||
|
# Stub bigquery_query() so materialize_query's wrapped COPY works
|
||||||
|
# against the in-memory bq catalog without the real BQ extension.
|
||||||
|
conn.execute(
|
||||||
|
"CREATE OR REPLACE MACRO bigquery_query(project, sql_text) "
|
||||||
|
"AS TABLE SELECT * FROM query(sql_text)"
|
||||||
|
)
|
||||||
yield conn
|
yield conn
|
||||||
finally:
|
finally:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
|
||||||
204
tests/test_bq_materialize_concurrency.py
Normal file
|
|
@ -0,0 +1,204 @@
|
||||||
|
"""Per-table_id concurrency: in-process mutex + advisory file lock with
|
||||||
|
TTL reclaim. Two overlapping materialize_query calls for the same id
|
||||||
|
must NOT corrupt each other's parquet."""
|
||||||
|
from __future__ import annotations
|
||||||
|
import os
|
||||||
|
import threading
|
||||||
|
import time
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import MagicMock, patch
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from connectors.bigquery.extractor import (
|
||||||
|
materialize_query,
|
||||||
|
MaterializeInFlightError,
|
||||||
|
_get_table_lock,
|
||||||
|
_LOCK_TTL_DEFAULT_SECONDS,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture(autouse=True)
|
||||||
|
def reset_locks(monkeypatch):
|
||||||
|
# Tests must not share lock state across runs.
|
||||||
|
import connectors.bigquery.extractor as mod
|
||||||
|
monkeypatch.setattr(mod, "_table_locks", {})
|
||||||
|
yield
|
||||||
|
|
||||||
|
|
||||||
|
def _slow_bq(stall_seconds: float = 1.0):
|
||||||
|
"""Build a fake BqAccess whose duckdb_session COPY blocks for
|
||||||
|
`stall_seconds` so we can race a second call against it."""
|
||||||
|
bq = MagicMock()
|
||||||
|
bq.projects.billing = "prj-billing"
|
||||||
|
bq.projects.data = "prj-data"
|
||||||
|
|
||||||
|
class _Session:
|
||||||
|
def __enter__(self):
|
||||||
|
return self
|
||||||
|
def __exit__(self, *a):
|
||||||
|
return False
|
||||||
|
def execute(self, sql):
|
||||||
|
if sql.startswith("SELECT database_name"):
|
||||||
|
class _R:
|
||||||
|
def fetchall(self):
|
||||||
|
return [("memory",)]
|
||||||
|
return _R()
|
||||||
|
if sql.startswith("ATTACH"):
|
||||||
|
return MagicMock()
|
||||||
|
if sql.startswith("COPY"):
|
||||||
|
# Simulate a long-running COPY by writing a stub parquet
|
||||||
|
# then sleeping so a second call can race us.
|
||||||
|
# Extract the path from the COPY statement.
|
||||||
|
import re
|
||||||
|
m = re.search(r"TO '([^']+)'", sql)
|
||||||
|
assert m
|
||||||
|
Path(m.group(1)).write_bytes(b"PARQUET_STUB_HEADER" + b"\x00" * 200)
|
||||||
|
time.sleep(stall_seconds)
|
||||||
|
return MagicMock()
|
||||||
|
if sql.startswith("SELECT count"):
|
||||||
|
class _R:
|
||||||
|
def fetchone(self):
|
||||||
|
return (42,)
|
||||||
|
return _R()
|
||||||
|
return MagicMock()
|
||||||
|
|
||||||
|
bq.duckdb_session.return_value = _Session()
|
||||||
|
return bq
|
||||||
|
|
||||||
|
|
||||||
|
def test_concurrent_calls_for_same_id_raise_in_flight(tmp_path):
|
||||||
|
bq = _slow_bq(stall_seconds=2.0)
|
||||||
|
|
||||||
|
out_dir = str(tmp_path)
|
||||||
|
captured: list = []
|
||||||
|
|
||||||
|
def runner(tag):
|
||||||
|
try:
|
||||||
|
r = materialize_query(
|
||||||
|
table_id="t1", sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=out_dir, max_bytes=None,
|
||||||
|
)
|
||||||
|
captured.append(("ok", tag, r))
|
||||||
|
except MaterializeInFlightError as e:
|
||||||
|
captured.append(("in_flight", tag, str(e)))
|
||||||
|
except Exception as e:
|
||||||
|
captured.append(("err", tag, str(e)))
|
||||||
|
|
||||||
|
t1 = threading.Thread(target=runner, args=("first",))
|
||||||
|
t2 = threading.Thread(target=runner, args=("second",))
|
||||||
|
t1.start()
|
||||||
|
time.sleep(0.2) # let t1 acquire the lock
|
||||||
|
t2.start()
|
||||||
|
t1.join()
|
||||||
|
t2.join()
|
||||||
|
|
||||||
|
outcomes = [c[0] for c in captured]
|
||||||
|
assert outcomes.count("ok") == 1, f"expected exactly one success, got {captured}"
|
||||||
|
assert outcomes.count("in_flight") == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_sequential_calls_for_same_id_both_succeed(tmp_path):
|
||||||
|
bq = _slow_bq(stall_seconds=0.05)
|
||||||
|
|
||||||
|
out_dir = str(tmp_path)
|
||||||
|
r1 = materialize_query(
|
||||||
|
table_id="t1", sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=out_dir, max_bytes=None,
|
||||||
|
)
|
||||||
|
r2 = materialize_query(
|
||||||
|
table_id="t1", sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=out_dir, max_bytes=None,
|
||||||
|
)
|
||||||
|
assert r1["rows"] == 42
|
||||||
|
assert r2["rows"] == 42
|
||||||
|
|
||||||
|
|
||||||
|
def test_different_ids_run_in_parallel(tmp_path):
|
||||||
|
bq = _slow_bq(stall_seconds=1.0)
|
||||||
|
out_dir = str(tmp_path)
|
||||||
|
captured: list = []
|
||||||
|
|
||||||
|
def runner(tid):
|
||||||
|
try:
|
||||||
|
r = materialize_query(
|
||||||
|
table_id=tid, sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=out_dir, max_bytes=None,
|
||||||
|
)
|
||||||
|
captured.append((tid, r["rows"]))
|
||||||
|
except Exception:
|
||||||
|
captured.append((tid, "ERROR"))
|
||||||
|
|
||||||
|
threads = [threading.Thread(target=runner, args=(f"tab_{i}",)) for i in range(3)]
|
||||||
|
start = time.monotonic()
|
||||||
|
for t in threads: t.start()
|
||||||
|
for t in threads: t.join()
|
||||||
|
elapsed = time.monotonic() - start
|
||||||
|
# If they were serialized, would take >= 3s. Parallel: ~1s.
|
||||||
|
assert elapsed < 2.0, f"expected parallel, elapsed={elapsed:.2f}s"
|
||||||
|
assert len(captured) == 3
|
||||||
|
assert all(c[1] == 42 for c in captured)
|
||||||
|
|
||||||
|
|
||||||
|
def test_stale_file_lock_is_reclaimed_after_ttl(tmp_path, monkeypatch):
|
||||||
|
"""Verify that a stale, unheld .lock file (old mtime, no live flock holder) does
|
||||||
|
NOT raise `MaterializeInFlightError`. The reclaim branch in
|
||||||
|
`_try_acquire_file_lock` is not reached here, because the first
|
||||||
|
`_try_open_and_flock` succeeds when nobody holds the lock. What this test guards
|
||||||
|
against is the mistake of inferring in-flight status from the lock file's mtime alone."""
|
||||||
|
bq = _slow_bq(stall_seconds=0.05)
|
||||||
|
lock_path = Path(tmp_path) / "data" / "t1.parquet.lock"
|
||||||
|
lock_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
lock_path.write_text("")
|
||||||
|
|
||||||
|
# Set mtime to 25h ago (> default 24h TTL).
|
||||||
|
old_ts = time.time() - 25 * 3600
|
||||||
|
os.utime(lock_path, (old_ts, old_ts))
|
||||||
|
|
||||||
|
r = materialize_query(
|
||||||
|
table_id="t1", sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=str(tmp_path), max_bytes=None,
|
||||||
|
)
|
||||||
|
assert r["rows"] == 42
|
||||||
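The stale-lock test above refers to a TTL reclaim branch without showing it. A minimal sketch of one plausible shape for such a branch follows. The names `open_and_flock` and `try_acquire` are hypothetical, and nothing here is taken from `_try_acquire_file_lock` itself; the sketch assumes a POSIX system and ignores races between the `unlink` and the re-open.

```python
import fcntl
import os
import time


def open_and_flock(path):
    """Open (or create) the lock file and try a non-blocking exclusive flock.

    Returns the open file descriptor on success, or None if another holder
    currently owns the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def try_acquire(path, ttl_seconds, now=None):
    """Hypothetical TTL reclaim: a held lock older than the TTL is replaced.

    Returns a file descriptor holding the lock, or None when a holder exists
    and the lock file's mtime is still within the TTL.
    """
    fd = open_and_flock(path)
    if fd is not None:
        return fd  # nobody held it; file age is irrelevant
    if now is None:
        now = time.time()
    age = now - os.stat(path).st_mtime
    if age <= ttl_seconds:
        return None  # live holder within TTL: treat as in flight
    # Assumption: the holder is presumed dead. Unlinking creates a fresh
    # inode on re-open, so the old holder's flock no longer conflicts.
    os.unlink(path)
    return open_and_flock(path)
```

Note that the first `open_and_flock` call decides the common case: an unheld file is acquired regardless of its mtime, which matches what the test above exercises.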
|
|
||||||
|
|
||||||
|
def test_fresh_file_lock_blocks_with_in_flight_error(tmp_path, monkeypatch):
|
||||||
|
"""Force a fresh .lock file (mtime within TTL) and verify a new
|
||||||
|
call raises rather than reclaims."""
|
||||||
|
bq = _slow_bq(stall_seconds=0.05)
|
||||||
|
lock_path = Path(tmp_path) / "data" / "t1.parquet.lock"
|
||||||
|
lock_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
# Open the lock file and HOLD a fcntl exclusive lock so the materialize
|
||||||
|
# call's flock(LOCK_NB) sees a real conflicting lock — relying on
|
||||||
|
# mtime-only would let the test pass even if flock acquisition was
|
||||||
|
# broken.
|
||||||
|
import fcntl
|
||||||
|
holder = open(lock_path, "w")
|
||||||
|
fcntl.flock(holder.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
|
||||||
|
try:
|
||||||
|
with pytest.raises(MaterializeInFlightError):
|
||||||
|
materialize_query(
|
||||||
|
table_id="t1", sql="SELECT 1",
|
||||||
|
bq=bq, output_dir=str(tmp_path), max_bytes=None,
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
fcntl.flock(holder.fileno(), fcntl.LOCK_UN)
|
||||||
|
holder.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_lock_ttl_reads_from_instance_config(tmp_path, monkeypatch):
|
||||||
|
"""When `materialize.lock_ttl_seconds` is set in instance.yaml, that
|
||||||
|
value overrides the default."""
|
||||||
|
# Patches `app.instance_config.get_value` directly. This works because
|
||||||
|
# `_get_lock_ttl_seconds` re-imports `get_value` on every call (see
|
||||||
|
# extractor.py for the deferred-import rationale). If a future change
|
||||||
|
# hoists the import to module-level, this patch must change to target
|
||||||
|
# `connectors.bigquery.extractor.get_value` instead.
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.get_value",
|
||||||
|
lambda *args, **kw: 60 if args == ("materialize", "lock_ttl_seconds") else kw.get("default"),
|
||||||
|
)
|
||||||
|
|
||||||
|
from connectors.bigquery.extractor import _get_lock_ttl_seconds
|
||||||
|
assert _get_lock_ttl_seconds() == 60
|
||||||
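The tests in this file exercise two lock layers: an in-process mutex per `table_id` and an advisory file lock. A minimal sketch of how the two layers might be combined is shown below. `InFlightError` and `table_lock` are hypothetical stand-ins, not the extractor's `MaterializeInFlightError` or its real locking code, and the sketch assumes a POSIX system where `fcntl.flock` is available.

```python
import fcntl
import os
import threading
from contextlib import contextmanager


class InFlightError(RuntimeError):
    """Hypothetical stand-in for MaterializeInFlightError."""


_locks: dict = {}                 # table_id -> threading.Lock
_locks_guard = threading.Lock()   # protects the dict itself


def _lock_for(table_id):
    with _locks_guard:
        return _locks.setdefault(table_id, threading.Lock())


@contextmanager
def table_lock(table_id, lock_path):
    """Hold both layers for the duration of the block, failing fast."""
    mutex = _lock_for(table_id)
    # Layer 1: threads in this process. Non-blocking, so a second caller
    # is rejected immediately instead of queueing behind the first.
    if not mutex.acquire(blocking=False):
        raise InFlightError(f"{table_id}: in flight (process)")
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            # Layer 2: other processes, via an advisory flock.
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                raise InFlightError(f"{table_id}: in flight (file)") from None
            try:
                yield
            finally:
                fcntl.flock(fd, fcntl.LOCK_UN)
        finally:
            os.close(fd)
    finally:
        mutex.release()
```

Because the mutex is keyed by `table_id`, calls for different ids do not contend, which is the behavior `test_different_ids_run_in_parallel` checks.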
54
tests/test_bq_materialize_query_wrapping.py
Normal file
|
|
@ -0,0 +1,54 @@
|
||||||
|
"""materialize_query must always wrap admin source_query in
|
||||||
|
bigquery_query('<billing>', '<admin>') so the COPY uses BQ jobs API,
|
||||||
|
which works for base tables AND views — Storage Read API does not."""
|
||||||
|
from __future__ import annotations
|
||||||
|
from pathlib import Path
|
||||||
|
from unittest.mock import MagicMock, patch
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from connectors.bigquery.extractor import (
|
||||||
|
_wrap_admin_sql_for_jobs_api,
|
||||||
|
_escape_sql_string_literal,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_wrap_simple_select():
|
||||||
|
out = _wrap_admin_sql_for_jobs_api(
|
||||||
|
billing_project="prj-billing",
|
||||||
|
inner_sql="SELECT * FROM `ds.tbl`",
|
||||||
|
)
|
||||||
|
assert out == (
|
||||||
|
"SELECT * FROM bigquery_query('prj-billing', "
|
||||||
|
"'SELECT * FROM `ds.tbl`')"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_escape_single_quotes_in_inner_sql():
|
||||||
|
inner = "SELECT name FROM `ds.tbl` WHERE country = 'CZ'"
|
||||||
|
escaped = _escape_sql_string_literal(inner)
|
||||||
|
assert escaped == "SELECT name FROM `ds.tbl` WHERE country = ''CZ''"
|
||||||
|
|
||||||
|
|
||||||
|
def test_wrap_with_inner_quotes_round_trips():
|
||||||
|
inner = "SELECT * FROM `ds.tbl` WHERE col = 'foo''bar'"
|
||||||
|
out = _wrap_admin_sql_for_jobs_api("myproject", inner)
|
||||||
|
# Outer string-literal envelope must double the inner single quotes
|
||||||
|
# so DuckDB's parser sees a balanced literal.
|
||||||
|
assert out.count("'") % 2 == 0
|
||||||
|
# Round-trip: stripping the wrapper gives back the original inner exactly.
|
||||||
|
prefix = "SELECT * FROM bigquery_query('myproject', '"
|
||||||
|
assert out.startswith(prefix)
|
||||||
|
assert out.endswith("')")
|
||||||
|
middle = out[len(prefix):-2]
|
||||||
|
# DuckDB string literal escape: '' → '. Reverse it.
|
||||||
|
decoded = middle.replace("''", "'")
|
||||||
|
assert decoded == inner
|
||||||
|
|
||||||
|
|
||||||
|
def test_billing_project_validates_format():
|
||||||
|
with pytest.raises(ValueError, match="billing_project"):
|
||||||
|
_wrap_admin_sql_for_jobs_api(
|
||||||
|
billing_project="bad project'; DROP",
|
||||||
|
inner_sql="SELECT 1",
|
||||||
|
)
|
||||||
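The four tests above pin the contract of the wrapping helper: single quotes are doubled, the result is a `bigquery_query('<billing>', '<sql>')` call, and a malformed billing project raises `ValueError`. A minimal sketch that satisfies those assertions is given below. The function names and the project-id pattern are assumptions for illustration; the real `_wrap_admin_sql_for_jobs_api` may validate differently.

```python
import re

# Assumption: a conservative pattern for standard GCP project ids
# (lowercase letters, digits, hyphens; 6 to 30 characters).
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def escape_sql_string_literal(text):
    """Double every single quote so the text is safe inside '...'."""
    return text.replace("'", "''")


def wrap_for_jobs_api(billing_project, inner_sql):
    """Wrap inner_sql in a bigquery_query() table-function call."""
    if not _PROJECT_ID_RE.match(billing_project):
        # The project id is interpolated into SQL, so reject anything odd.
        raise ValueError(
            f"billing_project has an unexpected format: {billing_project!r}"
        )
    return (
        f"SELECT * FROM bigquery_query('{billing_project}', "
        f"'{escape_sql_string_literal(inner_sql)}')"
    )
```

Validating the project id up front matters because it is interpolated into the outer statement, where quote doubling alone would not protect against a hostile value.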
|
|
@ -13,8 +13,9 @@ import duckdb
|
||||||
from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version
|
from src.db import SCHEMA_VERSION, _ensure_schema, get_schema_version
|
||||||
|
|
||||||
|
|
||||||
def test_schema_version_is_23():
|
def test_schema_version_is_24():
|
||||||
assert SCHEMA_VERSION == 23
|
# bumped from 23→24 for the materialized BQ source_query rewrite migration
|
||||||
|
assert SCHEMA_VERSION == 24
|
||||||
|
|
||||||
|
|
||||||
def test_v20_adds_source_query(tmp_path):
|
def test_v20_adds_source_query(tmp_path):
|
||||||
|
|
@ -29,7 +30,7 @@ def test_v20_adds_source_query(tmp_path):
|
||||||
).fetchall()
|
).fetchall()
|
||||||
}
|
}
|
||||||
assert "source_query" in cols, f"source_query missing from {cols}"
|
assert "source_query" in cols, f"source_query missing from {cols}"
|
||||||
assert get_schema_version(conn) == 23
|
assert get_schema_version(conn) == SCHEMA_VERSION
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
|
@ -83,7 +84,7 @@ def test_v19_db_migrates_to_v20(tmp_path):
|
||||||
|
|
||||||
_ensure_schema(conn)
|
_ensure_schema(conn)
|
||||||
|
|
||||||
assert get_schema_version(conn) == 23
|
assert get_schema_version(conn) == SCHEMA_VERSION # bumped 23→24
|
||||||
cols = {
|
cols = {
|
||||||
r[0] for r in conn.execute(
|
r[0] for r in conn.execute(
|
||||||
"SELECT column_name FROM information_schema.columns "
|
"SELECT column_name FROM information_schema.columns "
|
||||||
|
|
|
||||||
|
|
@ -75,7 +75,13 @@ def stub_bq_extractor(monkeypatch):
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def stub_bq():
|
def stub_bq():
|
||||||
"""Real-shape BqAccess wired to in-memory DuckDB factories so the
|
"""Real-shape BqAccess wired to in-memory DuckDB factories so the
|
||||||
materialize_query path can run end-to-end without GCP."""
|
materialize_query path can run end-to-end without GCP.
|
||||||
|
|
||||||
|
A `bigquery_query(project, sql_text)` table macro is registered so the
|
||||||
|
wrapping added by `_wrap_admin_sql_for_jobs_api` (Task 2 — routes COPY
|
||||||
|
through the BQ jobs API for views) resolves against the in-memory tables
|
||||||
|
without needing the real BQ extension.
|
||||||
|
"""
|
||||||
@contextmanager
|
@contextmanager
|
||||||
def _session(_p):
|
def _session(_p):
|
||||||
conn = duckdb.connect(":memory:")
|
conn = duckdb.connect(":memory:")
|
||||||
|
|
@ -87,6 +93,12 @@ def stub_bq():
|
||||||
"SELECT 'EU' AS region, 100 AS revenue UNION ALL "
|
"SELECT 'EU' AS region, 100 AS revenue UNION ALL "
|
||||||
"SELECT 'US' AS region, 250 AS revenue"
|
"SELECT 'US' AS region, 250 AS revenue"
|
||||||
)
|
)
|
||||||
|
# Stub bigquery_query() so materialize_query's wrapped COPY works
|
||||||
|
# against the in-memory bq catalog without the real BQ extension.
|
||||||
|
conn.execute(
|
||||||
|
"CREATE OR REPLACE MACRO bigquery_query(project, sql_text) "
|
||||||
|
"AS TABLE SELECT * FROM query(sql_text)"
|
||||||
|
)
|
||||||
yield conn
|
yield conn
|
||||||
finally:
|
finally:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
@ -265,12 +277,18 @@ def test_materialized_zero_rows_logs_warning(stub_bq, tmp_path, caplog):
|
||||||
conn.execute("CREATE SCHEMA bq.test")
|
conn.execute("CREATE SCHEMA bq.test")
|
||||||
conn.execute("CREATE OR REPLACE TABLE bq.test.empty AS "
|
conn.execute("CREATE OR REPLACE TABLE bq.test.empty AS "
|
||||||
"SELECT 1 AS n WHERE FALSE")
|
"SELECT 1 AS n WHERE FALSE")
|
||||||
|
# Stub bigquery_query() so materialize_query's wrapped COPY works
|
||||||
|
# against the in-memory bq catalog without the real BQ extension.
|
||||||
|
conn.execute(
|
||||||
|
"CREATE OR REPLACE MACRO bigquery_query(project, sql_text) "
|
||||||
|
"AS TABLE SELECT * FROM query(sql_text)"
|
||||||
|
)
|
||||||
yield conn
|
yield conn
|
||||||
finally:
|
finally:
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
||||||
bq_empty = BqAccess(
|
bq_empty = BqAccess(
|
||||||
BqProjects(billing="t", data="t"),
|
BqProjects(billing="test-project", data="test-project"),
|
||||||
client_factory=lambda _p: MagicMock(),
|
client_factory=lambda _p: MagicMock(),
|
||||||
duckdb_session_factory=_session_empty,
|
duckdb_session_factory=_session_empty,
|
||||||
)
|
)
|
||||||
|
|
@ -323,7 +341,7 @@ def test_attach_real_error_propagates(stub_bq, tmp_path):
|
||||||
conn.close()
|
conn.close()
|
||||||
|
|
||||||
bq_bad = BqAccess(
|
bq_bad = BqAccess(
|
||||||
BqProjects(billing="t", data="t"),
|
BqProjects(billing="test-project", data="test-project"),
|
||||||
client_factory=lambda _p: MagicMock(),
|
client_factory=lambda _p: MagicMock(),
|
||||||
duckdb_session_factory=_session_attach_fails,
|
duckdb_session_factory=_session_attach_fails,
|
||||||
)
|
)
|
||||||
|
|
|
||||||
66
tests/test_run_materialized_pass_in_flight_skip.py
Normal file
|
|
@ -0,0 +1,66 @@
|
||||||
|
"""When materialize_query raises MaterializeInFlightError, _run_materialized_pass
|
||||||
|
must record it as a 'skipped, in_flight' outcome and NOT call state.set_error
|
||||||
|
(otherwise sync_state surfaces a false-positive 'failure' for a healthy
|
||||||
|
in-progress run)."""
|
||||||
|
from __future__ import annotations
|
||||||
|
from unittest.mock import MagicMock, patch
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from app.api.sync import _run_materialized_pass
|
||||||
|
from connectors.bigquery.extractor import MaterializeInFlightError
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def fake_registry_with_one_materialized(monkeypatch, tmp_path):
|
||||||
|
monkeypatch.setenv("DATA_DIR", str(tmp_path))
|
||||||
|
rows = [{
|
||||||
|
"id": "in_flight_t",
|
||||||
|
"name": "in_flight_t",
|
||||||
|
"query_mode": "materialized",
|
||||||
|
"source_type": "bigquery",
|
||||||
|
"source_query": "SELECT * FROM `ds.t`",
|
||||||
|
"sync_schedule": None,
|
||||||
|
}]
|
||||||
|
|
||||||
|
class _Repo:
|
||||||
|
def __init__(self, conn): pass
|
||||||
|
def list_all(self): return rows
|
||||||
|
|
||||||
|
class _State:
|
||||||
|
def __init__(self, conn):
|
||||||
|
self.set_error_calls = []
|
||||||
|
self.update_sync_calls = []
|
||||||
|
def get_last_sync(self, _id): return None
|
||||||
|
def set_error(self, table_id, msg): self.set_error_calls.append((table_id, msg))
|
||||||
|
def update_sync(self, **kw): self.update_sync_calls.append(kw)
|
||||||
|
|
||||||
|
state = _State(None)
|
||||||
|
monkeypatch.setattr("app.api.sync.TableRegistryRepository", _Repo)
|
||||||
|
monkeypatch.setattr("app.api.sync.SyncStateRepository", lambda c: state)
|
||||||
|
return state
|
||||||
|
|
||||||
|
|
||||||
|
def test_in_flight_recorded_as_skipped_not_error(fake_registry_with_one_materialized):
|
||||||
|
state = fake_registry_with_one_materialized
|
||||||
|
|
||||||
|
with patch(
|
||||||
|
"app.api.sync._materialize_table",
|
||||||
|
side_effect=MaterializeInFlightError("in_flight_t", layer="process"),
|
||||||
|
):
|
||||||
|
summary = _run_materialized_pass(MagicMock(), MagicMock())
|
||||||
|
|
||||||
|
assert summary["materialized"] == []
|
||||||
|
assert summary["errors"] == []
|
||||||
|
assert len(summary["skipped"]) == 1
|
||||||
|
skipped = summary["skipped"][0]
|
||||||
|
assert skipped == {"table": "in_flight_t", "reason": "in_flight"}
|
||||||
|
assert state.set_error_calls == []
|
||||||
|
assert state.update_sync_calls == []
|
||||||
|
|
||||||
|
|
||||||
|
def test_due_check_skipped_uses_due_check_reason(fake_registry_with_one_materialized, monkeypatch):
|
||||||
|
monkeypatch.setattr("app.api.sync.is_table_due", lambda *a, **k: False)
|
||||||
|
|
||||||
|
summary = _run_materialized_pass(MagicMock(), MagicMock())
|
||||||
|
assert summary["skipped"] == [{"table": "in_flight_t", "reason": "due_check"}]
|
||||||
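The two tests above describe how a pass should classify each table: not due becomes a skip with reason `due_check`, an in-flight run becomes a skip with reason `in_flight`, and only real failures land in `errors`. A minimal sketch of that classification is shown below. `run_pass`, `InFlightError`, and the shape of the `errors` entries are hypothetical; this is not the body of `_run_materialized_pass`.

```python
class InFlightError(RuntimeError):
    """Hypothetical stand-in for MaterializeInFlightError."""


def run_pass(table_ids, is_due, materialize):
    """Classify each table as materialized, skipped (with reason), or error."""
    summary = {"materialized": [], "skipped": [], "errors": []}
    for table_id in table_ids:
        if not is_due(table_id):
            summary["skipped"].append({"table": table_id, "reason": "due_check"})
            continue
        try:
            materialize(table_id)
        except InFlightError:
            # Another run owns this table; that is healthy, not a failure,
            # so no error state is recorded for it.
            summary["skipped"].append({"table": table_id, "reason": "in_flight"})
        except Exception as exc:
            # Assumption: errors are recorded as table/error dicts.
            summary["errors"].append({"table": table_id, "error": str(exc)})
        else:
            summary["materialized"].append(table_id)
    return summary
```

Catching the in-flight exception before the generic handler is what keeps a concurrent run from being reported as a failure.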
159
tests/test_schema_v24_source_query_rewrite.py
Normal file
|
|
@ -0,0 +1,159 @@
|
||||||
|
"""v24: rewrites table_registry.source_query for materialized BQ rows
|
||||||
|
from DuckDB-flavor (bq.\"ds\".\"tbl\") to BQ-native (`<project>.ds.tbl`).
|
||||||
|
The wrapping path (connectors.bigquery.extractor.materialize_query) only
|
||||||
|
accepts BQ-native; pre-v24 rows would fail at materialize time without
|
||||||
|
this conversion."""
|
||||||
|
from __future__ import annotations
|
||||||
|
import os
|
||||||
|
import tempfile
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import duckdb
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
from src.db import _ensure_schema, get_schema_version, SCHEMA_VERSION
|
||||||
|
|
||||||
|
|
||||||
|
def _seed_v23(conn, project_id: str = "prj-data"):
|
||||||
|
conn.execute(
|
||||||
|
"CREATE TABLE schema_version (version INTEGER, applied_at TIMESTAMP DEFAULT current_timestamp)"
|
||||||
|
)
|
||||||
|
conn.execute("INSERT INTO schema_version (version) VALUES (23)")
|
||||||
|
conn.execute(
|
||||||
|
"CREATE TABLE table_registry ("
|
||||||
|
"id VARCHAR PRIMARY KEY, name VARCHAR, source_type VARCHAR, "
|
||||||
|
"query_mode VARCHAR, bucket VARCHAR, source_table VARCHAR, source_query VARCHAR)"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def test_v24_rewrites_duckdb_flavor_to_bq_native(monkeypatch):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
monkeypatch.setenv("DATA_DIR", tmp)
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.get_value",
|
||||||
|
lambda *args, **kw: "prj-data" if args == ("data_source", "bigquery", "project") else kw.get("default"),
|
||||||
|
)
|
||||||
|
Path(tmp, "state").mkdir(parents=True, exist_ok=True)
|
||||||
|
db_path = Path(tmp, "state", "system.duckdb")
|
||||||
|
conn = duckdb.connect(str(db_path))
|
||||||
|
try:
|
||||||
|
_seed_v23(conn)
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["t1", "t1", "bigquery", "materialized", "ds", "tbl",
|
||||||
|
'SELECT * FROM bq."ds"."tbl"'],
|
||||||
|
)
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["t2", "t2", "bigquery", "materialized", "analytics", "orders",
|
||||||
|
'SELECT col1 FROM bq."analytics"."orders" WHERE col2 > 10'],
|
||||||
|
)
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["r1", "r1", "bigquery", "remote", "ds", "tbl", None],
|
||||||
|
)
|
||||||
|
|
||||||
|
_ensure_schema(conn)
|
||||||
|
assert get_schema_version(conn) == SCHEMA_VERSION
|
||||||
|
assert SCHEMA_VERSION >= 24
|
||||||
|
|
||||||
|
rows = {r[0]: r[1] for r in conn.execute(
|
||||||
|
"SELECT id, source_query FROM table_registry"
|
||||||
|
).fetchall()}
|
||||||
|
assert rows["t1"] == "SELECT * FROM `prj-data.ds.tbl`"
|
||||||
|
assert rows["t2"] == (
|
||||||
|
"SELECT col1 FROM `prj-data.analytics.orders` WHERE col2 > 10"
|
||||||
|
)
|
||||||
|
assert rows["r1"] is None # remote row untouched
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_v24_idempotent_when_already_bq_native(monkeypatch):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
monkeypatch.setenv("DATA_DIR", tmp)
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.get_value",
|
||||||
|
lambda *args, **kw: "prj-data" if args == ("data_source", "bigquery", "project") else kw.get("default"),
|
||||||
|
)
|
||||||
|
Path(tmp, "state").mkdir(parents=True, exist_ok=True)
|
||||||
|
db_path = Path(tmp, "state", "system.duckdb")
|
||||||
|
conn = duckdb.connect(str(db_path))
|
||||||
|
try:
|
||||||
|
_seed_v23(conn)
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["t1", "t1", "bigquery", "materialized", "ds", "tbl",
|
||||||
|
"SELECT * FROM `prj-data.ds.tbl`"],
|
||||||
|
)
|
||||||
|
_ensure_schema(conn)
|
||||||
|
row = conn.execute(
|
||||||
|
"SELECT source_query FROM table_registry WHERE id='t1'"
|
||||||
|
).fetchone()
|
||||||
|
assert row[0] == "SELECT * FROM `prj-data.ds.tbl`"
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_v24_logs_warning_when_project_not_configured(monkeypatch, caplog):
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
monkeypatch.setenv("DATA_DIR", tmp)
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.get_value",
|
||||||
|
lambda *args, **kw: kw.get("default", ""), # no project configured
|
||||||
|
)
|
||||||
|
Path(tmp, "state").mkdir(parents=True, exist_ok=True)
|
||||||
|
db_path = Path(tmp, "state", "system.duckdb")
|
||||||
|
conn = duckdb.connect(str(db_path))
|
||||||
|
try:
|
||||||
|
_seed_v23(conn)
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["t1", "t1", "bigquery", "materialized", "ds", "tbl",
|
||||||
|
'SELECT * FROM bq."ds"."tbl"'],
|
||||||
|
)
|
||||||
|
with caplog.at_level("WARNING"):
|
||||||
|
_ensure_schema(conn)
|
||||||
|
row = conn.execute(
|
||||||
|
"SELECT source_query FROM table_registry WHERE id='t1'"
|
||||||
|
).fetchone()
|
||||||
|
assert row[0] == 'SELECT * FROM bq."ds"."tbl"'
|
||||||
|
assert any(
|
||||||
|
"v24" in r.message.lower() or "project" in r.message.lower()
|
||||||
|
for r in caplog.records
|
||||||
|
)
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
|
|
||||||
|
|
||||||
|
def test_v24_keboola_materialized_row_not_rewritten(monkeypatch):
|
||||||
|
"""Materialized rows with source_type != 'bigquery' must not be touched
|
||||||
|
by v24. Keboola materialized rows have no notion of bq."ds"."tbl" syntax;
|
||||||
|
the source_type filter in the migration's SELECT enforces this contract.
|
||||||
|
"""
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
monkeypatch.setenv("DATA_DIR", tmp)
|
||||||
|
monkeypatch.setattr(
|
||||||
|
"app.instance_config.get_value",
|
||||||
|
lambda *args, **kw: "prj-data" if args == ("data_source", "bigquery", "project") else kw.get("default"),
|
||||||
|
)
|
||||||
|
Path(tmp, "state").mkdir(parents=True, exist_ok=True)
|
||||||
|
db_path = Path(tmp, "state", "system.duckdb")
|
||||||
|
conn = duckdb.connect(str(db_path))
|
||||||
|
try:
|
||||||
|
_seed_v23(conn)
|
||||||
|
# Keboola row that happens to contain `bq."..."` in its SQL
|
||||||
|
# (admin error or copy-paste from a BQ row). Migration must
|
||||||
|
# leave it alone — this is not the v24 contract.
|
||||||
|
conn.execute(
|
||||||
|
'INSERT INTO table_registry VALUES (?, ?, ?, ?, ?, ?, ?)',
|
||||||
|
["kb1", "kb1", "keboola", "materialized", "ds", "tbl",
|
||||||
|
'SELECT * FROM bq."ds"."tbl"'],
|
||||||
|
)
|
||||||
|
_ensure_schema(conn)
|
||||||
|
row = conn.execute(
|
||||||
|
"SELECT source_query FROM table_registry WHERE id='kb1'"
|
||||||
|
).fetchone()
|
||||||
|
assert row[0] == 'SELECT * FROM bq."ds"."tbl"'
|
||||||
|
finally:
|
||||||
|
conn.close()
|
||||||
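The migration tests above fix the expected rewrite: `bq."ds"."tbl"` becomes `` `<project>.ds.tbl` ``, already-native queries are unchanged, and rows are left alone when no project is configured. A minimal sketch of such a rewrite using a regular expression follows. `rewrite_source_query` and its pattern are assumptions for illustration, not the code of `_v23_to_v24_finalize`; the warning log and the `source_type` filter that the tests also cover are omitted.

```python
import re

# Matches DuckDB-flavor three-part references such as bq."ds"."tbl".
_DUCKDB_REF_RE = re.compile(r'\bbq\."([^"]+)"\."([^"]+)"')


def rewrite_source_query(sql, project):
    """Rewrite bq."ds"."tbl" references to BQ-native `project.ds.tbl`.

    Returns the input unchanged when there is no SQL or no project, and
    is a no-op on queries that are already BQ-native.
    """
    if sql is None or not project:
        return sql
    # A function replacement avoids backslash handling in the project id.
    return _DUCKDB_REF_RE.sub(
        lambda m: f"`{project}.{m.group(1)}.{m.group(2)}`", sql
    )
```

Because the pattern only matches the quoted DuckDB form, running the rewrite twice gives the same result as running it once.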
|
|
@ -102,7 +102,8 @@ def test_materialized_pass_skips_undue_rows(system_db, stub_bq):
|
||||||
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
|
summary = sync_mod._run_materialized_pass(system_db, stub_bq)
|
||||||
|
|
||||||
mock_mat.assert_not_called()
|
mock_mat.assert_not_called()
|
||||||
assert "orders_daily" in summary["skipped"]
|
# summary["skipped"] is now list[dict]; see PR #174 (zs/materialize-sync-fix)
|
||||||
|
assert {"table": "orders_daily", "reason": "due_check"} in summary["skipped"]
|
||||||
|
|
||||||
|
|
||||||
def test_materialized_pass_skips_non_materialized_rows(system_db, stub_bq):
|
def test_materialized_pass_skips_non_materialized_rows(system_db, stub_bq):
|
||||||
|
|
|
||||||