fix(admin): register-table real-world UX gaps for materialized BQ

Three items from operator feedback after running the actual flow:

(1) Help docstring lied: "--bucket / --source-table ignored" for
materialized rows. Reality: --bucket is load-bearing because
`agnes schema <name>` builds the BQ identifier as
`bq.<bucket>.<source_table>`. An empty bucket registered the row but
broke schema/describe with HTTP 400 "unsafe BQ identifier in
registry". Fix: docstring rewritten to reflect reality, plus
client-side validation rejects materialized + empty bucket with a
clear error pointing at the right knob.

(2) Post-register UX cliff: `agnes pull` after register-table reports
"Updated 0 tables (1 total)" because registration adds a registry
row but does NOT trigger a parquet build. Operators routinely
assume something's broken when they need to run
`agnes setup first-sync` to kick off the materialization. Hint
emitted on success now points at first-sync.

(3) RBAC gotcha: `agnes catalog` is RBAC-filtered via
`resource_grants`, so non-admin users don't see freshly-registered
rows until a grant is created. Hint emitted on success now points at
`agnes admin grant create <group> table <name>`.

Tests: 8/8 in test_cli_admin_materialized.py, including two new
regression tests for the validation + the hint output.
This commit is contained in:
ZdenekSrotyr 2026-05-04 23:06:17 +02:00
parent 5915f92eaa
commit 36012e0833
3 changed files with 126 additions and 3 deletions

View file

@ -39,6 +39,10 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup` page now produces a
- `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`. - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
- Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`). - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
- `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces. - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
- **`agnes admin register-table` UX**: three real-world feedback items addressed.
- **`--query-mode materialized` now requires `--bucket`** (client-side validation; exits with a clear error before hitting the server). The previous help docstring claimed `--bucket` was *ignored* for materialized rows, but the value is actually load-bearing — `agnes schema <name>` builds the BQ identifier as `bq.<bucket>.<source_table>`, so an empty bucket registered the row but broke subsequent schema/describe with HTTP 400 "unsafe BQ identifier in registry". Docstring rewritten to reflect reality.
- **Post-success hints**: after a successful registration the CLI now points operators at the two follow-ups they routinely miss: (a) `agnes setup first-sync` to materialize the parquet (registration alone doesn't trigger a build; `agnes pull` reports "Updated 0 tables" until the scheduler tick), and (b) `agnes admin grant create <group> table <name>` to make the row visible in `agnes catalog` for non-admin users (catalog is RBAC-filtered).
- Test coverage: `tests/test_cli_admin_materialized.py::test_register_materialized_without_bucket_fails_with_clear_error` and `test_register_table_emits_first_sync_and_grant_hints`.
- **`agnes query --remote` SQL rewriter no longer corrupts output when the GCP project ID contains a registered table name as a hyphen-delimited word** (Devin Review on `query.py:464`). The previous iterative rewrite (one `re.sub(\b<name>\b, ...)` per registered name) was vulnerable to cross-contamination: e.g. project `my-ue-project` + registered `orders` + registered `ue` → iter 1 rewrites `orders` to `\`my-ue-project.fin.orders\``, iter 2's `\bue\b` then matches the `ue` INSIDE `my-ue-project` and corrupts the iter-1 path. Fix: replaced the iteration with a SINGLE `re.sub` whose alternation regex (sorted longest-first) handles every name in one pass, so freshly-inserted backticked text isn't re-scanned. The fallback at `query.py:576` (per-table SELECT * on BQ parse error) caught the corrupted output as `bq_bad_request` so impact was over-estimation rather than fail-open, but the partition-pruning benefit of #171 is now preserved for projects whose IDs share a hyphen-segment with a registered table name. Regression test in `tests/test_api_query_guardrail.py::test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name`. - **`agnes query --remote` SQL rewriter no longer corrupts output when the GCP project ID contains a registered table name as a hyphen-delimited word** (Devin Review on `query.py:464`). The previous iterative rewrite (one `re.sub(\b<name>\b, ...)` per registered name) was vulnerable to cross-contamination: e.g. project `my-ue-project` + registered `orders` + registered `ue` → iter 1 rewrites `orders` to `\`my-ue-project.fin.orders\``, iter 2's `\bue\b` then matches the `ue` INSIDE `my-ue-project` and corrupts the iter-1 path. Fix: replaced the iteration with a SINGLE `re.sub` whose alternation regex (sorted longest-first) handles every name in one pass, so freshly-inserted backticked text isn't re-scanned. The fallback at `query.py:576` (per-table SELECT * on BQ parse error) caught the corrupted output as `bq_bad_request` so impact was over-estimation rather than fail-open, but the partition-pruning benefit of #171 is now preserved for projects whose IDs share a hyphen-segment with a registered table name. Regression test in `tests/test_api_query_guardrail.py::test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name`.
- **BigQuery materialize TTL reclaim is no longer dead code** (Devin Review on `extractor.py:166`). `_try_acquire_file_lock` used to call `open(lock_path, mode="w")` BEFORE checking the lock-file mtime, which truncated the file and refreshed mtime to *now* on every invocation. The subsequent `time.time() - lock_path.stat().st_mtime` always saw age ~0, so `age > TTL` never fired, and `materialize.lock_ttl_seconds` was a silently no-op config knob. Fix: stat the lock path BEFORE any `open()` to read the real pre-probe mtime; if older than TTL, unlink (forcing a fresh inode for the next `open + flock`); only then probe. Two regression tests added: `test_stale_held_lock_is_reclaimed_despite_live_holder` exercises the full reclaim path with a still-living fcntl holder, `test_failed_probe_does_not_self_refresh_lock_mtime` pins that a failed acquisition doesn't pathologically loop. Residual cross-process risk (a genuinely overrunning materialize past TTL races a fresh attempt) is documented in the helper docstring; in-process `threading.Lock` keyed on `table_id` blocks the single-process race. - **BigQuery materialize TTL reclaim is no longer dead code** (Devin Review on `extractor.py:166`). `_try_acquire_file_lock` used to call `open(lock_path, mode="w")` BEFORE checking the lock-file mtime, which truncated the file and refreshed mtime to *now* on every invocation. The subsequent `time.time() - lock_path.stat().st_mtime` always saw age ~0, so `age > TTL` never fired, and `materialize.lock_ttl_seconds` was a silently no-op config knob. Fix: stat the lock path BEFORE any `open()` to read the real pre-probe mtime; if older than TTL, unlink (forcing a fresh inode for the next `open + flock`); only then probe. Two regression tests added: `test_stale_held_lock_is_reclaimed_despite_live_holder` exercises the full reclaim path with a still-living fcntl holder, `test_failed_probe_does_not_self_refresh_lock_mtime` pins that a failed acquisition doesn't pathologically loop. Residual cross-process risk (a genuinely overrunning materialize past TTL races a fresh attempt) is documented in the helper docstring; in-process `threading.Lock` keyed on `table_id` blocks the single-process race.
- **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit). - **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).

View file

@ -92,12 +92,18 @@ def register_table(
"""Register a single table. """Register a single table.
Modes: Modes:
- **local** (Keboola): batch pull, parquet on disk. - **local** (Keboola): batch pull, parquet on disk. Requires
`--bucket` + `--source-table`.
- **remote** (BigQuery): view only, queries go to BQ. Requires - **remote** (BigQuery): view only, queries go to BQ. Requires
`--bucket` + `--source-table`. `--bucket` + `--source-table`.
- **materialized** (BigQuery): server-side scheduled SQL parquet. - **materialized** (BigQuery): server-side scheduled SQL parquet.
Requires `--query` (inline or `@file.sql`); `--bucket` / Requires `--query` (inline or `@file.sql`) AND `--bucket` (BQ
`--source-table` ignored. dataset of the destination identifier). `--source-table` defaults
to the registered `name` when omitted; explicit override is rare.
Note: `agnes schema <name>` builds the BQ identifier as
`bq.<bucket>.<source_table>` even for materialized rows, so an
empty `--bucket` here registers the row but breaks subsequent
schema/describe calls.
`--dry-run` goes through /precheck (BQ remote only for materialized `--dry-run` goes through /precheck (BQ remote only for materialized
rows, dry-run is a no-op since the SQL itself is the contract). rows, dry-run is a no-op since the SQL itself is the contract).
@ -123,6 +129,22 @@ def register_table(
) )
raise typer.Exit(2) raise typer.Exit(2)
# Bucket is load-bearing for the BQ destination identifier on
# materialized rows. Without it, registration succeeds but
# subsequent `agnes schema <name>` builds `bq."".."<src>"` from
# the empty bucket and the server rejects with HTTP 400 "unsafe
# BQ identifier in registry". Catch this at register time so the
# operator gets a clear error pointing at the right knob.
if query_mode == "materialized" and not bucket:
typer.echo(
"Error: --query-mode materialized requires --bucket (the BQ "
"dataset for the destination identifier). Without it the row "
"registers but `agnes schema <name>` later fails with "
"'unsafe BQ identifier in registry'.",
err=True,
)
raise typer.Exit(2)
payload = { payload = {
"name": name, "name": name,
"source_type": source_type, "source_type": source_type,
@ -176,6 +198,33 @@ def register_table(
typer.echo(f"Registered (materializing in background): {name}") typer.echo(f"Registered (materializing in background): {name}")
else: else:
typer.echo(f"Registered: {name}") typer.echo(f"Registered: {name}")
# Post-success hints. Two operator gotchas this catches:
#
# 1. `agnes pull` does not auto-materialize newly-registered
# rows — registration adds a registry row, but the parquet
# is built only when the scheduler tick runs (or first-sync
# is triggered manually). Without this hint operators see
# "Updated 0 tables" on `agnes pull` and assume something
# is broken.
# 2. `register-table` does NOT auto-grant. `agnes catalog`
# filters per-user via `resource_grants`, so operators
# other than the registering admin won't see the new row
# until a grant is created.
#
# Hint #1 only fires for `local` and `materialized` (the modes
# that actually produce a parquet); 202-async path covers a
# different signal, so don't double-message there.
if query_mode in ("local", "materialized") and resp.status_code != 202:
typer.echo(
" Next: run `agnes setup first-sync` to materialize "
"the parquet (or wait for the scheduler tick)."
)
typer.echo(
f" Note: register-table does not auto-grant. Run "
f"`agnes admin grant create <group> table {name}` to "
f"make this visible in `agnes catalog` for non-admin users."
)
elif resp.status_code == 409: elif resp.status_code == 409:
typer.echo(f"Already exists: {name}") typer.echo(f"Already exists: {name}")
else: else:

View file

@ -29,6 +29,7 @@ def test_register_materialized_with_inline_query(monkeypatch):
"admin", "register-table", "orders_90d", "admin", "register-table", "orders_90d",
"--source-type", "bigquery", "--source-type", "bigquery",
"--query-mode", "materialized", "--query-mode", "materialized",
"--bucket", "fin",
"--query", "SELECT date FROM `prj.ds.orders`", "--query", "SELECT date FROM `prj.ds.orders`",
"--sync-schedule", "every 6h", "--sync-schedule", "every 6h",
]) ])
@ -59,6 +60,7 @@ def test_register_materialized_reads_query_from_file(tmp_path, monkeypatch):
"admin", "register-table", "orders_90d", "admin", "register-table", "orders_90d",
"--source-type", "bigquery", "--source-type", "bigquery",
"--query-mode", "materialized", "--query-mode", "materialized",
"--bucket", "fin",
"--query", f"@{sql_file}", "--query", f"@{sql_file}",
"--sync-schedule", "daily 03:00", "--sync-schedule", "daily 03:00",
]) ])
@ -155,3 +157,71 @@ def test_register_remote_path_unchanged(monkeypatch):
assert "source_query" not in captured["json"] assert "source_query" not in captured["json"]
assert captured["json"]["bucket"] == "analytics" assert captured["json"]["bucket"] == "analytics"
assert captured["json"]["source_table"] == "orders" assert captured["json"]["source_table"] == "orders"
def test_register_materialized_without_bucket_fails_with_clear_error(monkeypatch):
"""`--query-mode materialized` without `--bucket` is a client-side
error. Pre-fix the CLI sent `bucket=""` to the server; registration
succeeded but `agnes schema <name>` later 400ed with "unsafe BQ
identifier in registry" because the schema endpoint built
`bq.\"\".\"<src>\"` from the empty bucket. Catching this at register
time gives operators a clear pointer at the right knob instead of
accept-then-fail-later UX."""
called = {"count": 0}
def fake_post(*args, **kwargs):
called["count"] += 1
return _fake_resp(201)
monkeypatch.setattr("cli.commands.admin.api_post", fake_post)
runner = CliRunner()
result = runner.invoke(app, [
"admin", "register-table", "category_summary",
"--source-type", "bigquery",
"--query-mode", "materialized",
"--query", "SELECT 1",
# No --bucket on purpose.
])
assert result.exit_code != 0
# API never called — fail fast on client side.
assert called["count"] == 0
combined = result.stdout + (result.stderr or "")
assert "--bucket" in combined
# The error must explain WHY it's required, not just say "missing".
assert "schema" in combined.lower() or "identifier" in combined.lower()
def test_register_table_emits_first_sync_and_grant_hints(monkeypatch):
"""After a successful register-table for a materialized row, the CLI
output must point operators at:
(a) `agnes setup first-sync` registration adds a registry row but
does NOT trigger a parquet build, and `agnes pull` then reports
"Updated 0 tables (1 total)" until the next scheduler tick.
(b) `agnes admin grant create <group> table <id>` `agnes catalog`
is RBAC-filtered, so non-admin users won't see the new row
until a grant is created.
Without these hints operators bounce between symptoms and assume
something's broken when it's just unstated post-register UX."""
monkeypatch.setattr(
"cli.commands.admin.api_post",
lambda *a, **kw: _fake_resp(201),
)
runner = CliRunner()
result = runner.invoke(app, [
"admin", "register-table", "category_summary",
"--source-type", "bigquery",
"--query-mode", "materialized",
"--bucket", "analytics",
"--query", "SELECT category, SUM(rev) FROM `prj.ds.tx` GROUP BY 1",
])
assert result.exit_code == 0, result.stdout
out = result.stdout
assert "agnes setup first-sync" in out
assert "agnes admin grant create" in out
# The grant hint should mention the registered name so operators can
# copy-paste the next command verbatim.
assert "category_summary" in out