fix(admin): register-table real-world UX gaps for materialized BQ

Three items from operator feedback after running the actual flow: (1) Help docstring lied: "--bucket / --source-table ignored" for materialized rows. Reality: --bucket is load-bearing because `agnes schema <name>` builds the BQ identifier as `bq.<bucket>.<source_table>`. An empty bucket registered the row but broke schema/describe with HTTP 400 "unsafe BQ identifier in registry". Fix: docstring rewritten to reflect reality, plus client-side validation rejects materialized + empty bucket with a clear error pointing at the right knob. (2) Post-register UX cliff: `agnes pull` after register-table reports "Updated 0 tables (1 total)" because registration adds a registry row but does NOT trigger a parquet build. Operators routinely assume something's broken when they need to run `agnes setup first-sync` to kick off the materialization. Hint emitted on success now points at first-sync. (3) RBAC gotcha: `agnes catalog` is RBAC-filtered via `resource_grants`, so non-admin users don't see freshly-registered rows until a grant is created. Hint emitted on success now points at `agnes admin grant create <group> table <name>`. Tests: 8/8 in test_cli_admin_materialized.py, including two new regression tests for the validation + the hint output.
2026-05-04 23:06:17 +02:00 · 2026-05-04 23:06:17 +02:00 · 36012e0833
commit 36012e0833
parent 5915f92eaa
3 changed files with 126 additions and 3 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -39,6 +39,10 @@ End-to-end clean-analyst-bootstrap rewrite. The web `/setup` page now produces a
 - `agnes snapshot create` (formerly `da fetch`) no longer materializes an empty `user/duckdb/analytics.duckdb` when run before any `agnes pull`. Friendly hint redirects to `agnes pull`.
 - Workspace `agnes status` reads from the canonical `server/parquet/` and `user/duckdb/analytics.duckdb` paths (was reading legacy `data/parquet/`, `data/metadata/last_sync.json`).
 - `agnes init` and `agnes pull` errors now use the `cli/error_render.py` typed-error renderer (added in 0.32.0), so analyst-facing error UX matches the structured shape `agnes query --remote` already produces.
+- **`agnes admin register-table` UX**: three real-world feedback items addressed.
+  - **`--query-mode materialized` now requires `--bucket`** (client-side validation; exits with a clear error before hitting the server). The previous help docstring claimed `--bucket` was *ignored* for materialized rows, but the value is actually load-bearing — `agnes schema <name>` builds the BQ identifier as `bq.<bucket>.<source_table>`, so an empty bucket registered the row but broke subsequent schema/describe with HTTP 400 "unsafe BQ identifier in registry". Docstring rewritten to reflect reality.
+  - **Post-success hints**: after a successful registration the CLI now points operators at the two follow-ups they routinely miss: (a) `agnes setup first-sync` to materialize the parquet (registration alone doesn't trigger a build; `agnes pull` reports "Updated 0 tables" until the scheduler tick), and (b) `agnes admin grant create <group> table <name>` to make the row visible in `agnes catalog` for non-admin users (catalog is RBAC-filtered).
+  - Test coverage: `tests/test_cli_admin_materialized.py::test_register_materialized_without_bucket_fails_with_clear_error` and `test_register_table_emits_first_sync_and_grant_hints`.
 - **`agnes query --remote` SQL rewriter no longer corrupts output when the GCP project ID contains a registered table name as a hyphen-delimited word** (Devin Review on `query.py:464`). The previous iterative rewrite (one `re.sub(\b<name>\b, ...)` per registered name) was vulnerable to cross-contamination: e.g. project `my-ue-project` + registered `orders` + registered `ue` → iter 1 rewrites `orders` to `\`my-ue-project.fin.orders\``, iter 2's `\bue\b` then matches the `ue` INSIDE `my-ue-project` and corrupts the iter-1 path. Fix: replaced the iteration with a SINGLE `re.sub` whose alternation regex (sorted longest-first) handles every name in one pass, so freshly-inserted backticked text isn't re-scanned. The fallback at `query.py:576` (per-table SELECT * on BQ parse error) caught the corrupted output as `bq_bad_request` so impact was over-estimation rather than fail-open, but the partition-pruning benefit of #171 is now preserved for projects whose IDs share a hyphen-segment with a registered table name. Regression test in `tests/test_api_query_guardrail.py::test_rewrite_helper_does_not_corrupt_when_project_id_contains_registered_name`.
 - **BigQuery materialize TTL reclaim is no longer dead code** (Devin Review on `extractor.py:166`). `_try_acquire_file_lock` used to call `open(lock_path, mode="w")` BEFORE checking the lock-file mtime, which truncated the file and refreshed mtime to *now* on every invocation. The subsequent `time.time() - lock_path.stat().st_mtime` always saw age ~0, so `age > TTL` never fired, and `materialize.lock_ttl_seconds` was a silently no-op config knob. Fix: stat the lock path BEFORE any `open()` to read the real pre-probe mtime; if older than TTL, unlink (forcing a fresh inode for the next `open + flock`); only then probe. Two regression tests added: `test_stale_held_lock_is_reclaimed_despite_live_holder` exercises the full reclaim path with a still-living fcntl holder, `test_failed_probe_does_not_self_refresh_lock_mtime` pins that a failed acquisition doesn't pathologically loop. Residual cross-process risk (a genuinely overrunning materialize past TTL races a fresh attempt) is documented in the helper docstring; in-process `threading.Lock` keyed on `table_id` blocks the single-process race.
 - **`agnes init --token X` now correctly uses the explicit token in the verify call**, even when `~/.config/agnes/token.json` already holds a stale token from a prior install. Pre-fix `cli.config.get_token()` read the on-disk file first and only fell back to env vars, so step 2 (PAT-verify) ran with the stale token and failed with a confusing 401 — even though the `--token` arg was valid (Devin Review on `init.py:99`). Fix: a `ContextVar`-based override in `cli.config` short-circuits `get_token()` before the file read; `_override_server_env` (used by both `agnes init` and `agnes pull`'s `run_pull`) sets it for the duration of the call. Async-safe (each task sees its own override) and leak-proof (resets on context exit).
--- a/cli/commands/admin.py
+++ b/cli/commands/admin.py
@ -92,12 +92,18 @@ def register_table(
    """Register a single table.

    Modes:
-    - **local** (Keboola): batch pull, parquet on disk.
+    - **local** (Keboola): batch pull, parquet on disk. Requires
+      `--bucket` + `--source-table`.
    - **remote** (BigQuery): view only, queries go to BQ. Requires
      `--bucket` + `--source-table`.
    - **materialized** (BigQuery): server-side scheduled SQL → parquet.
-      Requires `--query` (inline or `@file.sql`); `--bucket` /
-      `--source-table` ignored.
+      Requires `--query` (inline or `@file.sql`) AND `--bucket` (BQ
+      dataset of the destination identifier). `--source-table` defaults
+      to the registered `name` when omitted; explicit override is rare.
+      Note: `agnes schema <name>` builds the BQ identifier as
+      `bq.<bucket>.<source_table>` even for materialized rows, so an
+      empty `--bucket` here registers the row but breaks subsequent
+      schema/describe calls.

    `--dry-run` goes through /precheck (BQ remote only — for materialized
    rows, dry-run is a no-op since the SQL itself is the contract).
@ -123,6 +129,22 @@ def register_table(
        )
        raise typer.Exit(2)

+    # Bucket is load-bearing for the BQ destination identifier on
+    # materialized rows. Without it, registration succeeds but
+    # subsequent `agnes schema <name>` builds `bq."".."<src>"` from
+    # the empty bucket and the server rejects with HTTP 400 "unsafe
+    # BQ identifier in registry". Catch this at register time so the
+    # operator gets a clear error pointing at the right knob.
+    if query_mode == "materialized" and not bucket:
+        typer.echo(
+            "Error: --query-mode materialized requires --bucket (the BQ "
+            "dataset for the destination identifier). Without it the row "
+            "registers but `agnes schema <name>` later fails with "
+            "'unsafe BQ identifier in registry'.",
+            err=True,
+        )
+        raise typer.Exit(2)
+
    payload = {
        "name": name,
        "source_type": source_type,
@ -176,6 +198,33 @@ def register_table(
            typer.echo(f"Registered (materializing in background): {name}")
        else:
            typer.echo(f"Registered: {name}")
+
+        # Post-success hints. Two operator gotchas this catches:
+        #
+        # 1. `agnes pull` does not auto-materialize newly-registered
+        #    rows — registration adds a registry row, but the parquet
+        #    is built only when the scheduler tick runs (or first-sync
+        #    is triggered manually). Without this hint operators see
+        #    "Updated 0 tables" on `agnes pull` and assume something
+        #    is broken.
+        # 2. `register-table` does NOT auto-grant. `agnes catalog`
+        #    filters per-user via `resource_grants`, so operators
+        #    other than the registering admin won't see the new row
+        #    until a grant is created.
+        #
+        # Hint #1 only fires for `local` and `materialized` (the modes
+        # that actually produce a parquet); 202-async path covers a
+        # different signal, so don't double-message there.
+        if query_mode in ("local", "materialized") and resp.status_code != 202:
+            typer.echo(
+                "  Next: run `agnes setup first-sync` to materialize "
+                "the parquet (or wait for the scheduler tick)."
+            )
+        typer.echo(
+            f"  Note: register-table does not auto-grant. Run "
+            f"`agnes admin grant create <group> table {name}` to "
+            f"make this visible in `agnes catalog` for non-admin users."
+        )
    elif resp.status_code == 409:
        typer.echo(f"Already exists: {name}")
    else:
--- a/tests/test_cli_admin_materialized.py
+++ b/tests/test_cli_admin_materialized.py
@ -29,6 +29,7 @@ def test_register_materialized_with_inline_query(monkeypatch):
        "admin", "register-table", "orders_90d",
        "--source-type", "bigquery",
        "--query-mode", "materialized",
+        "--bucket", "fin",
        "--query", "SELECT date FROM `prj.ds.orders`",
        "--sync-schedule", "every 6h",
    ])
@ -59,6 +60,7 @@ def test_register_materialized_reads_query_from_file(tmp_path, monkeypatch):
        "admin", "register-table", "orders_90d",
        "--source-type", "bigquery",
        "--query-mode", "materialized",
+        "--bucket", "fin",
        "--query", f"@{sql_file}",
        "--sync-schedule", "daily 03:00",
    ])
@ -155,3 +157,71 @@ def test_register_remote_path_unchanged(monkeypatch):
    assert "source_query" not in captured["json"]
    assert captured["json"]["bucket"] == "analytics"
    assert captured["json"]["source_table"] == "orders"
+
+
+def test_register_materialized_without_bucket_fails_with_clear_error(monkeypatch):
+    """`--query-mode materialized` without `--bucket` is a client-side
+    error. Pre-fix the CLI sent `bucket=""` to the server; registration
+    succeeded but `agnes schema <name>` later 400ed with "unsafe BQ
+    identifier in registry" because the schema endpoint built
+    `bq.\"\".\"<src>\"` from the empty bucket. Catching this at register
+    time gives operators a clear pointer at the right knob instead of
+    accept-then-fail-later UX."""
+    called = {"count": 0}
+
+    def fake_post(*args, **kwargs):
+        called["count"] += 1
+        return _fake_resp(201)
+
+    monkeypatch.setattr("cli.commands.admin.api_post", fake_post)
+
+    runner = CliRunner()
+    result = runner.invoke(app, [
+        "admin", "register-table", "category_summary",
+        "--source-type", "bigquery",
+        "--query-mode", "materialized",
+        "--query", "SELECT 1",
+        # No --bucket on purpose.
+    ])
+
+    assert result.exit_code != 0
+    # API never called — fail fast on client side.
+    assert called["count"] == 0
+    combined = result.stdout + (result.stderr or "")
+    assert "--bucket" in combined
+    # The error must explain WHY it's required, not just say "missing".
+    assert "schema" in combined.lower() or "identifier" in combined.lower()
+
+
+def test_register_table_emits_first_sync_and_grant_hints(monkeypatch):
+    """After a successful register-table for a materialized row, the CLI
+    output must point operators at:
+    (a) `agnes setup first-sync` — registration adds a registry row but
+        does NOT trigger a parquet build, and `agnes pull` then reports
+        "Updated 0 tables (1 total)" until the next scheduler tick.
+    (b) `agnes admin grant create <group> table <id>` — `agnes catalog`
+        is RBAC-filtered, so non-admin users won't see the new row
+        until a grant is created.
+
+    Without these hints operators bounce between symptoms and assume
+    something's broken when it's just unstated post-register UX."""
+    monkeypatch.setattr(
+        "cli.commands.admin.api_post",
+        lambda *a, **kw: _fake_resp(201),
+    )
+
+    runner = CliRunner()
+    result = runner.invoke(app, [
+        "admin", "register-table", "category_summary",
+        "--source-type", "bigquery",
+        "--query-mode", "materialized",
+        "--bucket", "analytics",
+        "--query", "SELECT category, SUM(rev) FROM `prj.ds.tx` GROUP BY 1",
+    ])
+    assert result.exit_code == 0, result.stdout
+    out = result.stdout
+    assert "agnes setup first-sync" in out
+    assert "agnes admin grant create" in out
+    # The grant hint should mention the registered name so operators can
+    # copy-paste the next command verbatim.
+    assert "category_summary" in out