release: 0.40.0 — materialize_query writes _meta + inner view so master views appear

Pre-fix flow:
1. extractor subprocess writes _meta with N remote rows + creates N inner
   views in extract.duckdb (rebuild_from_registry skips materialized rows
   per design: explicit `continue` at line 389)
2. _run_materialized_pass calls materialize_query, which writes parquet
   atomically + returns stats, but never updates _meta
3. orchestrator.rebuild scans _meta, finds only the N remote rows, creates
   master views only for them. Materialized parquet is on disk but
   invisible to /api/query → 400 'not yet materialized'

Symptom appears after every container recreate (the previous run's _meta
state is wiped because docker compose down nukes the named volume that
backs extract.duckdb on some compose layouts; even on volumes that persist,
the next extractor pass calls _create_meta_table which DROPs + CREATEs
_meta cleanly).

Fix: after os.replace(tmp_path, parquet_path) in materialize_query, open
extract.duckdb (read-write), DELETE existing _meta row for table_id, INSERT
new one with query_mode='materialized', and CREATE OR REPLACE VIEW
<table_id> AS SELECT * FROM read_parquet(<path>). All inside a single
transaction so concurrent reads see either old or new state, not torn rows.
Fail-soft on lock contention or schema drift: parquet remains canonical,
next sync pass recovers.

Tests: 3 new in test_bq_materialize.py covering:
- meta + inner view registered after materialize, alongside existing
  remote rows
- re-run replaces (not duplicates) the meta row
- skips inner-view registration when extract.duckdb doesn't exist yet
  (fresh BQ-only deployment edge case)
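
The delete-then-insert registration described in the fix can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code from this commit: it uses Python's standard-library `sqlite3` as a stand-in for DuckDB so it runs without extra dependencies, a simplified `_meta` schema, and a hypothetical helper name `register_materialized`. The inner view over `read_parquet` is left out because SQLite has no equivalent.

```python
import sqlite3


def register_materialized(conn: sqlite3.Connection, table_id: str, rows: int) -> None:
    """Replace (not duplicate) the _meta row for table_id in one transaction.

    Hypothetical helper for illustration. _meta has no UNIQUE constraint,
    so the replace is a manual DELETE followed by an INSERT.
    """
    conn.execute("BEGIN")
    try:
        conn.execute("DELETE FROM _meta WHERE table_name = ?", (table_id,))
        conn.execute(
            "INSERT INTO _meta (table_name, rows, query_mode) "
            "VALUES (?, ?, 'materialized')",
            (table_id, rows),
        )
        conn.execute("COMMIT")
    except Exception:
        # Undo the DELETE so the previous row survives a failed INSERT.
        conn.execute("ROLLBACK")
        raise


# isolation_level=None puts sqlite3 in autocommit mode, so the explicit
# BEGIN/COMMIT above fully control the transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE _meta (table_name TEXT NOT NULL, rows INTEGER, "
    "query_mode TEXT DEFAULT 'remote')"
)
# A pre-existing remote-mode row that registration must leave untouched.
conn.execute("INSERT INTO _meta (table_name, rows) VALUES ('s1_session_landings', 0)")

register_materialized(conn, "orders_summary", 1)
register_materialized(conn, "orders_summary", 2)  # re-run replaces the row

meta_rows = conn.execute(
    "SELECT table_name, query_mode, rows FROM _meta ORDER BY table_name"
).fetchall()
print(meta_rows)
```

After the second call there is still exactly one `orders_summary` row, carrying the latest row count, and the remote row is unchanged. That is the same property the new tests in this commit check against DuckDB.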
2026-05-06 16:04:58 +02:00 · b5b16e98a0
commit b5b16e98a0
parent 6de7084c9f
4 changed files with 280 additions and 2 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,7 +10,31 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C

 ## [Unreleased]

-## [0.39.0] — 2026-05-06
+## [0.40.0] — 2026-05-06
+
+### Fixed
+- **Materialized BigQuery parquets now register themselves in
+  `extract.duckdb` so the master view actually appears**
+  (`connectors/bigquery/extractor.py:materialize_query`). Pre-fix the
+  function wrote the `<id>.parquet` to disk and returned the row count,
+  but **never** wrote a `_meta` row or an inner view in the connector's
+  `extract.duckdb`. The orchestrator's `rebuild()` scans `_meta` to
+  decide which master views to create, so materialized tables remained
+  invisible: `agnes query "SELECT … FROM <id>"` returned HTTP 400
+  *"registered as query_mode='materialized' but is not yet materialized
+  in this instance's analytics views"* even though the parquet was
+  sitting there. Symptom appeared after every container recreate (image
+  upgrade) and after every `_create_meta_table` cycle in the extractor
+  subprocess (which `DROP TABLE IF EXISTS _meta` + `CREATE TABLE`
+  cleanly each pass — wiping any prior materialized rows). Fix: after
+  the atomic `os.replace(tmp_path, parquet_path)`, open `extract.duckdb`
+  and run `DELETE FROM _meta WHERE table_name = ?`, an `INSERT`, and
+  `CREATE OR REPLACE VIEW <id> AS SELECT * FROM read_parquet('<path>')`
+  inside a single transaction. Idempotent, fail-soft (parquet remains
+  canonical, the next sync pass recovers any registration drift).
+  When `extract.duckdb` doesn't exist yet (fresh BQ-only deployment),
+  the fix logs and continues — the next extractor pass creates the
+  file and the master view appears on the rebuild after that.

 ### Performance
 - **`/api/query` (and `agnes query --remote`) now rewrites user SQL referencing
--- a/connectors/bigquery/extractor.py
+++ b/connectors/bigquery/extractor.py
@@ -269,6 +269,103 @@ def _create_meta_table(conn: duckdb.DuckDBPyConnection) -> None:
    )""")


+def _ensure_meta_table(conn: duckdb.DuckDBPyConnection) -> None:
+    """Idempotent variant of `_create_meta_table` — creates the table if
+    missing, leaves existing rows untouched. Used by `materialize_query`
+    to register the materialized parquet without wiping the
+    extractor-subprocess-written remote rows that share the same
+    extract.duckdb."""
+    conn.execute("""CREATE TABLE IF NOT EXISTS _meta (
+        table_name VARCHAR NOT NULL,
+        description VARCHAR,
+        rows BIGINT,
+        size_bytes BIGINT,
+        extracted_at TIMESTAMP,
+        query_mode VARCHAR DEFAULT 'remote'
+    )""")
+
+
+def _persist_materialized_inner_view(
+    extract_db_path: Path,
+    table_id: str,
+    parquet_path: Path,
+    rows: int,
+    size_bytes: int,
+) -> None:
+    """Write the materialized parquet's inner view + ``_meta`` row into
+    ``extract.duckdb`` so the orchestrator's master-view rebuild picks it
+    up uniformly with remote-mode rows. Without this, ``materialize_query``
+    leaves the parquet on disk but no record of it in ``_meta``, and the
+    orchestrator's ``rebuild()`` scan never creates the master view —
+    ``agnes query`` then 400s with "registered as query_mode='materialized'
+    but is not yet materialized" even though the parquet exists.
+
+    Idempotent: existing ``_meta`` row for the same ``table_name`` is
+    replaced, existing inner view is recreated. Fail-soft — the parquet
+    is the canonical artifact; if extract.duckdb registration fails (lock
+    contention, missing file, schema drift), log and continue. The
+    caller's ``rebuild_from_registry`` rebuild will get a chance to fix
+    it next pass.
+    """
+    if not extract_db_path.exists():
+        # Fresh BQ-only deployment hasn't run the extractor subprocess
+        # yet, so extract.duckdb doesn't exist. Nothing to update — the
+        # next extractor pass + materialize cycle will populate it.
+        logger.info(
+            "materialize: extract.duckdb at %s does not exist yet; "
+            "skipping inner-view registration. Next extractor pass will "
+            "create it and the master view will appear on the rebuild "
+            "after that.",
+            extract_db_path,
+        )
+        return
+
+    safe_ident = table_id.replace('"', '""')
+    safe_path = _escape_sql_string_literal(str(parquet_path))
+    try:
+        with duckdb.connect(str(extract_db_path), read_only=False) as ext_conn:
+            _ensure_meta_table(ext_conn)
+            # `_meta` has no UNIQUE on table_name (legacy schema), so we
+            # do a manual delete-then-insert. Wrap in a transaction so
+            # concurrent reads of `_meta` either see the old row or the
+            # new one, never both / neither.
+            ext_conn.execute("BEGIN")
+            try:
+                ext_conn.execute(
+                    "DELETE FROM _meta WHERE table_name = ?", [table_id]
+                )
+                ext_conn.execute(
+                    "INSERT INTO _meta VALUES (?, '', ?, ?, CURRENT_TIMESTAMP, 'materialized')",
+                    [table_id, rows, size_bytes],
+                )
+                # Inner view backing the master view. Orchestrator scans
+                # information_schema.tables for the attached extract.duckdb
+                # and only creates a master view when an inner object
+                # exists by the same name. read_parquet() is hot per-call,
+                # so the master view path goes through the same disk.
+                ext_conn.execute(
+                    f"CREATE OR REPLACE VIEW \"{safe_ident}\" AS "
+                    f"SELECT * FROM read_parquet('{safe_path}')"
+                )
+                ext_conn.execute("COMMIT")
+            except Exception:
+                try:
+                    ext_conn.execute("ROLLBACK")
+                except Exception:
+                    pass
+                raise
+    except Exception as e:
+        # Fail-soft: parquet is on disk, registry stays consistent, the
+        # next extractor + orchestrator pass will recover. Loud log so
+        # operators can spot persistent breakage.
+        logger.warning(
+            "materialize: failed to register %s in extract.duckdb (%s) — "
+            "parquet at %s is fine, master view will appear after the "
+            "next sync cycle. Error: %s",
+            table_id, extract_db_path, parquet_path, e,
+        )
+
+
 def _create_remote_attach_table(
    conn: duckdb.DuckDBPyConnection, project_id: str
 ) -> None:
@@ -667,6 +764,23 @@ def materialize_query(
            os.replace(tmp_path, parquet_path)

            rows = int(rows)
+
+            # Register the parquet in extract.duckdb so the orchestrator's
+            # master-view rebuild can pick it up uniformly with remote-mode
+            # rows. Without this, the parquet sits on disk but the master
+            # view is never created — `agnes query "SELECT … FROM <id>"`
+            # 400s with "not yet materialized in this instance's analytics
+            # views". Fail-soft — the parquet is canonical, the next
+            # extractor + orchestrator pass will recover any registration
+            # drift.
+            _persist_materialized_inner_view(
+                extract_db_path=out_path / "extract.duckdb",
+                table_id=table_id,
+                parquet_path=parquet_path,
+                rows=rows,
+                size_bytes=size_bytes,
+            )
+
            if rows == 0:
                # 0 rows is indistinguishable from "the SQL is wrong and nobody
                # noticed" — surface it loudly so operators see it in the scheduler
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.39.0"
+version = "0.40.0"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"
--- a/tests/test_bq_materialize.py
+++ b/tests/test_bq_materialize.py
@@ -148,3 +148,143 @@ def test_materialize_overwrites_existing_parquet(tmp_path):
        f"SELECT n FROM read_parquet('{out}/data/t1.parquet')"
    ).fetchall()
    assert rows == [(2,)]
+
+
+def test_materialize_persists_meta_and_inner_view_in_extract_db(tmp_path):
+    """0.40.0 fix: after materialize_query writes the parquet, it must also
+    register the table in extract.duckdb (`_meta` row + inner view) so the
+    orchestrator's master-view rebuild picks it up uniformly with remote-mode
+    rows. Without this, the parquet sits on disk but the master view never
+    materializes — `agnes query` 400s with "not yet materialized".
+    """
+    out = tmp_path / "extracts" / "bigquery"
+    out.mkdir(parents=True)
+
+    # Pre-create extract.duckdb (as the extractor subprocess would have done
+    # on this connector's first pass) with the canonical _meta table + a
+    # remote-mode row. We must verify the materialize call adds its row
+    # without wiping the existing remote rows.
+    extract_db = out / "extract.duckdb"
+    with duckdb.connect(str(extract_db)) as ext:
+        ext.execute("""CREATE TABLE _meta (
+            table_name VARCHAR NOT NULL,
+            description VARCHAR,
+            rows BIGINT,
+            size_bytes BIGINT,
+            extracted_at TIMESTAMP,
+            query_mode VARCHAR DEFAULT 'remote'
+        )""")
+        ext.execute(
+            "INSERT INTO _meta VALUES ('s1_session_landings', '', 0, 0, "
+            "CURRENT_TIMESTAMP, 'remote')"
+        )
+
+    bq = _make_stub_bq({
+        "bq.test.orders": (
+            "SELECT 'EU' AS region, 100 AS revenue UNION ALL "
+            "SELECT 'US' AS region, 250 AS revenue"
+        )
+    })
+
+    materialize_query(
+        table_id="orders_summary",
+        sql="SELECT region, SUM(revenue) AS revenue FROM bq.test.orders GROUP BY 1",
+        bq=bq,
+        output_dir=str(out),
+    )
+
+    # Parquet exists.
+    parquet_path = out / "data" / "orders_summary.parquet"
+    assert parquet_path.exists()
+
+    # _meta has BOTH the legacy remote row AND the new materialized row.
+    with duckdb.connect(str(extract_db), read_only=True) as ext:
+        rows = ext.execute(
+            "SELECT table_name, query_mode, rows FROM _meta ORDER BY table_name"
+        ).fetchall()
+        assert ("orders_summary", "materialized", 2) in [
+            (r[0], r[1], r[2]) for r in rows
+        ]
+        assert ("s1_session_landings", "remote", 0) in [
+            (r[0], r[1], r[2]) for r in rows
+        ]
+        # Inner view backing the master view exists, points at the parquet.
+        view_rows = ext.execute(
+            "SELECT * FROM \"orders_summary\" ORDER BY region"
+        ).fetchall()
+        assert view_rows == [("EU", 100), ("US", 250)]
+
+
+def test_materialize_replaces_meta_row_on_re_run(tmp_path):
+    """A second materialize for the same table_id must REPLACE the existing
+    `_meta` row, not duplicate it. Otherwise the orchestrator scan sees two
+    rows for the same name and creates the master view twice (or worse,
+    against stale row stats)."""
+    out = tmp_path / "extracts" / "bigquery"
+    out.mkdir(parents=True)
+    # Pre-create extract.duckdb (the extractor subprocess would do this on
+    # the first sync pass). We shortcut so the test exercises the
+    # delete-then-insert branch on re-run, not the "no extract.duckdb yet"
+    # skip branch.
+    extract_db = out / "extract.duckdb"
+    with duckdb.connect(str(extract_db)) as ext:
+        ext.execute("""CREATE TABLE _meta (
+            table_name VARCHAR NOT NULL,
+            description VARCHAR,
+            rows BIGINT,
+            size_bytes BIGINT,
+            extracted_at TIMESTAMP,
+            query_mode VARCHAR DEFAULT 'remote'
+        )""")
+
+    bq = _make_stub_bq({
+        "bq.test.t1": "SELECT 'EU' AS region, 100 AS revenue",
+        "bq.test.t2": (
+            "SELECT 'EU' AS region, 100 AS revenue UNION ALL "
+            "SELECT 'US' AS region, 250 AS revenue"
+        ),
+    })
+
+    # First pass — 1 row.
+    materialize_query(
+        table_id="orders_summary",
+        sql="SELECT region, revenue FROM bq.test.t1",
+        bq=bq, output_dir=str(out),
+    )
+    # Second pass — different SQL, 2 rows. Must overwrite, not duplicate.
+    materialize_query(
+        table_id="orders_summary",
+        sql="SELECT region, revenue FROM bq.test.t2",
+        bq=bq, output_dir=str(out),
+    )
+
+    extract_db = out / "extract.duckdb"
+    with duckdb.connect(str(extract_db), read_only=True) as ext:
+        rows = ext.execute(
+            "SELECT COUNT(*), MAX(rows) FROM _meta WHERE table_name = 'orders_summary'"
+        ).fetchone()
+        assert rows[0] == 1, "must be exactly one _meta row, not duplicated"
+        assert rows[1] == 2, "row count reflects the latest run, not the first"
+
+
+def test_materialize_skips_inner_view_when_extract_db_missing(tmp_path):
+    """Fresh BQ-only deployment may not have run the extractor subprocess
+    yet, so extract.duckdb doesn't exist. materialize_query must not crash
+    on that path. It logs and continues; the next extractor pass +
+    rebuild will pick up the parquet via its registry row."""
+    out = tmp_path / "extracts" / "bigquery"
+    out.mkdir(parents=True)
+    # Deliberately do NOT create extract.duckdb.
+
+    bq = _make_stub_bq({"bq.test.t": "SELECT 1 AS n"})
+
+    # Should NOT raise — fail-soft.
+    stats = materialize_query(
+        table_id="solo_table",
+        sql="SELECT n FROM bq.test.t",
+        bq=bq, output_dir=str(out),
+    )
+    assert stats["rows"] == 1
+    # Parquet is on disk, extract.duckdb still doesn't exist (no force-create).
+    assert (out / "data" / "solo_table.parquet").exists()
+    assert not (out / "extract.duckdb").exists()
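
The difference between the destructive `_create_meta_table` and the idempotent `_ensure_meta_table` added in this diff comes down to `DROP` + `CREATE` versus `CREATE TABLE IF NOT EXISTS`. The sketch below illustrates that contrast only. It assumes Python's standard-library `sqlite3` as a stand-in for DuckDB and a reduced two-column schema, and the unprefixed function names are illustrative rather than the project's own.

```python
import sqlite3

# Reduced schema for illustration; the real _meta table has more columns.
SCHEMA = "(table_name TEXT NOT NULL, query_mode TEXT DEFAULT 'remote')"


def create_meta_table(conn: sqlite3.Connection) -> None:
    """Destructive variant: any existing rows are lost on every call."""
    conn.execute("DROP TABLE IF EXISTS _meta")
    conn.execute(f"CREATE TABLE _meta {SCHEMA}")


def ensure_meta_table(conn: sqlite3.Connection) -> None:
    """Idempotent variant: creates the table only if it is missing."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS _meta {SCHEMA}")


conn = sqlite3.connect(":memory:")
create_meta_table(conn)
conn.execute("INSERT INTO _meta (table_name) VALUES ('s1_session_landings')")

# Calling the idempotent variant keeps the existing row.
ensure_meta_table(conn)
count_after_ensure = conn.execute("SELECT COUNT(*) FROM _meta").fetchone()[0]

# Calling the destructive variant wipes it.
create_meta_table(conn)
count_after_create = conn.execute("SELECT COUNT(*) FROM _meta").fetchone()[0]

print(count_after_ensure, count_after_create)
```

This is why the registration path needs the idempotent variant: it shares `extract.duckdb` with rows written by the extractor subprocess and must not wipe them.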