release: 0.47.2 — restore #218 + #219 fixes silently reverted by #217 (#225)

## Summary Smoke-testing the just-shipped 0.47.1 against production exposed two regressions: 1. `agnes query --remote "SELECT FROM unit_economics WHERE bad_col=1"` returned `Table "unit_economics" must be qualified` (the OLD error) instead of `Unrecognized name: bad_col` (the #218 fix's intended behavior). 2. `agnes query "DESCRIBE unit_economics"` showed only DuckDB's misleading `Did you mean order_economics?` with no Agnes hint paragraph (the #219 fix is missing). Root cause: PR #217's squash merge (`506a378c`) carried stale snapshots of `app/api/query.py` and `cli/commands/query.py` from before #218 and #219 merged. The rebase-and-merge auto-merged those files cleanly (no conflict markers) but the result silently reverted both fixes. Restore the two changes verbatim. Tests for both fixes already on main and continue to pass against the restored code. ## Test plan - [x] `pytest tests/test_api_query_guardrail.py tests/test_cli_query.py` — clean - [x] Manual repro against prod after deploy: both flows now surface the intended diagnostic.  --- <a href="https://app.devin.ai/review/keboola/agnes-the-ai-analyst/pull/225" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open in Devin Review"> </picture> </a>
2026-05-07 19:57:18 +02:00 · 2026-05-07 19:57:18 +02:00 · 917f9aaef0
commit 917f9aaef0
parent 506a378c3a
6 changed files with 164 additions and 7 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -10,6 +10,12 @@ CalVer image tags (`stable-YYYY.MM.N`, `dev-YYYY.MM.N`) are produced for every C
 ## [Unreleased]
 ## [0.47.2] — 2026-05-07
 ### Fixed
 - Restore #218 (real BQ error surfacing in `remote_estimate_failed`) and #219 (friendlier missing-table hint in `agnes query`) — both fixes were silently reverted by the squash merge of #217 because that branch carried stale snapshots of `app/api/query.py` and `cli/commands/query.py` from before #218 and #219 merged. Verified end-to-end against production: `agnes query --remote "SELECT FROM unit_economics WHERE bad_col=1"` now returns the BQ "Unrecognized name" diagnostic; `agnes query "DESCRIBE unit_economics"` now appends the remote-table hint.
 ## [0.47.1] — 2026-05-07
 Keboola connector v27 — incremental, partitioned, where_filters, typed parquet.
--- a/app/api/query.py
+++ b/app/api/query.py
@ -971,6 +971,16 @@ def _bq_quota_and_cap_guard(
    queries via `remote_scan_too_large`. Forbidden / upstream errors
    still propagate as HTTP 502.
    On retry-failure the surfaced `underlying` is the FIRST exception's
    message (the rewritten-SQL diagnostic) — not the second's. For the
    common case where the user references a catalog id (no qualifying
    dataset in their SQL), the second attempt is guaranteed to fail
    with the unhelpful ``Table "<id>" must be qualified with a dataset``,
    masking the actually-useful ``Unrecognized name: <column>`` /
    ``Syntax error`` diagnostic from the rewritten attempt. The
    second-attempt message is preserved as `underlying_original` for
    operator visibility.
    Flow:
    1. `check_daily_budget` — over-cap users get 429 BEFORE any BQ work.
    2. `quota.acquire(user_id)` opened — concurrent-slot held throughout.
@ -1068,15 +1078,26 @@ def _bq_quota_and_cap_guard(
                    raise HTTPException(status_code=400, detail={
                        "kind": "remote_estimate_failed",
                        "message": (
-                            "Could not estimate scan size for this query."
+                            "BigQuery rejected this query during cost "
                            "estimation."
                        ),
                        "hint": (
-                            "Use a registered table name from `agnes "
+                            "Most often this means a column referenced "
-                            "catalog`, or write BQ-native SQL with full "
+                            "in WHERE/SELECT/etc doesn't exist on the "
-                            "backtick paths. Pure DuckDB-only syntax is "
+                            "table — verify with `agnes schema <id>`. "
-                            "not supported for --remote queries."
+                            "Otherwise: use a registered table name from "
                            "`agnes catalog`, or write BQ-native SQL "
                            "with full backtick paths."
                        ),
-                        "underlying": exc2.message,
+                        # Surface the FIRST attempt's diagnostic (rewritten
                        # SQL — has the real "Unrecognized name" / syntax
                        # info). Second attempt for catalog-id-only SQL
                        # always fails with the unhelpful "must be
                        # qualified" message, so we keep it as
                        # `underlying_original` for operator context but
                        # don't lead with it.
                        "underlying": exc.message,
                        "underlying_original": exc2.message,
                    })
            # Distribute the total to dry_run_set so the caller's
--- a/cli/commands/query.py
+++ b/cli/commands/query.py
@ -2,6 +2,7 @@
 import json
 import os
 import re
 import sys
 from pathlib import Path
 from typing import List, Optional
@ -78,6 +79,26 @@ def _query_local(sql: str, fmt: str, limit: int):
        _output(columns, result, fmt)
    except Exception as e:
        typer.echo(f"Query error: {e}", err=True)
        # DuckDB's "Did you mean <similar materialized view>" suggestion is
        # misleading when the unresolvable identifier is actually a
        # `query_mode='remote'` table — those have no local view by design.
        # Append a friendly hint pointing the user at `agnes catalog`,
        # `agnes schema`, and `agnes query --remote`. We don't verify against
        # the remote registry here (this command is offline-friendly), so the
        # hint is conditional ("might be") — safe even when the name was just
        # a typo.
        m = re.search(r"Table with name ([A-Za-z_][A-Za-z0-9_]*) does not exist", str(e))
        if m:
            typer.echo("", err=True)
            typer.echo(
                f"Note: `{m.group(1)}` might be a `query_mode='remote'` table. Local "
                "DuckDB only holds views for `local` and `materialized` tables — "
                "`remote` ones live on BigQuery and are not synced.\n"
                "  - List all registered tables:    agnes catalog\n"
                "  - Inspect column schema:         agnes schema <name>\n"
                "  - Run a query against BigQuery:  agnes query --remote \"<SQL>\"",
                err=True,
            )
        raise typer.Exit(1)
    finally:
        conn.close()
--- a/pyproject.toml
+++ b/pyproject.toml
@ -1,6 +1,6 @@
 [project]
 name = "agnes-the-ai-analyst"
-version = "0.47.1"
+version = "0.47.2"
 description = "Agnes — AI Data Analyst platform for AI analytical systems"
 requires-python = ">=3.11,<3.14"
 license = "MIT"
--- a/tests/test_api_query_guardrail.py
+++ b/tests/test_api_query_guardrail.py
@ -347,6 +347,72 @@ def test_fallback_fails_fast_on_pure_duckdb_syntax(
           "backtick" in detail.get("hint", "").lower(), detail
 def test_remote_estimate_failed_surfaces_first_error_when_attempts_differ(
    seeded_app, mock_dry_run, monkeypatch,
 ):
    """When the rewritten-SQL dry-run fails with a column-not-found /
    syntax error and the original-SQL retry fails with the unhelpful
    "must be qualified" (the typical shape for catalog-id references —
    user SQL has no qualifying dataset, so the retry is guaranteed to
    fail this way), the surfaced `underlying` MUST be the first
    attempt's diagnostic. Pre-fix the second attempt's message
    overwrote the first, masking the real cause from the user.
    """
    from connectors.bigquery.access import BqAccessError
    _register_bq_remote_row("ue", "finance", "ue")
    state = {"calls": 0}
    def two_different_errors(_bq, _sql):
        state["calls"] += 1
        if state["calls"] == 1:
            raise BqAccessError(
                "bq_bad_request",
                "Unrecognized name: authorize_date at [1:88]",
            )
        raise BqAccessError(
            "bq_bad_request",
            "Table 'unit_economics' must be qualified with a dataset "
            "(e.g. dataset.table)",
        )
    monkeypatch.setattr(
        "app.api.query._bq_dry_run_bytes", two_different_errors,
        raising=False,
    )
    c = seeded_app["client"]
    token = seeded_app["admin_token"]
    r = c.post(
        "/api/query",
        json={
            "sql": (
                "SELECT COUNT(*) FROM ue "
                "WHERE authorize_date = DATE '2025-05-06'"
            ),
        },
        headers=_auth(token),
    )
    assert state["calls"] == 2, (
        f"expected rewritten + original-retry = 2 dry-runs, got "
        f"{state['calls']}"
    )
    assert r.status_code == 400, r.json()
    detail = r.json().get("detail", {})
    assert isinstance(detail, dict), detail
    assert detail.get("kind") == "remote_estimate_failed", detail
    # The FIRST attempt's diagnostic — the actually-useful one — wins.
    assert "authorize_date" in detail.get("underlying", ""), detail
    # The second attempt's context is preserved for operator visibility.
    assert "must be qualified" in detail.get("underlying_original", ""), \
        detail
    # Hint now points at `agnes schema` first — the typical cause is a
    # typo'd column name on the FROM table.
    assert "agnes schema" in detail.get("hint", "").lower(), detail
 def test_guardrail_propagates_502_on_non_parse_bq_errors(
    seeded_app, mock_dry_run, monkeypatch,
 ):
--- a/tests/test_cli_query.py
+++ b/tests/test_cli_query.py
@ -150,3 +150,46 @@ class TestLocalQuery:
        result = runner.invoke(app, ["query", "SELECT * FROM nonexistent_table_xyz"])
        assert result.exit_code == 1
        assert "Query error" in result.output
    def test_local_query_missing_table_hints_remote(self, tmp_config):
        """Querying a table absent from local DuckDB surfaces a hint about
        `query_mode='remote'` tables alongside the original DuckDB error.
        Reproduces the analyst-session UX gap where DuckDB's nearest-name
        ("Did you mean <other_table>") suggestion sent the user down the
        wrong path — they thought the table didn't exist or they typo'd,
        when in fact it's a remote table that intentionally has no local
        view.
        """
        import duckdb
        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        duckdb.connect(str(db_dir / "analytics.duckdb")).close()
        result = runner.invoke(app, ["query", "DESCRIBE unit_economics"])
        assert result.exit_code == 1
        # Original DuckDB diagnostic must remain visible (don't break logging).
        assert "Query error" in result.output
        assert "Table with name unit_economics does not exist" in result.output
        # New hint fires.
        assert "query_mode='remote'" in result.output
        assert "agnes catalog" in result.output
        assert "agnes schema" in result.output
        assert "agnes query --remote" in result.output
    def test_local_query_syntax_error_does_not_show_remote_hint(self, tmp_config):
        """A non-missing-table failure (e.g. raw syntax error) must NOT
        trigger the new remote-mode hint — the regex only matches DuckDB's
        `Table with name X does not exist` shape.
        """
        import duckdb
        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        duckdb.connect(str(db_dir / "analytics.duckdb")).close()
        # Trailing FROM with no relation -> ParserException, not CatalogException.
        result = runner.invoke(app, ["query", "SELECT * FROM"])
        assert result.exit_code == 1
        assert "Query error" in result.output
        assert "query_mode='remote'" not in result.output
        assert "agnes query --remote" not in result.output