agnes-the-ai-analyst/tests/test_cli_query.py
ZdenekSrotyr 506a378c3a
release: 0.47.1 — Keboola connector v27 (incremental, partitioned, where_filters, typed parquet) (#217)
## Summary

Brings the Keboola connector to feature parity with the legacy internal data-analyst's per-table sync strategies. Closes the four documented gaps from the spec branch (`zs/keboola-connector-specs`):

- **Typed parquet** in the legacy SDK extraction path — column types from Keboola Storage metadata (provider cascade `user > ai-metadata-enrichment > keboola.snowflake-transformation`) survive the CSV → parquet roundtrip; invalid date strings (`'0000-00-00'`) and invalid numeric strings (`'Non-Manager'`) become NULL while keeping the column's typed schema. Pre-fix everything was VARCHAR.
- **Incremental sync** via Storage API `changedSince` — opt-in per table; pulls only delta rows, merges into the existing parquet by `primary_key` (drop_duplicates with keep='last'). Cuts daily extraction from O(full table) to O(delta).
- **Partitioned sync** — flat per-partition layout `data/<table>/<key>.parquet` (e.g. `2026_05.parquet`), per-affected-partition merge for daily updates, chunked initial load with 1-day overlap and 2-empty-chunk stop heuristic.
- **`where_filters`**: server-side row filter with date placeholders (`{{today}}`, `{{last_3_months}}`, `{{start_of_3_months_ago}}`, etc.) resolved at sync time. Using `where_filters` forces the SDK path, and the API layer rejects the `incremental + where_filters` combination because `changedSince` already filters temporally.
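
The incremental merge described above keeps the newest row per primary key. The PR does this with pandas (`pd.concat` plus `drop_duplicates(keep='last')`); the sketch below is a pure-Python stand-in for the same semantics, not the connector's code. The name `merge_keep_last` and the sample rows are made up for illustration.

```python
from typing import Any

Row = dict[str, Any]


def merge_keep_last(existing: list[Row], delta: list[Row], primary_key: list[str]) -> list[Row]:
    """Merge delta rows into existing rows, keeping the last row seen per key.

    Mirrors concat + drop_duplicates(subset=primary_key, keep='last'):
    the surviving row for each key is its last occurrence, in original order.
    """
    combined = existing + delta
    last_index: dict[tuple, int] = {}
    for i, row in enumerate(combined):
        # Later occurrences overwrite earlier ones, so delta wins over existing.
        last_index[tuple(row[col] for col in primary_key)] = i
    keep = set(last_index.values())
    return [row for i, row in enumerate(combined) if i in keep]


# Hypothetical data: id=2 is updated by the delta, id=3 is new.
existing = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
delta = [{"id": 2, "amount": 25}, {"id": 3, "amount": 30}]
merged = merge_keep_last(existing, delta, ["id"])
print(merged)
```

Merging a delta that repeats every existing key leaves the row count unchanged, which matches the "9 merged with 9 gives 9" result reported in the test plan.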

## Architecture

- **Schema migration v25 → v26**: 7 new columns on `table_registry`. Existing `sync_strategy` column reused (pre-v26 it was inert catalog metadata; post-v26 the extractor dispatches off it).
- **Per-table dispatcher** in `extractor.run()` routes to one of `_extract_via_extension` (full_refresh + extension), `_extract_via_legacy` (full_refresh + filters or extension fallback), `extract_incremental`, or `extract_partitioned`.
- **API conflict policy**: `incremental + where_filters` → 422; `partitioned + query_mode='remote'` → 422; `partitioned ⇒ partition_by required`.
- **Admin UI**: third "Direct extract (Storage API)" radio in the Keboola Register / Edit modals, alongside existing "Whole table (extension)" and "Custom SQL". When selected, exposes a v26 sync-strategy panel with conditional fields per strategy.
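
The API conflict policy above can be read as three independent checks. The following is a minimal sketch of such a validator, assuming the field names used in this description (`sync_strategy`, `where_filters`, `partition_by`, `query_mode`); the function name, the exception type, and the shape of a filter entry are hypothetical, and the real endpoint returns HTTP 422 rather than raising this exception.

```python
class RegistryValidationError(ValueError):
    """Hypothetical error type; the API maps these cases to HTTP 422."""


def validate_sync_config(sync_strategy, where_filters=None, partition_by=None, query_mode=None):
    """Reject the strategy combinations listed in the conflict policy."""
    if sync_strategy == "incremental" and where_filters:
        # changedSince already restricts rows temporally.
        raise RegistryValidationError("incremental cannot be combined with where_filters")
    if sync_strategy == "partitioned":
        if query_mode == "remote":
            raise RegistryValidationError("partitioned cannot be combined with query_mode='remote'")
        if not partition_by:
            raise RegistryValidationError("partitioned requires partition_by")


# Valid: a partitioned table with a partition column (column name is made up).
validate_sync_config("partitioned", partition_by="created_at")

# Invalid: incremental plus a filter (filter shape is an assumption).
try:
    validate_sync_config("incremental", where_filters=[{"column": "created_at"}])
except RegistryValidationError as exc:
    print(f"rejected: {exc}")
```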

## Test plan

- [x] **Unit + module** — 134 v26 tests covering migration, repo, parquet_io, where_filters, incremental (compute_changed_since + merge_parquet + extract_incremental E2E), partitioned (key derivation + merge_partition + chunked windows + extract_partitioned E2E), extractor dispatcher, admin API validators, PUT field clearing, registry-shape → dispatcher bridge
- [x] **HTML form structure** — all v26 inputs + visibility classes + JS payload fields verified in rendered template
- [x] **Real Keboola roundtrip** — registered a small test table as `sync_strategy='incremental'` against a test Storage project, triggered two syncs:
  - Sync 1: `changedSince=None` → full pull → 9 rows typed parquet
  - Sync 2: `changedSince=last_sync - 1d window` → 9 delta rows merged with 9 existing → 9 after dedup on primary_key (PK merge confirmed)
- [x] **Browser UX** — agent-browser session against a local uvicorn: login → admin/tables → register modal → switch radios → verify field visibility per strategy → submit → edit existing row → switch to Direct/Incremental → save → confirm DB persistence
- [x] **Regression** — no regressions in the broader 3252-test suite (3 pre-v26 tests updated for the deprecation-marker removal + schema-version bump; 2 pre-existing environment-sensitive test failures unrelated to this change)

## Bugs caught + fixed during E2E

The browser + real-Keboola roundtrip exposed four bugs the unit tests missed:

1. **JS visibility race** — two competing `forEach` loops set `display=''` then `display='none'` on form elements sharing `kb-strategy-incremental kb-strategy-partitioned` classes (window_days + max_history_days are reused across strategies). Fix: single-pass selector with class-based visibility resolver.
2. **PUT cannot clear field**: the pre-v26 `updates = {k: v ... if v is not None}` collapsed "omitted from body" and "sent as null" into the same case, so an admin could not switch a partitioned row back to `full_refresh` and have the stale `partition_by` cleared. Fix: `model_dump(exclude_unset=True)`.
3. **Subprocess DB lock conflict** — `_read_last_sync` reopened `system.duckdb` while the parent server held the write lock (subprocess contract at `app/api/sync.py:_run_sync` line 260). Fix: parent injects `__last_sync__` into table_config before subprocess spawn.
4. **Wrong KBC table_id** — `extract_incremental` / `extract_partitioned` built the Storage API table_id from the registry row's slugified `id` (`circle_inc`) instead of `bucket.source_table` (`in.c-finance.circle`), producing 404s. Fix: prefer `bucket+source_table`; fall back to `id` only when bucket empty.
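
Bug 2 is easy to reproduce without a web framework. The sketch below imitates a request model with plain dicts: every field defaults to `None`, and the set of keys actually present in the body is tracked separately, which is the information Pydantic's `exclude_unset=True` relies on. The helper names and the field subset are hypothetical, not the project's code.

```python
import json

# Hypothetical subset of updatable fields; each defaults to None,
# as optional fields on a request model typically do.
FIELD_DEFAULTS = {"sync_strategy": None, "partition_by": None, "where_filters": None}


def parse_body(raw: str) -> tuple[dict, set]:
    """Return (fields with defaults applied, names explicitly present in the body)."""
    body = json.loads(raw)
    return {**FIELD_DEFAULTS, **body}, set(body)


def updates_pre_fix(fields: dict) -> dict:
    # Drops every None, so an explicit null looks the same as an omitted field.
    return {k: v for k, v in fields.items() if v is not None}


def updates_post_fix(fields: dict, fields_set: set) -> dict:
    # Keeps exactly what the client sent, including explicit nulls.
    return {k: v for k, v in fields.items() if k in fields_set}


raw = '{"sync_strategy": "full_refresh", "partition_by": null}'
fields, fields_set = parse_body(raw)
print(updates_pre_fix(fields))               # partition_by is silently dropped
print(updates_post_fix(fields, fields_set))  # partition_by: None survives
```

With the pre-fix filter the stale `partition_by` is never written, while the post-fix version passes the explicit null through and leaves the omitted `where_filters` untouched.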

## Operator notes

- Existing tables stay on `full_refresh` after migration; admins opt individual tables in via `agnes admin register-table --sync-strategy ...`, the Keboola Edit modal, or `POST/PUT /api/admin/registry`.
- `merge_parquet` and `merge_partition` use `pd.concat + drop_duplicates`, loading both existing and delta into pandas RAM. For tables in the multi-million-row range this may OOM — switch to `partitioned` strategy for those (per-partition merge keeps memory bounded). Documented in `### Internal` of the changelog entry.
- Date placeholders are resolved at **sync time**, not register time, so a typo such as `{{lasst_week}}` is accepted at registration and surfaces only when the next sync runs. This is by design: rolling windows need late binding.
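
The late-binding behaviour in the last note can be illustrated with a small resolver. This is a sketch under stated assumptions: only `{{today}}` is modelled, the ISO date format is assumed, and raising `ValueError` for an unknown placeholder is a guess at how the failure surfaces. The names `RESOLVERS` and `resolve_placeholders` are hypothetical.

```python
import re
from datetime import date

# Hypothetical resolver table; the real connector supports more placeholders.
RESOLVERS = {"today": lambda today: today.isoformat()}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def resolve_placeholders(value: str, today: date) -> str:
    """Substitute {{name}} placeholders using the date of the current sync."""
    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in RESOLVERS:
            # A typo is only detected here, at sync time.
            raise ValueError("unknown date placeholder: {{" + name + "}}")
        return RESOLVERS[name](today)

    return PLACEHOLDER.sub(_substitute, value)


# Registration just stores the string, so nothing fails yet.
stored_filter_value = "{{lasst_week}}"

# The same stored value resolves differently on each sync day.
print(resolve_placeholders("{{today}}", date(2026, 5, 7)))
print(resolve_placeholders("{{today}}", date(2026, 5, 8)))

try:
    resolve_placeholders(stored_filter_value, date(2026, 5, 7))
except ValueError as exc:
    print(f"sync failed: {exc}")
```

Resolving at registration would freeze `{{today}}` to a single date, which is why the validation cost of late binding is accepted.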

## Spec source

The four corresponding plans on the `zs/keboola-connector-specs` branch under `docs/superpowers/plans/2026-05-07-0[1-4]-*.md` capture the design rationale and link back to internal repo references for each subsystem.
2026-05-07 19:01:27 +02:00

152 lines
6.2 KiB
Python

"""Tests for agnes query command."""

import json
import pytest
from unittest.mock import patch, MagicMock
from typer.testing import CliRunner

from cli.main import app

runner = CliRunner()


@pytest.fixture(autouse=True)
def tmp_config(tmp_path, monkeypatch):
    monkeypatch.setenv("AGNES_CONFIG_DIR", str(tmp_path / "config"))
    monkeypatch.setenv("AGNES_LOCAL_DIR", str(tmp_path / "local"))
    (tmp_path / "config").mkdir()
    (tmp_path / "local").mkdir()
    yield tmp_path


def _resp(status_code=200, json_data=None, text=""):
    r = MagicMock()
    r.status_code = status_code
    r.json.return_value = json_data if json_data is not None else {}
    r.text = text
    return r


class TestRemoteQuery:
    def test_remote_query_success(self):
        """--remote sends SQL to server and prints results."""
        # api_post is imported inside _query_remote so mock the source module
        payload = {"columns": ["id", "name"], "rows": [[1, "Alice"]], "truncated": False}
        with patch("cli.client.api_post", return_value=_resp(200, payload)):
            result = runner.invoke(app, ["query", "SELECT * FROM users", "--remote"])
        assert result.exit_code == 0

    def test_remote_query_failure(self):
        """--remote prints error message on API failure (#160 §4.7: shared
        renderer surfaces the detail; the prior `Query failed: ...` prefix
        was dropped in favor of HTTP-status + structured detail)."""
        with patch("cli.client.api_post", return_value=_resp(400, {"detail": "bad SQL"})):
            result = runner.invoke(app, ["query", "SELECT bad", "--remote"])
        assert result.exit_code == 1
        # Renderer formats string-detail as `HTTP 400: bad SQL`
        assert "HTTP 400" in result.output
        assert "bad SQL" in result.output

    def test_remote_query_truncated(self):
        """Truncated result shows warning."""
        payload = {"columns": ["id"], "rows": [[i] for i in range(5)], "truncated": True}
        with patch("cli.client.api_post", return_value=_resp(200, payload)):
            result = runner.invoke(app, ["query", "SELECT id FROM t", "--remote", "--limit", "5"])
        assert result.exit_code == 0
        assert "truncated" in result.output

    def test_remote_query_uses_long_timeout(self):
        """--remote passes the long-running QUERY_TIMEOUT_S to api_post.

        BigQuery SELECTs routinely take minutes; the default 30s httpx
        timeout dies long before the query finishes. Regression guard for
        the fix that introduced AGNES_QUERY_TIMEOUT (default 300s).
        """
        from cli.client import QUERY_TIMEOUT_S

        payload = {"columns": [], "rows": [], "truncated": False}
        mock_post = MagicMock(return_value=_resp(200, payload))
        with patch("cli.client.api_post", mock_post):
            result = runner.invoke(app, ["query", "SELECT 1", "--remote"])
        assert result.exit_code == 0
        assert mock_post.call_args.kwargs["timeout"] == QUERY_TIMEOUT_S
        assert QUERY_TIMEOUT_S >= 300.0


class TestLocalQuery:
    def test_local_query_no_db(self, tmp_config):
        """Local query without DuckDB exits with guidance."""
        result = runner.invoke(app, ["query", "SELECT 1"])
        assert result.exit_code == 1
        assert "not found" in result.output.lower()

    def test_local_query_with_real_db(self, tmp_config):
        """Local query executes against real DuckDB."""
        import duckdb

        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        conn = duckdb.connect(str(db_dir / "analytics.duckdb"))
        conn.execute("CREATE TABLE nums (n INTEGER)")
        conn.execute("INSERT INTO nums VALUES (1), (2), (3)")
        conn.close()

        result = runner.invoke(app, ["query", "SELECT SUM(n) as total FROM nums", "--format", "json"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data[0]["total"] == 6

    def test_local_query_csv_format(self, tmp_config):
        """--format csv produces CSV output."""
        import duckdb

        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        conn = duckdb.connect(str(db_dir / "analytics.duckdb"))
        conn.execute("CREATE TABLE t (a INTEGER, b VARCHAR)")
        conn.execute("INSERT INTO t VALUES (1, 'x')")
        conn.close()

        result = runner.invoke(app, ["query", "SELECT a, b FROM t", "--format", "csv"])
        assert result.exit_code == 0
        lines = result.output.strip().splitlines()
        assert lines[0] == "a,b"
        assert "1,x" in lines[1]

    def test_local_query_table_format(self, tmp_config):
        """Default table format renders without crash."""
        import duckdb

        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        conn = duckdb.connect(str(db_dir / "analytics.duckdb"))
        conn.execute("CREATE TABLE t (id INTEGER)")
        conn.execute("INSERT INTO t VALUES (42)")
        conn.close()

        result = runner.invoke(app, ["query", "SELECT id FROM t"])
        assert result.exit_code == 0
        assert "42" in result.output

    def test_local_query_limit(self, tmp_config):
        """--limit restricts rows returned."""
        import duckdb

        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        conn = duckdb.connect(str(db_dir / "analytics.duckdb"))
        conn.execute("CREATE TABLE big (n INTEGER)")
        conn.executemany("INSERT INTO big VALUES (?)", [(i,) for i in range(100)])
        conn.close()

        result = runner.invoke(app, ["query", "SELECT n FROM big", "--format", "json", "--limit", "5"])
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 5

    def test_local_query_sql_error(self, tmp_config):
        """SQL syntax error exits with error."""
        import duckdb

        db_dir = tmp_config / "local" / "user" / "duckdb"
        db_dir.mkdir(parents=True)
        duckdb.connect(str(db_dir / "analytics.duckdb")).close()

        result = runner.invoke(app, ["query", "SELECT * FROM nonexistent_table_xyz"])
        assert result.exit_code == 1
        assert "Query error" in result.output