10 commits

Author SHA1 Message Date
ZdenekSrotyr
35df940e5c fix: BQ COUNT subquery alias, wrap ImportError in RemoteQueryError
- Add AS _cnt alias to COUNT(*) subquery (BQ Standard SQL requires it)
- Catch ImportError in _get_bq_client() and raise RemoteQueryError
  so API endpoint returns proper 400 instead of 500

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 20:29:03 +02:00
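
The two fixes in 35df940e5c can be illustrated with a minimal sketch. It is not the code from the commit: RemoteQueryError here is a local stand-in, load_optional_module() and build_count_sql() are hypothetical names, and the sketch assumes the _cnt alias is attached to the derived table rather than to the COUNT(*) column.

```python
import importlib


class RemoteQueryError(Exception):
    """Hypothetical stand-in for the project's RemoteQueryError."""


def load_optional_module(module_name):
    """Import an optional dependency, converting ImportError to RemoteQueryError.

    Callers that already map RemoteQueryError to a client error then treat a
    missing package the same way as any other failed remote query.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RemoteQueryError(
            f"required package is not installed: {module_name}"
        ) from exc


def build_count_sql(inner_sql):
    """Wrap a sub-query in COUNT(*), giving the derived table the _cnt alias."""
    return f"SELECT COUNT(*) FROM ({inner_sql}) AS _cnt"
```

Chaining with "from exc" keeps the original ImportError available as __cause__ for logging while the caller only has to handle one exception type.
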
ZdenekSrotyr
2ad8828f8c fix: stdin register_bq parsing, separate BQ SQL validation
- cli/commands/query.py: --stdin mode now reads register_bq from the
  JSON payload and merges it into the register_bq option list, matching
  the documented {"register_bq": {...}, "sql": "..."} contract.
- src/remote_query.py: add _validate_bq_sql() with a narrower blocklist
  (writes only); register_bq() now calls _validate_bq_sql() so legitimate
  BQ operations like INFORMATION_SCHEMA, CALL, IMPORT are not blocked.
  The final DuckDB execute() path still uses the full _validate_sql().
- tests/test_remote_query.py: add TestValidateBqSql covering allowed
  INFORMATION_SCHEMA queries and blocked write operations.
2026-04-11 19:31:39 +02:00
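
A write-only blocklist of the kind described in 2ad8828f8c might look like the sketch below. The keyword list and the validate_bq_sql() name are assumptions for illustration; the log does not show which keywords _validate_bq_sql() actually blocks.

```python
import re

# Assumed keyword list: write/DDL statements only. The real list is not shown
# in the commit log.
BQ_WRITE_KEYWORDS = (
    "INSERT", "UPDATE", "DELETE", "MERGE",
    "DROP", "CREATE", "ALTER", "TRUNCATE",
)

# \b keeps identifiers such as created_at or update_time from matching,
# because "d" and "_" are word characters.
_WRITE_PATTERN = re.compile(
    r"\b(" + "|".join(BQ_WRITE_KEYWORDS) + r")\b", re.IGNORECASE
)


def validate_bq_sql(sql):
    """Return sql unchanged, or raise ValueError if it contains a write keyword."""
    match = _WRITE_PATTERN.search(sql)
    if match:
        raise ValueError(
            f"write operation not allowed in BQ sub-query: {match.group(1).upper()}"
        )
    return sql
```

A plain keyword scan also rejects a blocked word that appears inside a string literal, so this sketch errs on the strict side. INFORMATION_SCHEMA queries and CALL statements pass because none of their keywords are on the list.
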
ZdenekSrotyr
f4129dc87d fix: alias validation, url escaping, read-only CLI, blocklist comment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-11 11:28:27 +02:00
ZdenekSrotyr
86bbb8fce4 feat: add RemoteQueryEngine with BQ registration and safety limits
Two-phase query engine: Phase 1 registers BQ query results as DuckDB
Arrow views (with COUNT pre-check, row/memory limits, Storage API
fallback); Phase 2 executes validated SQL against DuckDB with result
serialization and truncation. 25 tests covering all branches.
2026-04-11 11:07:08 +02:00
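
The "result serialization and truncation" step of Phase 2 could be sketched as follows. The function name, the output keys, and the type conversions are assumptions; the 100,000-row default is borrowed from the limit quoted in d180b2014e and may differ in RemoteQueryEngine.

```python
import datetime
import decimal


def _to_json_safe(value):
    """Convert values that json.dumps cannot encode into strings."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return str(value)
    return value


def serialize_result(columns, rows, max_rows=100_000):
    """Cut rows to max_rows and return a JSON-serializable dict.

    The "truncated" flag tells the caller that more rows existed than
    were returned.
    """
    kept = rows[:max_rows]
    return {
        "columns": list(columns),
        "rows": [[_to_json_safe(v) for v in row] for row in kept],
        "row_count": len(kept),
        "truncated": len(rows) > max_rows,
    }
```
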
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
Petr
84d14da611 Fix remote query UX: file-based stdin, ssh permissions, deprecation
Session testing revealed 3 issues with remote queries:

1. CLAUDE.md template recommended `cat <<HEREDOC | ssh ...` but
   claude_settings.json had `cat` in deny list, causing 2-3 failed
   attempts per query. Replaced with file-based approach: Write tool
   creates JSON file, then `ssh ... < file` avoids the cat deny.

2. ssh/scp commands were not in the allow list, requiring manual
   approval for every remote query. Added both to allow list.

3. DuckDB fetch_arrow_table() emitted DeprecationWarning on every
   parquet export. Replaced with .arrow().read_all().

Also added instruction for proactive hybrid analysis when remote
tables are available (agent was only using local data until asked).
2026-03-21 18:41:43 +01:00
Petr
8c6c162417 Fix: --sql not required when --stdin is used
argparse was rejecting --stdin mode because --sql was required=True.
Changed to required=False with runtime validation in main().
2026-03-21 12:17:02 +01:00
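
The argparse change in 8c6c162417 follows a common pattern: declare the option as optional, then enforce the dependency between options after parsing. A minimal sketch, assuming a hypothetical parse_args() helper rather than the project's main():

```python
import argparse


def parse_args(argv):
    """Parse CLI arguments; --sql is mandatory only when --stdin is absent."""
    parser = argparse.ArgumentParser(prog="remote_query")
    # required=False so that argparse itself no longer rejects --stdin mode.
    parser.add_argument("--sql", required=False, default=None)
    parser.add_argument("--stdin", action="store_true")
    args = parser.parse_args(argv)

    # Runtime validation replaces the static required=True check.
    if not args.stdin and not args.sql:
        parser.error("--sql is required unless --stdin is used")
    return args
```

parser.error() prints the usage text and exits with status 2, which matches what argparse does for its own validation failures.
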
Petr
67df4acd73 Add --stdin JSON mode to avoid shell escaping nightmare
Agent was failing 3x on SSH commands due to backticks (BQ table names)
and single quotes (SQL string literals) getting mangled by nested shell
interpretation (local -> SSH -> bash -> Python).

New --stdin mode reads query spec as JSON from stdin via heredoc:
  cat <<'QUERY' | ssh alias 'bash remote_query.sh --stdin'
  {"register_bq": {"alias": "SELECT ... FROM `table` ..."}, "sql": "..."}
  QUERY

Heredoc with <<'QUERY' (quoted) passes everything literally -- no
escaping needed for backticks, quotes, or parentheses.

Updated claude_md_template.txt to use --stdin as the primary method.
2026-03-21 12:15:50 +01:00
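
On the receiving side, the --stdin payload only needs a JSON parser, which is why backticks and single quotes survive intact. The sketch below assumes the {"register_bq": {...}, "sql": "..."} shape shown above; read_query_spec() is a hypothetical name, not the function in src/remote_query.py.

```python
import json


def read_query_spec(stream):
    """Read a query spec from a text stream and return (register_bq, sql).

    register_bq maps an alias to a BigQuery sub-query; it may be omitted.
    """
    payload = json.load(stream)
    if not isinstance(payload, dict):
        raise ValueError("query spec must be a JSON object")

    sql = payload.get("sql")
    if not isinstance(sql, str) or not sql.strip():
        raise ValueError('query spec needs a non-empty "sql" string')

    register_bq = payload.get("register_bq", {})
    if not isinstance(register_bq, dict):
        raise ValueError('"register_bq" must map aliases to SQL strings')

    return register_bq, sql
```

In a CLI this would be called as read_query_spec(sys.stdin); taking the stream as a parameter keeps the function testable with io.StringIO.
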
Petr
39763ea5a2 Fix: load instance.yaml without requiring webapp secrets
Analysts don't have WEBAPP_SECRET_KEY, so load_instance_config()
validation failed with noisy warnings. Now reads instance.yaml
directly with yaml.safe_load, skipping secret validation.
2026-03-21 12:01:41 +01:00
Petr
d180b2014e Step 28: Remote query architecture for local+remote table JOINs
Add src/remote_query.py CLI module enabling the AI agent to run SQL
queries spanning local Parquet tables and remote BigQuery tables in a
single DuckDB session on the server. Two-phase protocol: BQ sub-queries
(--register-bq) fetch filtered/aggregated data, then DuckDB SQL (--sql)
joins everything.

Safety: COUNT(*) pre-check, memory estimation (2GB cap), row limits
(500K per BQ sub-query, 100K final result).

Changes:
- New src/remote_query.py with CLI, BQ registration, output formatting
- Add bq_entity_type field to TableConfig (view vs table routing)
- Extract create_local_views() from duckdb_manager.py for reuse
- Update claude_md_template.txt with remote query agent instructions
- Update example configs with remote_query section and docs
- 52 new tests (42 remote_query + 10 bq_entity_type), all passing
2026-03-21 11:39:15 +01:00
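
The safety checks listed in d180b2014e (COUNT(*) pre-check, memory estimation, row limit per BQ sub-query) can be sketched as a single guard that runs before any data is fetched. The limits come from the commit message; the function name, the LimitExceeded exception, and the idea of estimating memory as rows times an average row size are assumptions, since the log does not say how the estimate is computed.

```python
MAX_BQ_ROWS = 500_000             # per BQ sub-query, from the commit message
MAX_MEMORY_BYTES = 2 * 1024 ** 3  # "2GB cap", read here as 2 GiB


class LimitExceeded(Exception):
    """Hypothetical error raised when a sub-query would exceed a safety limit."""


def check_bq_subquery(row_count, est_bytes_per_row):
    """Validate a sub-query using its COUNT(*) result before fetching rows.

    row_count is the COUNT(*) pre-check result; est_bytes_per_row is an
    assumed average row size. Returns the estimated size in bytes.
    """
    if row_count > MAX_BQ_ROWS:
        raise LimitExceeded(
            f"sub-query returns {row_count} rows, limit is {MAX_BQ_ROWS}"
        )
    estimated = row_count * est_bytes_per_row
    if estimated > MAX_MEMORY_BYTES:
        raise LimitExceeded(
            f"estimated {estimated} bytes exceeds the {MAX_MEMORY_BYTES}-byte cap"
        )
    return estimated
```

Checking the row count first means an oversized sub-query is rejected at the cost of one COUNT(*) query instead of a full transfer.
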