Commit graph

14 commits

Author SHA1 Message Date
ZdenekSrotyr
5eaa449fcc fix(query): #168 review iter 2 — quota user_id parity + concurrent-slot 429
Devin Review iter #2 found 2 new issues (after iter #1's 5 fixes
landed). Both real, both addressed.

🔴 Quota user_id key mismatch defeated shared daily budget. /api/query
computed `user.get("id") or user.get("email")` while /api/v2/scan uses
`user.get("email") or "anon"` (app/api/v2_scan.py:327). Same user → two
different keys in the singleton QuotaTracker. BQ bytes consumed via
/api/query were tracked under UUID; via /api/v2/scan under email; the
`check_daily_budget` pre-flight on either endpoint never saw the
other's recorded bytes — per-user cap was effectively doubled. Match
v2/scan's email-first ordering.

🟡 QuotaExceededError(KIND_CONCURRENT) → 400 instead of 429.
`quota.acquire(user_id)` raises this from __enter__ when the per-user
concurrent-scan slot is at cap. The exception propagated through the
@contextlib.contextmanager generator, the caller's `with guard:`
block, and was caught by execute_query's generic `except Exception`
handler → mapped to 400 with a flattened "Query error: concurrent_scans:
N/M" string, dropping the typed retry_after_seconds field. Wrap the
`with quota.acquire(...)` in a try/except QuotaExceededError that maps
to 429 with the same typed-detail shape used for the daily-budget
rejection — consistent with /api/v2/scan:392-402.

Tests: test_api_query_quota.py user_id strings updated to
"admin@test.com" (the seeded_app admin's email) to match the new
email-first ordering. 40 affected tests pass.
2026-05-04 13:38:31 +02:00
ZdenekSrotyr
1263b80726 fix(query): #168 review — concurrent-slot wraps execute, doc/JS fixes
Devin Review on PR #168 found 5 issues — all real, all addressed.

🚩 ANALYSIS_001 (architectural): concurrent-slot guard didn't protect
actual BQ query execution. Earlier `_enforce_remote_bq_quota_and_cap`
ran dry-run + cap check inside `with quota.acquire(user_id):`, then
returned — releasing the slot BEFORE `analytics.execute(...)` ran. Spec
§4.3.3 explicitly designs the slot to wrap execute so the per-user
concurrent cap limits BQ scans, not just dry-runs.

Refactor to a context manager `_bq_quota_and_cap_guard`. Caller's `with`
block now holds the slot through dry-run, cap check, the actual
`analytics.execute(...)` (which is what triggers the BQ scan when DuckDB
resolves the master view), AND the post-flight record_bytes. Slot
released only when caller's `with` body exits.

🟡 BUG_001: placeholder JS walked `original` (full GET payload root)
instead of `original.sections`. `placeholder_from: ["data_source",
"bigquery", "project"]` is a section-relative path, so billing_project
placeholder NEVER rendered. Fix: walk `original.sections` (with fallback
to `original` for safety).

🟡 BUG_002 + BUG_003: admin_tables.html register and edit modals'
operator help text referenced `max_bytes_per_remote_query` (the old
name from the spec) but the actual config key is `bq_max_scan_bytes`
after the fix-up commit `6423888d` moved it. Replace both occurrences.

🟡 BUG_004: CHANGELOG entry said `api.query.bq_max_scan_bytes` (the
old path) but the read at app/api/query.py:53 is
`get_value("data_source", "bigquery", "bq_max_scan_bytes", ...)`. An
operator who set it under `api.query` in their yaml would have no
effect. Correct path in CHANGELOG.

All 95 #160-affected tests pass after the changes.
2026-05-04 13:28:03 +02:00
ZdenekSrotyr
6423888d02 fix(query): #160 move bq_max_scan_bytes to data_source.bigquery (UI editable)
E2E test on dev VM revealed: spec said "configurable via /admin/server-config"
for the cost guardrail cap, but the underlying read path was
`api.query.bq_max_scan_bytes` and `api` is NOT in `_EDITABLE_SECTIONS`. POST
to /admin/server-config rejected `{"sections":{"api":...}}` as "unknown
section(s): api" — the cap was only adjustable via direct YAML edit.

Move to `data_source.bigquery.bq_max_scan_bytes`:
- `_default_remote_query_cap_bytes()` reads from the new path.
- Add to `_OPTIONAL_FIELDS["data_source"]["bigquery"]["fields"]` with the
  same shape as `max_bytes_per_materialize` (kind=int, default 5 GiB, hint).
- Add to `_BQ_OPTIONAL_FIELD_DEFAULTS` so it surfaces in the GET payload
  even when YAML omits it.

Convention now mirrors `max_bytes_per_materialize` — both BQ cost
guardrails live under `data_source.bigquery`, both editable in the UI.
2026-05-04 12:46:38 +02:00
ZdenekSrotyr
77cdb65f76 sec(query): #160 BQ_PATH catches quoted "bq" catalog token (Phase 3 review)
Phase 3 review identified an RBAC + cost-cap bypass: `SELECT * FROM
"bq"."ds"."tbl"` (catalog token quoted as a DuckDB identifier) was NOT
matched by the BQ_PATH regex, so direct quoted-form references skipped
both the registry check and the cost-cap dry-run. DuckDB resolves
`"bq"` to the same ATTACHed BQ catalog, so the bypass is real.

Widen the catalog-token alternation: `(?:"bq"|bq)` matches both forms.
Negative lookbehind `(?<![\w.])` still rejects look-alike prefixes
(`other_bq`, `my_bq`); the new "my_bq".ds.tbl negative test locks that
in alongside `other_bq.ds.tbl`.

Tests:
- 2 new positive cases in tests/test_query_bq_regex.py for the quoted
  form (`"bq"."finance"."ue"` and uppercase `"BQ"."ds"."tbl"`).
- 1 new negative case rejecting `"my_bq".ds.tbl` so the quoted-form
  widening doesn't open a different evasion.
- 1 new RBAC test in tests/test_api_query_rbac_bq_path.py: admin
  hitting an unregistered quoted path returns the same
  bq_path_not_registered 403 as the unquoted form.

All 33 Phase 3 tests pass after the fix.
2026-05-04 10:31:35 +02:00
ZdenekSrotyr
896c43c7a2 feat(query): #160 cost guardrail + bq.* RBAC + quota integration on /api/query
The headline implementation for issue #160. POST /api/query now gates
direct `bq."<dataset>"."<source_table>"` references behind the registry
and bounds the BQ scan cost behind a configurable cap. Wired through
the same singleton QuotaTracker as /api/v2/scan so daily-byte budgets
are shared across both BQ-touching paths.

Changes in app/api/query.py:

- Add module-level `BQ_PATH` regex matching the 16 syntax variants
  verified empirically (fully-quoted, unquoted, mixed quoting,
  case-insensitive, inside CTE bodies, multi-path, …).
- Add `bigquery_query` to the SQL keyword blocklist. Closes the
  pre-existing function-call backdoor where a user could run an
  arbitrary BQ jobs API call against any reachable dataset, bypassing
  the registry and RBAC. Wrap views internal to the BQ extractor still
  use bigquery_query() — but those run via DuckDB view resolution at
  query time, not via user-submitted SQL, so the blocklist doesn't
  break them.
- Add `_bq_guardrail_inputs` helper: walks user SQL twice — once for
  bare-name matches against accessible registered remote-BQ names
  (contributes to dry_run_set), once for direct `bq.X.Y` matches
  (gated against `find_by_bq_path` lookups, returns 403 with
  structured detail on miss or grant violation).
- Add `_enforce_remote_bq_quota_and_cap` helper: pre-flight
  `check_daily_budget` (over-cap → 429), then `with quota.acquire(...)`
  wraps a per-path BQ dry-run, sums bytes, raises 400
  `remote_scan_too_large` when total > cap.
- Cap default 5 GiB; configurable via `api.query.bq_max_scan_bytes`
  in /admin/server-config (next phase wires the UI).
- Post-flight `record_bytes` against the user's daily counter.
- Module-level imports of `_bq_dry_run_bytes`, `_build_quota_tracker`,
  `get_bq_access` so tests can monkeypatch via `app.api.query.<name>`.

Tests:
- All 23 RED tests from the previous commit now pass (regex matrix,
  blocklist with detail-string assertion, RBAC unregistered/admin-bypass,
  guardrail dry-run-called/over-cap-rejected, quota pre-flight 429).
- mock_dry_run fixture stubs both `_bq_dry_run_bytes` and `get_bq_access`
  so guardrail tests don't require a live BQ project.
- Quota test uses `admin1` (the seeded_app fixture's actual user id, not
  `admin`).

Smoke: 887 passed across query/bq/admin/extractor/registry/quota
domains. No regressions.
2026-05-04 10:31:35 +02:00
ZdenekSrotyr
dc03837a7b feat(query-api): better error message when --remote query references a materialized-but-not-rebuilt id
E2E sub-agent finding: `da query --remote "SELECT * FROM <id>"` against a
materialized table that hasn't yet been rebuilt in the server's
analytics.duckdb returns a confusing DuckDB "Table does not exist"
message even though the table is in the registry. Materialized rows
produce parquets at `${DATA_DIR}/extracts/<source>/data/<id>.parquet`,
but the orchestrator's master-view creation is `_meta`-driven — fresh
instances or pre-tick states have the registry row without a
corresponding view, so analysts hit the bare "does not exist" with no
path forward.

Improve the error rendering in `app/api/query.py:execute_query`. When
DuckDB raises a "table does not exist" error, scan the registry for any
`query_mode='materialized'` row whose id or name appears in the failed
SQL. On a hit, return a 400 whose detail names the table, explains the
materialize state, and offers two concrete next steps:

1. Run `da sync` (or wait for the scheduler tick / hit
   POST /api/sync/trigger) to materialize the parquet, OR
2. Query the source directly via the catalog alias when the registry row
   carries bucket+source_table (e.g. `bq."dataset"."table"` for BigQuery,
   `kbc."bucket"."table"` for Keboola).

Detection is bounded — the registry round-trip only fires when DuckDB's
error mentions a missing table, so happy-path queries pay no cost.
Non-materialized unknowns fall through to DuckDB's raw error.

2 new tests: materialized id surfaces the hint with the bucket+source_table
payload; unknown table falls back to the generic error path with no false
positive on the new hint.
2026-05-01 23:09:52 +02:00
ZdenekSrotyr
2e1dfb7553
feat(v2): claude-driven fetch primitives + 0.14.0 (#102)
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
2026-04-29 01:07:19 +02:00
ZdenekSrotyr
55515266ea fix: block DuckDB metadata functions and relative paths in query endpoint
Add information_schema, duckdb_* introspection functions, pragma_* functions,
and relative path traversal patterns to the SQL blocklist so users cannot
enumerate schema metadata regardless of RBAC. Add six corresponding tests.
2026-04-09 16:29:11 +02:00
ZdenekSrotyr
1b3acce7e9 fix: replace substring table access check with word-boundary regex
Replace substring matching with word-boundary regex in query endpoint's
table access validation. Prevents false positives where short table names
like 'id' would block any query containing the word. Uses re.escape() to
safely handle special characters in table names.

- Import re module at top
- Use regex pattern with word boundaries (\b) for matching
- Add tests to verify no false positives and proper blocking
2026-04-09 07:00:48 +02:00
ZdenekSrotyr
23ae6a602c security: harden query endpoint SQL blocklist and disable external access
Expand blocked keywords to cover parquet_scan, read_csv_auto, query_table,
iceberg_scan, delta_scan, call, URL schemes (http/https/s3/gcs), and
additional file-scan functions. Set enable_external_access=false on the
non-read-only analytics connection path. Add three new tests covering
parquet_scan, read_csv_auto, and query_table blocking.
2026-04-09 06:54:58 +02:00
ZdenekSrotyr
05a1b452e9 security: harden query (read-only DB), uploads (path sanitization), scripts (AST validation) 2026-04-08 12:09:19 +02:00
ZdenekSrotyr
1074d5ec49 feat: implement data access control — table-level permissions
Schema v3: add is_public column to table_registry (default true).

src/rbac.py: can_access_table() checks admin bypass, public flag,
explicit permissions, wildcard bucket permissions.

API enforcement:
- manifest: filters tables by user access
- download: 403 if no access
- catalog: filters table list
- query: validates referenced tables against allowed list

New admin permissions API (/api/admin/permissions) for grant/revoke.

28 access control tests + 733 total tests passing.
2026-03-31 12:33:31 +02:00
ZdenekSrotyr
c5527ec153 fix: harden script sandbox and SQL query security
Fixes found by E2E QA agent:
- Script sandbox: block os, sys, socket, eval, exec, open, __import__,
  getattr, pathlib and 20+ other dangerous patterns
- SQL query: block COPY, ATTACH, read_csv, semicolons, non-SELECT
- Added 24 security tests covering all attack vectors
2026-03-27 16:11:05 +01:00
ZdenekSrotyr
a3918d3833 feat: add FastAPI server with auth, RBAC, and all API endpoints
- JWT auth with role-based access control (viewer/analyst/admin/km_admin)
- Endpoints: health, sync manifest, data download, query, users CRUD,
  corporate memory, session/artifact upload
- 18 API tests covering auth, RBAC, all endpoints
2026-03-27 15:19:18 +01:00