81 commits

Author SHA1 Message Date
ZdenekSrotyr
e5645fd280 fix: devil's advocate R1 — chunked probe, parse-error heuristic narrow, pool settings refresh, content-length sanity, multi-project skip
R1 adversarial review surfaced 5 issues, all addressed:

#1 chunked download silently disabled in non-Caddy deployments (HEAD on
GET-only FastAPI route returns 405). _probe_range_support now falls back
to GET with Range: bytes=0-0 when HEAD fails — works against both
Caddy file_server (HEAD-friendly) and dev FastAPI direct (GET-only).

#2 parse-error fallback heuristic too broad — matched on Unrecognized
name / Function not found / No matching signature / Invalid cast,
which BQ surfaces for ordinary user-column typos. That triggered slow
ATTACH-catalog retry on every typo (2× latency tax). Narrowed to just
'Syntax error' / 'syntax error' which are the genuine DuckDB-vs-BQ
dialect mismatch markers.

#3 apply_bq_session_settings was only run on fresh-built pool entries,
not on reuse. An operator's /admin/server-config change to
bq_query_timeout_ms wouldn't propagate to long-lived pooled sessions until
restart. Fixed: re-apply on every pool acquire (idempotent + fail-soft).

#4 content-length sanity bound — a misconfigured proxy returning a
wildly inflated Content-Length would cause overlapping chunked Range
requests against the actual file → corrupt assembled output (caught
by manifest hash check, but only after wasted bandwidth). Cap at 100
GiB; above that, drop to single-stream.

#5 rewriter assumed every BQ row resolves under the single
bq.projects.data project. Bucket containing '.' suggests a project-
qualified bucket (multi-project deployment); rewriter would silently
target the wrong project. Conservative skip with regression test.
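
The probe fallback from item #1, combined with the size cap from item #4, could be sketched roughly as follows. This is a minimal illustration and not the project's `_probe_range_support`: the `fetch` callable, the `Accept-Ranges` check on the HEAD path, and the exact-case header lookups are all assumptions made to keep the example free of network calls.

```python
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical transport: (method, url, request headers) -> (status, response headers).
Fetch = Callable[[str, str, Mapping[str, str]], Tuple[int, Mapping[str, str]]]

MAX_SANE_CONTENT_LENGTH = 100 * 1024**3  # 100 GiB cap described in item #4


def probe_range_support(fetch: Fetch, url: str) -> Optional[int]:
    """Return the total size if ranged downloads look safe, else None."""
    status, headers = fetch("HEAD", url, {})
    if status == 200 and headers.get("Accept-Ranges", "").lower() == "bytes":
        raw = headers.get("Content-Length", "")
        size = int(raw) if raw.isdigit() else 0
    else:
        # HEAD failed (for example 405 on a GET-only route): retry with a
        # one-byte ranged GET and read the total from Content-Range.
        status, headers = fetch("GET", url, {"Range": "bytes=0-0"})
        if status != 206:
            return None
        # Content-Range looks like "bytes 0-0/12345"; the total follows the slash.
        total = headers.get("Content-Range", "").rpartition("/")[2]
        if not total.isdigit():
            return None
        size = int(total)
    if size <= 0 or size > MAX_SANE_CONTENT_LENGTH:
        return None  # implausible size: the caller should drop to single-stream
    return size
```

Returning `None` for every doubtful case keeps the caller's decision binary: either a trusted size for chunk planning, or a plain single-stream download.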
2026-05-06 13:50:46 +02:00
ZdenekSrotyr
14db85f506 fix(bq): map 'Response too large' to its own error class instead of generic bad_request
translate_bq_error previously mapped BQ's responseTooLarge failure mode
to bq_bad_request (HTTP 400 with the raw upstream message). The user-
facing implication ('your SQL has a syntax error') is wrong -- the root
cause is query shape (BQ refused to return the result inline because
it exceeded the response size limit), and the actionable remediation is
'narrow the WHERE clause, aggregate further, or use a materialized
table'.

Add bq_response_too_large as a first-class BqAccessError kind (also 400)
with a canonical hint message; original BQ message preserved in details
for operator debugging. Detection is substring-based on 'response too
large' and fires before the generic BadRequest path so the dedicated
mapping always wins. Affects every BQ-touching path since they all
share translate_bq_error -- /api/query, /api/v2/{scan,sample,schema},
materialize.
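
A minimal sketch of the ordering described above, in which the dedicated check runs before the generic bad-request mapping. The class shape, the hint wording, the `upstream_message` details key, and the case-insensitive match are assumptions for illustration; only the kind names and the 'response too large' substring come from the commit message.

```python
class BqAccessError(Exception):
    """Simplified stand-in for the project's typed error (shape is assumed)."""

    def __init__(self, kind, message, status=400, details=None):
        super().__init__(message)
        self.kind = kind
        self.status = status
        self.details = details or {}


# Hypothetical hint text, paraphrasing the remediation named in the commit.
RESPONSE_TOO_LARGE_HINT = (
    "Result exceeds BigQuery's response size limit: narrow the WHERE clause, "
    "aggregate further, or use a materialized table."
)


def translate_bad_request(upstream_message: str) -> BqAccessError:
    # Keep the raw upstream text for operator debugging in both branches.
    details = {"upstream_message": upstream_message}
    # The dedicated check runs first so it always wins over the generic path.
    if "response too large" in upstream_message.lower():
        return BqAccessError(
            "bq_response_too_large", RESPONSE_TOO_LARGE_HINT, 400, details
        )
    return BqAccessError("bq_bad_request", upstream_message, 400, details)
```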
2026-05-06 13:09:31 +02:00
ZdenekSrotyr
83209f32b0 perf(bq): pool DuckDB BQ extension sessions to amortize INSTALL/LOAD/ATTACH cost
Each BqAccess.duckdb_session() acquire previously created a fresh
in-memory DuckDB conn and ran INSTALL bigquery; LOAD bigquery;
CREATE SECRET; ATTACH on it -- costing ~0.5 s per request even before
any BQ work. Add a process-local pool (deque + lock) of pre-warmed
sessions; acquire reuses a warm entry when available, refreshing the
auth SECRET so a long-lived pool entry doesn't keep a stale GCE
metadata token past its TTL. Liveness probe (cheap SELECT 1) drops
broken entries before handing them to callers.

On exception inside the with-block the conn is closed instead of
returned to pool (session may carry dirty state). Pool size is
data_source.bigquery.session_pool_size (default 4; sentinel 0
disables pooling). Process-cached, not fork-safe (single uvicorn
worker is the supported deployment shape per CLAUDE.md).

All call sites get faster automatically: /api/query, /api/v2/{scan,
sample,schema}, materialize, the orchestrator's remote-attach, and
the BQ dry-run cap-guard.
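
The pooling behavior described above (deque plus lock, liveness probe, refresh on reuse, close on exception) can be sketched generically. This is an illustration under assumptions, not the BqAccess implementation: `SessionPool`, `factory`, and `refresh` are hypothetical names, and any object with `execute()` and `close()` stands in for a DuckDB connection.

```python
import threading
from collections import deque
from contextlib import contextmanager


class SessionPool:
    def __init__(self, factory, refresh, size=4):
        self._factory = factory  # builds a fully warmed session (slow path)
        self._refresh = refresh  # re-applies short-lived state, e.g. the auth secret
        self._size = size        # 0 disables pooling: nothing is ever kept
        self._idle = deque()
        self._lock = threading.Lock()

    def _take_idle(self):
        with self._lock:
            return self._idle.popleft() if self._idle else None

    @contextmanager
    def session(self):
        conn = None
        while conn is None:
            candidate = self._take_idle()
            if candidate is None:
                conn = self._factory()
                break
            try:
                candidate.execute("SELECT 1")  # cheap liveness probe
                self._refresh(candidate)
                conn = candidate
            except Exception:
                candidate.close()  # broken entry: drop it and try the next one
        try:
            yield conn
        except BaseException:
            conn.close()  # session may carry dirty state, so never return it
            raise
        with self._lock:
            keep = len(self._idle) < self._size
            if keep:
                self._idle.append(conn)
        if not keep:
            conn.close()
```

Closing instead of returning on any exception trades a little warm-up cost for the guarantee that a later caller never inherits a half-finished transaction or a poisoned setting.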
2026-05-06 13:06:25 +02:00
Vojtech Rysanek
32c8ea601a fix(bigquery): apply bq_query_timeout_ms on every BQ-extension attach + surface silent failures
The DuckDB BigQuery extension defaults bq_query_timeout_ms to 90 s,
which is too tight for analyst-scale queries against view-backed BQ
datasets. Agnes already has apply_bq_session_settings() that bumps it
to 600 s (configurable via data_source.bigquery.query_timeout_ms), but
two regressions let the 90 s default leak through to live queries:

1. apply_bq_session_settings() swallowed every Exception silently. If
   the BigQuery extension wasn't loaded on the connection yet, or the
   installed extension version didn't recognise the setting, the SET
   would fail and the function would return without surfacing the
   problem. Operators saw 90 s timeouts on 'agnes query --remote' with
   no log line explaining why.

2. The call sites in src/db.py:_reattach_remote_extensions and
   src/orchestrator.py:_remote_attach only invoked
   apply_bq_session_settings on the metadata-token branch (token_env
   empty, the BqAccess contract). The token-based and no-auth branches
   ran ATTACH against the BigQuery extension without ever applying the
   timeout setting — so any BQ source registered with an explicit
   token_env, or with no auth env at all, fell back to the 90 s default.

Fix:

- apply_bq_session_settings now logs WARNING on each failure path
  (instance_config import error, non-numeric value, SET execution
  failure, readback error). It also verifies the setting actually
  landed via SELECT current_setting('bq_query_timeout_ms') and logs
  WARNING when the readback disagrees with the requested value, which
  catches the silent-ignore case some extension versions exhibit.

- Both _reattach_remote_extensions (src/db.py) and _remote_attach
  (src/orchestrator.py) now call apply_bq_session_settings on every
  branch that ATTACHes a BigQuery alias, not only the metadata-token
  branch. Idempotent: calling it twice on the metadata-token path is a
  no-op SET.
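
The set-then-verify pattern from the first fix bullet might look roughly like this. It is a sketch, assuming a connection object whose `execute()` returns something with `fetchone()` (as DuckDB connections do); the function name `apply_query_timeout` and its boolean return value are hypothetical, and the instance_config lookup is omitted.

```python
import logging

logger = logging.getLogger(__name__)


def apply_query_timeout(conn, timeout_ms: int) -> bool:
    """Set bq_query_timeout_ms and confirm it landed. Fail-soft: never raises."""
    if timeout_ms <= 0:
        return False  # sentinel 0: keep the extension default
    try:
        conn.execute(f"SET bq_query_timeout_ms = {int(timeout_ms)}")
    except Exception as exc:
        logger.warning("SET bq_query_timeout_ms failed: %s", exc)
        return False
    try:
        row = conn.execute(
            "SELECT current_setting('bq_query_timeout_ms')"
        ).fetchone()
    except Exception as exc:
        logger.warning("bq_query_timeout_ms readback failed: %s", exc)
        return False
    actual = row[0] if row else None
    if str(actual) != str(timeout_ms):
        # Catches extension versions that accept the SET but ignore it.
        logger.warning(
            "bq_query_timeout_ms readback mismatch: wanted %s, got %s",
            timeout_ms,
            actual,
        )
        return False
    return True
```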

Tests:

- Extended the _RecordingConn fixture to support .fetchone() so the
  readback assertion path works. Updated existing call-shape
  assertions to expect the SELECT current_setting readback alongside
  the SET. Added two new tests covering the WARNING surfaces for SET
  failure and readback mismatch — regression guards for the silent-
  fallback bug this PR addresses.

- Full BQ-touching suite (398 tests) passes.
2026-05-06 11:24:14 +04:00
ZdenekSrotyr
4f04235502 feat(bigquery): bq_query_timeout_ms knob; default 600s (was 90s)
DuckDB BigQuery extension defaults `bq_query_timeout_ms` to 90 s, which
is too tight for analyst-scale queries against view-backed BQ datasets.
`agnes query --remote` HTTP 400'd with `Binder Error: Query execution
exceeded the timeout. Job ID: ...` whenever the underlying BQ job ran
longer than 90 s, even though the job itself was healthy.

Add `data_source.bigquery.query_timeout_ms` (default 600 000 ms = 10 min,
sentinel 0 falls through to the extension default). Applied via
`SET bq_query_timeout_ms` after every `LOAD bigquery` on every BQ-touching
DuckDB session: orchestrator's `_remote_attach` ATTACH path, BqAccess
session factory, and the standalone extractor. Configurable via
`/admin/server-config` UI.

Fail-soft: extension versions that don't recognise the setting silently
keep the default rather than poisoning the session.
2026-05-05 16:40:40 +02:00
ZdenekSrotyr
4751094e1c
fix(keboola): per-table fallback to legacy Storage-API client (#183)
* fix(keboola): per-table fallback to legacy Storage-API client

The DuckDB Keboola extension's per-table COPY fails with
`Schema '..."in.c-..."' does not exist or not authorized` on
projects whose Snowflake backend doesn't expose bucket schemas
to the storage-token-derived QueryService role
(keboola/duckdb-extension#17). ATTACH itself succeeds, so the
existing extension-level fallback in `_try_attach_extension`
never triggers — the table is just marked failed.

- Promote `kbcstorage>=0.9.0` from optional to core dep so the
  legacy client import in `_extract_via_legacy` doesn't crash
  default installs with `ModuleNotFoundError`.
- Wrap `_extract_via_extension` in a per-table try/except so a
  scan failure retries via `_extract_via_legacy` instead of
  recording `tables_failed` and moving on.

Slower than the extension path, but produces correct parquets
on affected projects while the upstream extension fix lands.

* test(keboola): cover per-table extension→legacy fallback

Two existing tests mocked _extract_via_extension to throw and asserted
the original message survived in result["errors"]. With per-table
fallback, the new flow retries via _extract_via_legacy — which on the
mock URLs would throw a different (404 / DNS-fail) error, replacing the
asserted message.

- Mock _extract_via_legacy alongside _extract_via_extension in
  test_network_timeout_during_extraction +
  test_partial_failure_continues +
  test_all_tables_fail_returns_full_failure_stats so the assertion
  observes the final propagated error from the fallback chain.
- Add test_extension_per_table_failure_falls_back_to_legacy that
  exercises the new behavior directly: extension scan fails with the
  QueryService schema-not-authorized message
  (keboola/duckdb-extension#17), legacy succeeds, parquet ends up
  queryable.
2026-05-05 15:47:44 +02:00
ZdenekSrotyr
e86dd5edc5 fix(anthropic): strict json_schema (additionalProperties=false) + add /admin/scheduler-runs UI
E2E test on a real BQ deploy showed every verification-extraction call
fails with HTTP 400 invalid_request_error: "output_config.format.schema:
For 'object' type, 'additionalProperties' must be explicitly set to false".
The Anthropic structured-output API now requires the field on every object
node in the json_schema. Fix: connectors/llm/anthropic_provider.py wraps
the caller-supplied schema through a recursive _strict_json_schema()
walker that adds the field where missing (preserving any explicit
override), then passes the strict variant to the API. Six unit tests in
TestStrictJsonSchema pin the recursion across nested objects, array items,
and the no-mutation invariant.
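
A recursive walker of the kind described could be sketched as below. This is an illustrative version, not the code in connectors/llm/anthropic_provider.py; it assumes the schema is plain dicts and lists, and it would also touch dict values nested under keywords such as `default` or `enum` if they happened to carry `"type": "object"`.

```python
from typing import Any


def strict_json_schema(schema: Any) -> Any:
    """Return a copy with additionalProperties=False on object nodes lacking it."""
    if isinstance(schema, list):
        return [strict_json_schema(item) for item in schema]
    if not isinstance(schema, dict):
        return schema  # scalars are returned unchanged
    # Rebuild the dict instead of editing in place, so the caller's
    # schema is never mutated.
    out = {key: strict_json_schema(value) for key, value in schema.items()}
    if out.get("type") == "object":
        # setdefault preserves any explicit override the caller supplied.
        out.setdefault("additionalProperties", False)
    return out
```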

Adds /admin/scheduler-runs — a read-only admin page that surfaces the
last 200 audit-log entries from scheduler-driven actions. New
AuditRepository.query_actions(actions, limit) helper, new admin nav
entry. Failed scheduler ticks (HTTP 401, network errors) don't reach
the audit_log; the page calls that out with a hint to set
SCHEDULER_API_TOKEN if no rows show up.
2026-05-05 08:00:57 +02:00
ZdenekSrotyr
bbb04ac041 fix(setup): seed default ai: block + env-var fallback (#176)
POST /api/admin/configure now writes a default ai: block into the
instance.yaml overlay when the request leaves it untouched and either
ANTHROPIC_API_KEY or LLM_API_KEY is set in the environment. The block
references the env var via ${VAR} syntax — secrets never land in YAML.

connectors.llm.factory grows create_extractor_from_env_or_config which
falls back to ANTHROPIC_API_KEY / LLM_API_KEY when ai_config is empty
and raises a clear ValueError when neither is available. Both
services/corporate_memory and services/verification_detector switch to
the new helper, replacing the old 'silently skip when ai: missing'
path that was the silent-failure root cause.

Tests:
- tests/test_setup_ai_block.py — overlay seeding contract.
- tests/test_llm_provider_env_fallback.py — fallback + fail-fast.
2026-05-04 23:55:19 +02:00
ZdenekSrotyr
c432e90f62 fix(bq-materialize): TTL reclaim was dead code (Devin Review on extractor.py:166)
`_try_acquire_file_lock` opened the lock file with `open(mode='w')`
BEFORE the mtime check, which truncated the file and refreshed mtime
to now. The subsequent age check always saw ~0, so the TTL reclaim
branch was never reachable and `materialize.lock_ttl_seconds` was
a silent no-op config knob.

Repro:
  before open(w): mtime age = 100000s
  after  open(w): mtime age = 0s

Fix: stat the lock path BEFORE any open(). If pre-probe mtime is
older than TTL, unlink (forcing a fresh inode for the open + flock
that follows). Order is now stat-then-decide-then-probe, not
probe-then-stat-then-decide.
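
The stat-then-decide-then-probe order can be sketched as follows. The helper names are hypothetical and the flock step is left out; the point is only that the age check reads mtime before anything opens the file for writing.

```python
import os
import time


def lock_is_stale(lock_path: str, ttl_seconds: float) -> bool:
    """Check the lock's age BEFORE any open(); mode='w' would reset mtime."""
    try:
        age = time.time() - os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no lock file yet, so nothing to reclaim
    return age > ttl_seconds


def open_lock_file(lock_path: str, ttl_seconds: float):
    """Reclaim a stale lock by unlinking it, then open a fresh file."""
    if lock_is_stale(lock_path, ttl_seconds):
        try:
            os.unlink(lock_path)  # forces a new inode for the open that follows
        except FileNotFoundError:
            pass  # another process reclaimed it first
    return open(lock_path, "w")
```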

Two regression tests added in tests/test_bq_materialize_concurrency.py:
- test_stale_held_lock_is_reclaimed_despite_live_holder — exercises
  the full reclaim path with a still-living fcntl holder. Pre-fix
  this returned None (in_flight forever); post-fix returns a holder
  fd on a new inode.
- test_failed_probe_does_not_self_refresh_lock_mtime — sister test
  pins that a failed acquisition's mode='w' truncate doesn't
  pathologically loop.

Residual cross-process risk stays documented in the helper docstring:
a materialize that genuinely overruns the TTL can race a fresh attempt.
Both write to the same parquet.tmp, and because flock is keyed per
inode, the new acquisition succeeds while the old holder is still alive.
In-process threading.Lock keyed on table_id blocks the single-process
race; cross-process protection relies on TTL being well above
longest plausible COPY (24h default).
2026-05-04 22:36:56 +02:00
ZdenekSrotyr
8784f10a6b fix(devin-review): stale-token override + status sessions counter + lock comment
Three Devin Review findings on PR #173 addressed in one commit since
they're in adjacent code paths:

1. cli/commands/init.py:99 (🔴): `agnes init --token NEW` ran
   step 2 verify against the OLD on-disk token because `get_token()`
   read `~/.config/agnes/token.json` before the env var, and
   `_override_server_env` only set the env var. So `agnes init --force`
   on a machine with a stale token.json failed 401 with a confusing
   'token expired' even though the --token arg was valid.

   Fix: ContextVar-based override in `cli.config._token_override`
   checked by `get_token()` BEFORE the on-disk read.
   `_with_token_override` context manager scopes the override.
   `_override_server_env` now also sets the contextvar via
   `_with_token_override(token)`, so both env var and contextvar
   carry the override (env for back-compat with anything bypassing
   get_token; contextvar is the authoritative source).
   Async-safe (each task sees its own override) and leak-proof
   (resets on context exit).
   2 new tests: regression on stale-disk-token + scope leak guard.

2. cli/commands/status.py:43 (🟡): sessions_pending_upload only
   checked legacy `<workspace>/user/sessions/` and always reported 0
   in workspaces bootstrapped with `agnes init` (Claude Code writes
   to `~/.claude/projects/`, not the legacy path). Same bug we fixed
   for `agnes push` in 08e49591.

   Fix: route through `cli.lib.claude_sessions.list_session_files()`
   so status and push agree on what counts as a pending session.

3. connectors/bigquery/extractor.py:111 (🟡): docstring claimed
   "a live holder still wins the second flock attempt" — incorrect on
   Linux. After `unlink()` + `open()`, the new file is a new inode;
   fcntl.flock keys per-inode, so the old holder's lock does NOT block
   the new acquisition. In a genuine TTL-overrun scenario two writers
   CAN race the parquet.tmp.

   Fix: documentation only. Comment now honestly describes the
   inode-recreation behavior, names the threading.Lock as the actual
   in-process guard, and flags pid-gating as the next-iteration fix
   if real corruption surfaces. The 24h default TTL is well above
   typical COPY durations so the practical risk is low.
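
The ContextVar-based override from finding 1 can be illustrated with a small sketch. The names mirror the commit message, but the bodies are assumptions, and `read_from_disk` is a hypothetical stand-in for the token.json read.

```python
import contextvars
from contextlib import contextmanager
from typing import Callable, Iterator, Optional

_token_override = contextvars.ContextVar("token_override", default=None)


@contextmanager
def with_token_override(token: str) -> Iterator[None]:
    """Scope an override; the previous value is restored on exit."""
    reset_handle = _token_override.set(token)
    try:
        yield
    finally:
        _token_override.reset(reset_handle)


def get_token(read_from_disk: Callable[[], Optional[str]]) -> Optional[str]:
    # The override is consulted BEFORE the on-disk token, so an explicit
    # --token argument beats a stale token.json.
    override = _token_override.get()
    if override is not None:
        return override
    return read_from_disk()
```

Because each asyncio task runs in its own copy of the context, an override set in one task is not visible to its siblings, which is what makes this approach safer than a module-level global.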

Tests: 17/17 across test_cli_init.py + test_lib_pull.py + the broader
regression set.
2026-05-04 21:26:30 +02:00
ZdenekSrotyr
8233c3e3f9 chore(docs): replace stale da verbs and vendor-specific install paths
Sweep operator runbooks (docs/QUICKSTART, docs/HEADLESS_USAGE,
docs/architecture, docs/sample-data, docs/agent-workspace-prompt,
docs/metrics/metrics.yml, dev_docs/server, dev_docs/disaster-recovery),
the corporate-memory service README, the jira connector README + backfill
scripts, the deploy skill, and test docstrings. Replaces `da sync` →
`agnes pull`, `da analyst setup` → `agnes init`, `da metrics ...` →
`agnes catalog --metrics` / `agnes admin metrics ...`, `da fetch` →
`agnes snapshot create`, plus the matching docker-compose admin
invocations.

Vendor-specific `/opt/data-analyst/` install paths in jira backfill /
consistency scripts and operator docs are replaced with the
placeholder `<install-dir>` and a new `AGNES_ENV_FILE` env-var override
that lets a deployment inject its actual install path without a code
change. Aligns with the OSS vendor-agnostic policy in CLAUDE.md.

CHANGELOG `### Internal` entry summarizes the audit and reaffirms the
intentional stale-marker tuples (`_LEGACY_STRINGS`, `_OUR_COMMAND_MARKERS`)
that must keep referencing `da sync` / `da fetch` / etc. for hook upgrade
and override-detection logic.
2026-05-04 21:22:19 +02:00
ZdenekSrotyr
8890b6f09b fix(post-merge): clean up stale da verbs introduced via #174 merge
Four call sites where #174 (branched from main before the agnes rename
fully landed in some files) emitted or referenced `da fetch`. None are
operator-visible runtime crashes — but `extractor.py` logs a stale
verb to the operator log and `DATA_SOURCES.md` is current docs:

- connectors/bigquery/extractor.py:431,434 (operator-facing log line on
  unverified BQ entity_type — was suggesting `da fetch`).
- docs/DATA_SOURCES.md:77,85 (current public docs, two refs to
  `da fetch` in the workflow + the BQ scope description).
- tests/test_cli_query_render.py:7 (module docstring listed
  `da fetch / agnes schema / etc.` — now `agnes snapshot create / agnes
  schema / etc.`).
- tests/test_cli_snapshot_create.py:1 (docstring referenced `(folded
  from `da fetch`)` — historical, removed; no value once the rename
  landed).

Pre-existing stale `da` references elsewhere in the branch (templates,
operator runbooks, internal comments) are not touched by this commit —
they live outside the merge surface and are a separate cleanup task.

Verified: 10/10 across the affected test files pass.
2026-05-04 20:57:36 +02:00
ZdenekSrotyr
dc7e27082d fix(bq-materialize): code-review follow-ups for 16eaf7a3
- extractor._try_acquire_file_lock: close fd and re-raise on non-
  BlockingIOError from fcntl.flock (read-only fs, unsupported flock,
  fd exhaustion). Pre-fix the fd leaked silently and the underlying
  OSError still propagated past the caller.
- extractor: reorder module-level layout so logger is bound before
  the new lock-related helpers reference it. Deferred import of
  app.instance_config inside _get_lock_ttl_seconds documented inline.
- extractor: comment _table_locks unbounded-by-design rationale.
- tests: docstring + monkeypatch-target rationale for the two
  concurrency tests where the contract isn't obvious from the body.
2026-05-04 17:59:21 +02:00
ZdenekSrotyr
16eaf7a399 feat(bq-materialize): per-table mutex + file lock with TTL reclaim
Two layers of concurrency control. Layer 1 is a per-table_id
threading.Lock keyed on table_id; Layer 2 is fcntl.flock on a sibling
<id>.parquet.lock file. Overlapping calls for the same id raise
MaterializeInFlightError, which the caller treats as 'skipped,
in_flight' instead of a hard error. Stale file locks (mtime older
than materialize.lock_ttl_seconds, default 86400) are reclaimed on
the next attempt — covers the rare case where a holder was hard-killed
before kernel-level flock release.

Pre-fix, when a materialize ran longer than the scheduler tick interval
(15 min), the next tick called materialize_query for the same id, hit
the unconditional tmp_path.unlink() at function entry, and started a
second COPY against the same path. Both writers interleaved bytes;
the original COPY's read_parquet validation then failed with
'No magic bytes found at end of file'.
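
The two layers could be combined along these lines. This is a Unix-only sketch (it relies on `fcntl`) that omits the TTL reclaim; `materialize_guard` and `_table_locks_guard` are hypothetical names, while `MaterializeInFlightError` and `_table_locks` follow the names used in these commits.

```python
import fcntl
import os
import threading
from collections import defaultdict
from contextlib import contextmanager


class MaterializeInFlightError(RuntimeError):
    """Raised when another materialize for the same table id is running."""


_table_locks = defaultdict(threading.Lock)
_table_locks_guard = threading.Lock()  # protects the dict itself


@contextmanager
def materialize_guard(table_id: str, lock_path: str):
    with _table_locks_guard:
        table_lock = _table_locks[table_id]
    # Layer 1: in-process mutex; non-blocking so overlapping calls are skipped.
    if not table_lock.acquire(blocking=False):
        raise MaterializeInFlightError(table_id)
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            # Layer 2: advisory file lock, visible to other processes.
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                raise MaterializeInFlightError(table_id) from None
            yield
        finally:
            os.close(fd)  # closing the descriptor also releases the flock
    finally:
        table_lock.release()
```

A caller would wrap the COPY in `with materialize_guard(table_id, lock_path):` and translate `MaterializeInFlightError` into a 'skipped, in_flight' result.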
2026-05-04 17:40:21 +02:00
ZdenekSrotyr
a2afcfe59a fix(bq-materialize): code-review follow-ups for d8a22996
- tests/test_bq_cost_guardrail.py: assert fail-open warning is logged
  (test previously only proved fail-open doesn't crash; review note:
  warning is the only operator-visible signal of the silent failure).
- extractor._wrap_admin_sql_for_jobs_api: docstring no longer claims
  DuckDB-flavor SQL is rejected — the function performs no inner-SQL
  validation; the v24 migration + register-time validator are the
  real enforcement points.
- extractor.materialize_query: safe_path uses _escape_sql_string_literal
  instead of inline replace, for one-place-to-update consistency.
- extractor: import hashlib hoisted to module-level imports.
2026-05-04 16:52:18 +02:00
ZdenekSrotyr
d8a2299633 fix(bq-materialize): wrap admin SQL in bigquery_query() so views work
Pre-fix, materialize ran the admin source_query as 'COPY (sql) TO parquet'
through the DuckDB BQ extension session. The extension defaults to the
BQ Storage Read API for bq.<ds>.<tbl> references, which rejects views
('non-table entities cannot be read with the storage API'). The fix
always wraps admin SQL into bigquery_query('<billing>', '<inner>') so
COPY uses the BQ jobs API uniformly for tables and views.
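
The wrapping step might be sketched as below. The function names are hypothetical (the commits mention `_wrap_admin_sql_for_jobs_api` and `_escape_sql_string_literal` without showing their bodies), and the `SELECT * FROM` prefix and `(FORMAT PARQUET)` option are assumptions about how the wrapped call is embedded in the COPY.

```python
def escape_sql_string_literal(value: str) -> str:
    """Double single quotes so the value can sit inside a '...' SQL literal."""
    return value.replace("'", "''")


def wrap_for_jobs_api(billing_project: str, inner_sql: str) -> str:
    """Route BQ-native SQL through bigquery_query() so views work too."""
    return (
        "SELECT * FROM bigquery_query("
        f"'{escape_sql_string_literal(billing_project)}', "
        f"'{escape_sql_string_literal(inner_sql)}')"
    )


def build_copy_statement(billing_project: str, inner_sql: str, tmp_path: str) -> str:
    wrapped = wrap_for_jobs_api(billing_project, inner_sql)
    return (
        f"COPY ({wrapped}) TO '{escape_sql_string_literal(tmp_path)}' "
        "(FORMAT PARQUET)"
    )
```

Keeping the inner SQL separate from the wrapper is also what lets a dry-run cost check parse the BQ-native statement on its own.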

Cost guardrail dry-run now operates on the inner SQL (BQ-native), so
the BQ Python client parses it and the cap engages — pre-fix the dry-run
hit 'Table-valued function not found: bigquery_query' and fail-opened.
2026-05-04 16:40:40 +02:00
ZdenekSrotyr
10d7bd62f8 fix(bq): #160 wrap views via bigquery_query() for VIEW/MATERIALIZED_VIEW
Issue #160: da query --remote against query_mode='remote' BQ rows whose
underlying entity is a VIEW or MATERIALIZED_VIEW returned a DuckDB catalog
error because the extractor (with legacy_wrap_views=False default since
the v2 fetch primitives release) skipped master-view creation for those
entity types — but kept inserting the _meta row, leaving operators with a
registered name that resolves to nothing.

Always create a master view for entity types we have proven runtime support
for in this codebase:

  BASE TABLE          → bq."<dataset>"."<source_table>"
                        (Storage Read API path; predicate pushdown)
  VIEW / MAT_VIEW     → bigquery_query('<project>', 'SELECT * FROM `proj.ds.tbl`')
                        (jobs API path; no pushdown — the upcoming /api/query
                        cost guardrail bounds the scan; was the legacy
                        legacy_wrap_views=True branch SQL form, just always-on)

For other entity types (EXTERNAL, SNAPSHOT, CLONE, future), log a warning
and SKIP both the master view AND the _meta row. The registry row remains
intact so /api/v2/scan still works for `da fetch`; we just don't expose a
stale _meta entry that the orchestrator would later strand.
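
The dispatch above amounts to a small mapping from entity type to the relation a master view selects from. A hedged sketch, assuming identifiers are already validated (no quoting or escaping is done here) and using a hypothetical function name; both spellings of the materialized-view type are accepted because the commit message uses more than one.

```python
from typing import Optional

_VIEW_TYPES = ("VIEW", "MATERIALIZED VIEW", "MATERIALIZED_VIEW")


def master_view_source(
    entity_type: str, project: str, dataset: str, table: str
) -> Optional[str]:
    """Return the relation to select from, or None for unsupported types."""
    if entity_type == "BASE TABLE":
        # Storage Read API path via the attached catalog.
        return f'bq."{dataset}"."{table}"'
    if entity_type in _VIEW_TYPES:
        # Jobs API path: views cannot be read through the Storage Read API.
        inner = f"SELECT * FROM `{project}.{dataset}.{table}`"
        return f"bigquery_query('{project}', '{inner}')"
    # EXTERNAL, SNAPSHOT, CLONE, and anything else: the caller skips both
    # the master view and the _meta row.
    return None
```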

The legacy_wrap_views config knob is still readable in this commit (read
returns the value, which is then ignored). Removal across the rest of
the codebase happens in the follow-up REFACTOR commit.

tests/test_bigquery_extractor.py:
- Add 3 RED tests covering the new always-wrap behavior:
  test_view_creates_wrap_view_with_default_config,
  test_materialized_view_creates_wrap_view_with_default_config,
  test_unsupported_entity_type_skips_meta_and_view.
- Fix pre-existing flakiness in test_main_exits_when_project_missing
  by resetting app.instance_config cache before the no-project mock —
  the prior test populates the cache with a project, and removing the
  legacy_wrap_views get_value() call surfaced this latent ordering bug.
2026-05-04 10:31:35 +02:00
ZdenekSrotyr
16938ae7cb fix(materialized): address 4 Devin Review findings on PR #152
Devin Review on commit 7052a235 flagged 4 real bugs in the Keboola
materialized path. All four are fixed; 3 new regression tests pin the
behavior so future refactors can't quietly regress.

BUG_pr-review-job-3fbd31c9_0001 — _run_materialized_pass gated behind 'if bq_project:'
  app/api/sync.py:444-466 wrapped the entire materialized pass (which
  dispatches BOTH BigQuery AND Keboola rows by source_type) in a check
  for data_source.bigquery.project being non-empty. On Keboola-only
  instances this short-circuited and Keboola materialized rows sat in
  table_registry forever without their SQL being evaluated — the feature
  the CHANGELOG advertised was dead code on the most common deployment shape.
  Fix: always run the materialized pass; the BQ branch's per-row try/except
  catches the typed BqAccessError(not_configured) the sentinel raises
  when no BQ project is set, so non-BQ instances incur a per-row error
  for any (hypothetical) BQ-tagged row but the Keboola path runs cleanly.
  Log line renamed 'Materialized BQ' → 'Materialized SQL' to match.

BUG_pr-review-job-3fbd31c9_0004 — wrong config key 'url' instead of 'stack_url'
  app/api/sync.py:149 read get_value('data_source', 'keboola', 'url'),
  but the canonical config key documented in instance.yaml.example:111
  and used by app/api/admin.py:1503 + 2359 is 'stack_url'. Production
  Keboola instances would always see an empty URL and fail with the
  'not configured' error. The pre-existing test patched the wrong key
  too, so it passed without catching the mismatch. Fix: use stack_url
  in both sync.py and the test fixture.

BUG_pr-review-job-3fbd31c9_0003 — no atomic write in Keboola materialize_query
  connectors/keboola/extractor.py wrote COPY directly to the final
  '<id>.parquet' path. A mid-COPY failure (network, disk full, extension
  crash) left a partial parquet that the orchestrator rebuild would
  later pick up and serve to analysts. BQ's materialize_query already
  uses a '<id>.parquet.tmp' staging path + os.replace() atomic swap
  (connectors/bigquery/extractor.py:370-445); Keboola now mirrors that
  pattern with the same try/except cleanup on COPY failure.

BUG_pr-review-job-3fbd31c9_0002 — full file read into memory for MD5
  connectors/keboola/extractor.py:60-62 used parquet_path.read_bytes() for the MD5 hash.
  Multi-GB Keboola materialized results would OOM on memory-constrained
  containers. BQ's version uses streaming 8 KiB-chunk hashing
  (connectors/bigquery/extractor.py:438-442); Keboola now mirrors it.
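
The atomic-write and streaming-hash patterns from the last two findings can be sketched together. This is an illustration rather than the extractor code: `write_atomically` and its `produce` callback (standing in for the COPY) are hypothetical, while the `.tmp` staging suffix, `os.replace()`, and the 8 KiB chunk size follow the description above.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def md5_streaming(path: Path, chunk_size: int = 8192) -> str:
    """Hash the file in 8 KiB chunks so memory use stays flat."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_atomically(final_path: Path, produce: Callable[[Path], None]) -> str:
    """Write to '<name>.tmp', then swap into place; return the MD5 hex digest."""
    tmp_path = final_path.with_name(final_path.name + ".tmp")
    try:
        produce(tmp_path)
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # never leave a partial file behind
        raise
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return md5_streaming(final_path)
```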

Tests:
  - test_run_sync_runs_materialized_pass_on_keboola_only_instance —
    pins BUG_0001's fix; setting bigquery.project='' must NOT skip
    Keboola materialized dispatch
  - test_keboola_materialize_atomic_write_on_failure — pins BUG_0003;
    a mid-COPY RuntimeError leaves no .parquet AND no .parquet.tmp at
    the canonical path
  - test_keboola_materialize_uses_tmp_path_during_copy — documents the
    atomic-write contract: COPY targets .parquet.tmp, final swap to
    .parquet (no .tmp suffix on the result['path'])
  - existing test_run_materialized_pass_dispatches_keboola_to_keboola_extractor
    fixture updated: stack_url instead of url

Full sweep: 2505 passed, 25 skipped, 0 failed (modulo 8 pre-existing
internal_roles schema-migration failures called out in the task brief).
2026-05-01 20:58:17 +02:00
ZdenekSrotyr
85d3810535 feat(materialized): query_mode='materialized' for BigQuery + Keboola — admin SELECT → parquet → analyst
Closes the 'admin pre-stages a curated table/view for analysts' use case end-to-end across both supported source connectors.

Backend (BigQuery + Keboola, schema v20):
  - schema v20 adds source_query TEXT to table_registry (renumbered from v19 after main's #150 RBAC migration also bumped to v19)
  - connectors/bigquery/extractor.py adds materialize_query(table_id, sql, *, bq, output_dir, max_bytes=...) — BqAccess session, dry-run cost guardrail (default 10 GiB, configurable via data_source.bigquery.max_bytes_per_materialize), idempotent ATTACH, rows/bytes/md5 metadata for sync_state
  - connectors/keboola/access.py — new KeboolaAccess facade (parallel of BqAccess) wrapping ATTACH 'keboola://...' AS kbc
  - connectors/keboola/extractor.py adds materialize_query — same shape, no dry-run analog (Keboola Storage API has different cost model); legacy bucket-download path skips query_mode='materialized' rows
  - app/api/sync.py:_run_materialized_pass dispatches by source_type to the right materialize_query
  - app/api/admin.py: RegisterTableRequest accepts source_query; model_validator coheres mode↔source_query↔bucket; PUT preserves omitted fields; deprecation marks (Field(deprecated=True)) on sync_strategy + profile_after_sync (no extractor reads them; profile_after_sync becomes inert — bug from earlier work where /api/sync/trigger never honored the flag); _BQ_OPTIONAL_FIELD_DEFAULTS injects defaults into GET /server-config payload

Operator + CLI surface:
  - da admin register-table --query / --query-mode materialized
  - scripts/smoke-test-materialized-bq.sh — end-to-end smoke for operators

Tests (incl. spike + integration + regression):
  - test_db_migration_v20, test_table_registry_source_query
  - test_bq_materialize, test_bq_cost_guardrail, test_bq_init_extract_skips
  - test_keboola_access, test_keboola_extension_query_passthrough (lock-in for the DuckDB extension capability), test_keboola_materialize, test_keboola_init_extract_skips, test_keboola_materialized_e2e (skipped without KBC_TEST_* creds)
  - test_sync_trigger_materialized, test_sync_trigger_keboola_materialized
  - test_api_admin_materialized, test_cli_admin_materialized
  - test_admin_bq_register, test_admin_discover_bigquery, test_admin_keboola_materialized, test_admin_phase_c_deprecation, test_admin_put_preservation, test_materialized_e2e

Cost: BQ uses bigquery_query() (jobs API, view-aware) — works on tables, views, materialized views uniformly. Keboola uses ATTACH+COPY parquet through the DuckDB extension.
2026-05-01 20:25:56 +02:00
ZdenekSrotyr
83adf01bde
fix(v2): #134 BigQuery cross-project errors return structured 502/400 + BqAccess facade (#138)
* docs(spec): #134 unify BigQuery access behind BqAccess facade

Brainstorm output for issue #134. Captures:
- root cause (incl. correction of the issue's hypothesis about commit 33a9964)
- BqAccess facade API + project resolution rules
- error contract — typed BqAccessError mapped to HTTP 502 for upstream
  BQ failures, 500 for deployment/config bugs
- migration plan for v2_scan, v2_sample, RemoteQueryEngine
- test rewrite eliminating _bq_client_factory injection point
- E2E verification protocol on agnes-development as success criterion

* docs(spec): #134 revise after first review

Incorporates code-reviewer findings:

Must-fix:
- Add v2_schema (2 copies of INSTALL/LOAD/SECRET dance) to migration scope.
- Reframe v2_scan headline: missing try/except around BQ calls is the
  actual cause of bare 500s, not project resolution (which 33a9964 fixed).
- List two more deferred call sites (extractor.py, register_bq_table)
  with explicit rationale.

Important:
- Drop billing != data clause from cross_project_forbidden heuristic;
  rely only on 'serviceusage' substring. billing != data is normal
  for cross-project setup, was over-classifying.
- Split bq_bad_request into _user (400) and _server (502) variants;
  add sql_origin parameter to translate_bq_error so call sites declare
  whether SQL contains user input.
- Add @functools.cache to BqAccess.from_config; document tests bypass
  via dependency_overrides.
- Replace monkey-patched-classmethod test pattern with
  BqAccess(client_factory=...) injection at construction time. Cleaner
  than today's _bq_client_factory and 1:1 migration shape.
- Keep BqProjects.data (reviewer assumed registry has source_project;
  it doesn't). Multi-project explicitly listed as non-goal with note.

Nice-to-have:
- Add 'Implementation strategy' section: 2 staged commits (bug fix
  alone is revertable; refactor follows).
- Extend E2E protocol to cover all three endpoints, not just /sample.
- Note removal of stale docstring at src/remote_query.py:204.

* docs(spec): #134 revision 3 — incorporates second-round review

Must-fix from second review:
- v2_schema split into two migration cases: _fetch_bq_schema translates
  errors via translate_bq_error; _fetch_bq_table_options preserves its
  swallow-all 'except Exception → return {}' so /schema doesn't 502 on
  partition-info failures.
- RemoteQueryEngine.__init__ now resolves BqAccess lazily (in
  _get_bq_client, not in __init__). Without this, ~7 DuckDB-only tests
  in test_remote_query.py would suddenly fail with not_configured.
- translate_bq_error pass-through for BqAccessError is now load-bearing
  (clause 1, before any Google-API branch). bq.client() raises BqAccessError
  for bq_lib_missing/auth_failed; without explicit pass-through those
  fall to 'unknown' and re-raise as bare 500.
- Commit 1 now emits the SAME structured response shape as commit 2 to
  avoid contract churn between commits.
- BIGQUERY_PROJECT env-var precedence is BREAKING for env-only deployments
  — flagged in CHANGELOG ### Changed.

Editorial:
- sql_origin renamed to bad_request_status with values 'client_error' /
  'upstream_error' (clearer about what the parameter actually decides).
  bq_bad_request_user/_server kinds collapsed to bq_bad_request (400)
  and bq_upstream_error (502).
- CLI (cli/commands/query.py) noted as external RemoteQueryEngine caller;
  unaffected because new bq_access kwarg has default None.
- Added unit/integration tests for the new contracts:
  test_translate_passes_through_BqAccessError,
  test_v2_scan_returns_500_on_bq_lib_missing,
  test_v2_schema_returns_200_with_empty_partition_on_bq_failure,
  test_resolve_succeeds_after_config_set.
- E2E protocol now covers /schema as the fourth endpoint.
- Documented functools.cache-doesn't-cache-exceptions semantics and
  fixture nullcontext-doesn't-close caveat for nested sessions.
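
The functools.cache semantics noted above can be checked with a minimal
sketch. The names load_access and state are hypothetical stand-ins, not
taken from the spec; only functools.cache itself is real.

```python
import functools

# Hypothetical mutable config and call counter, for illustration only.
state = {"configured": False, "calls": 0}


@functools.cache
def load_access() -> str:
    """Stand-in for a cached factory that raises while unconfigured."""
    state["calls"] += 1
    if not state["configured"]:
        raise RuntimeError("not_configured")
    return "access"


def demo() -> int:
    """Return how many times the function body actually ran."""
    for _ in range(2):
        try:
            load_access()  # raises; the exception is NOT stored in the cache
        except RuntimeError:
            pass
    state["configured"] = True
    load_access()  # body runs again and the result is cached
    load_access()  # served from the cache, body does not run
    return state["calls"]
```

A raising call leaves no cache entry, so a later call after the config is
set succeeds without any explicit cache_clear().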

* docs(spec): #134 revision 4 — incorporates third-round review

Third reviewer verdict: 'implementation-ready with two trivial edits';
explicitly noted prior rounds did the heavy lifting.

Edits:
1. get_bq_access() module-level function instead of @classmethod
   @functools.cache from_config. Removes the classmethod-cache stacking
   footgun (different Python versions wrap differently) and gives FastAPI's
   dependency introspection a clean function signature. Drops the
   'Do not subclass BqAccess' caveat that no longer applies.

2. Commit 1 strategy explicitly: wrap _fetch_bq_sample (v2_sample),
   _bq_dry_run_bytes + _run_bq_scan (v2_scan), and _fetch_bq_schema
   (v2_schema strict block). Do NOT touch _fetch_bq_table_options swallow-all
   in commit 1 — preserved as-is, then migrated (still preserved) in commit 2.
   All three endpoints emit the same structured body shape so client parsers
   see one consistent contract throughout the staged rollout. No more
   half-rolled-out window where /sample is bare 500 while /scan is
   structured 502.

* docs(plan): #134 implementation plan — Phase 1 (atomic bug fix) + Phase 2 (BqAccess refactor) + Phase 3 (verification)

Bite-sized TDD tasks. 3 phases, 16 tasks total:

Phase 1 (Commit 1) — atomic bug fix across all four v2 endpoints:
  Tasks 1.1-1.5 wrap _fetch_bq_sample, _bq_dry_run_bytes, _run_bq_scan,
  _fetch_bq_schema with structured 502/400 try/except. _fetch_bq_table_options
  preserved untouched. CHANGELOG Fixed entries.

Phase 2 (Commit 2) — BqAccess facade extraction + migration:
  Tasks 2.1-2.5 build connectors/bigquery/access.py bottom-up
  (BqProjects, BqAccessError, translate_bq_error, default factories,
  BqAccess class, get_bq_access module-level cached). Task 2.6 adds
  conftest.py fixture. Tasks 2.7-2.9 migrate v2_scan, v2_sample, v2_schema
  to BqAccess. Tasks 2.10-2.11 migrate RemoteQueryEngine + tests
  (lazy bq_access, drop _bq_client_factory). Task 2.12 CHANGELOG
  Changed BREAKING + Internal.

Phase 3 — Verification:
  3.1 full pytest. 3.2 squash into two PR-shape commits. 3.3 manual
  E2E on agnes-development per spec protocol → close #134.

Self-review table maps spec sections to implementing tasks; no gaps.

* fix(v2): #134 structured 502/400 on BQ errors across /scan, /scan/estimate, /sample, /schema

Wraps the BigQuery call sites in v2_scan, v2_sample, and v2_schema (strict
block only) with try/except for google.api_core exceptions, translating to
HTTPException with a structured body shape: {error, message, details}.

Fixes Pavel's report (#134) where these endpoints returned bare HTTP 500
with no body when the SA on agnes-development hit cross-project Forbidden
on serviceusage.services.use.

Also fixes /sample's missing billing_project fallback (the bug 33a9964
fixed for /scan never landed here).

Status code split:
  - /scan, /scan/estimate: BadRequest -> 400 (bq_bad_request) since SQL is
    user-derived from req.select/where/order_by.
  - /sample, /schema: BadRequest -> 502 (bq_upstream_error) since SQL is
    server-constructed from validated identifiers.
  - All Forbidden -> 502 with cross_project_forbidden if 'serviceusage' in
    error message (with hint pointing at data_source.bigquery.billing_project),
    else bq_forbidden.
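
The status code split above can be sketched as a small classifier. This is
an illustration under assumptions, not the code in this commit: Forbidden
and BadRequest are local stand-ins for the google.api_core exception
classes, and ErrorBody, classify, and sql_is_user_derived are hypothetical
names.

```python
from dataclasses import dataclass, field


class Forbidden(Exception):
    """Local stand-in for google.api_core.exceptions.Forbidden."""


class BadRequest(Exception):
    """Local stand-in for google.api_core.exceptions.BadRequest."""


@dataclass
class ErrorBody:
    """Hypothetical carrier for the {error, message, details} body."""
    status: int
    error: str
    message: str
    details: dict = field(default_factory=dict)


def classify(exc: Exception, sql_is_user_derived: bool) -> ErrorBody:
    """Map an upstream exception to a status code and error kind."""
    msg = str(exc)
    if isinstance(exc, Forbidden):
        if "serviceusage" in msg:
            hint = "check data_source.bigquery.billing_project"
            return ErrorBody(502, "cross_project_forbidden", msg, {"hint": hint})
        return ErrorBody(502, "bq_forbidden", msg)
    if isinstance(exc, BadRequest):
        if sql_is_user_derived:  # /scan, /scan/estimate
            return ErrorBody(400, "bq_bad_request", msg)
        return ErrorBody(502, "bq_upstream_error", msg)  # /sample, /schema
    raise exc  # anything else is left untranslated
```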

Body shape matches what the upcoming BqAccess refactor (next commit) will
produce, so client-side parsers see one consistent contract throughout
the staged rollout.

_fetch_bq_table_options preserved exactly as-is — its swallow-all-and-return-empty
contract is intentional and survives into the refactor; /schema continues to
return 200 with empty partition info when partition queries fail.

Outer wraps in scan_endpoint, scan_estimate_endpoint, sample, and schema
endpoints exist only to make the test pattern (monkeypatching whole
_fetch_* functions) work, and are tagged TODO(#134 Phase 2) for removal
once BqAccess centralizes translation.

* refactor(bq): #134 BqAccess facade — unify v2_scan, v2_sample, v2_schema, RemoteQueryEngine

Extracts the duplicated BigQuery-access pattern (project resolution +
client construction + DuckDB-extension session + Google-API error
translation) into connectors/bigquery/access.py. Migrates four
call sites to use it:

- app/api/v2_scan.py — _bq_dry_run_bytes, _run_bq_scan
- app/api/v2_sample.py — _fetch_bq_sample
- app/api/v2_schema.py — _fetch_bq_schema (strict translation),
  _fetch_bq_table_options (preserves swallow-all best-effort contract)
- src/remote_query.py — RemoteQueryEngine, lazy bq_access kwarg

The new module exposes:
- BqProjects (frozen dataclass: billing + data project IDs)
- BqAccessError (typed exception with HTTP_STATUS class mapping)
- BqAccess (facade with injectable client_factory/duckdb_session_factory
  for tests; defaults call the real google-cloud-bigquery + DuckDB extension)
- get_bq_access (module-level @functools.cache; FastAPI Depends target)
- translate_bq_error (Google API exception → BqAccessError mapper, with
  BqAccessError pass-through, 'serviceusage'-substring heuristic for
  cross_project_forbidden, and bad_request_status param distinguishing
  user-derived (400) from server-constructed (502) SQL)
- _default_client_factory, _default_duckdb_session_factory

RemoteQueryEngine.__init__ no longer accepts _bq_client_factory; tests
migrate to bq_access=BqAccess(projects, client_factory=...). DuckDB-only
RemoteQueryEngine tests need no changes — bq_access defaults to None and
get_bq_access() is only invoked on first BQ call (lazy resolution).
BqAccessError raised internally is translated to RemoteQueryError(
error_type="bq_error") in _get_bq_client to preserve the engine's
existing public contract — CLI and /api/query/hybrid callers see no change.

Endpoint tests (test_v2_scan, test_v2_scan_estimate, test_v2_sample,
test_v2_schema) migrate from monkey-patching whole _fetch_* functions
to using the new bq_access fixture in tests/conftest.py — which
exercises the REAL translation path through BqAccess + translate_bq_error,
closing the test gap flagged in Task 1.1's review.

Side-effect behavior change: v2_sample's FROM clause now uses the data
project (instance.yaml data_source.bigquery.project), not the conflated
billing_project from Phase 1. Documented in CHANGELOG ### Internal.

BREAKING for deployments combining BIGQUERY_PROJECT env var with
data_source.bigquery.project in instance.yaml — env var now overrides
data project too. See CHANGELOG ### Changed.

Two known-duplicate BQ-access sites (connectors/bigquery/extractor.py,
scripts/duckdb_manager.register_bq_table) explicitly out of scope;
tracked as follow-up.

Removed stale docstring at the previous src/remote_query.py:204
that referenced scripts.duckdb_manager._create_bq_client as the default
BQ client factory (RemoteQueryEngine never actually used that function).

Test counts: tests/test_bq_access.py +27 (new), tests/test_v2_*.py +
tests/test_remote_query.py migrated to bq_access fixture (counts unchanged
or +1-2 per file). Full suite: 2086 passed, 8 pre-existing failures
(DB migration tests with unrelated internal_roles DependencyException —
not introduced by this PR).

* fix(bq_access): translate DefaultCredentialsError to BqAccessError(auth_failed)

CI on PR #138 caught: bigquery.Client(...) resolves Application Default
Credentials at construction time; without ADC (CI without SA key, dev
laptop without 'gcloud auth application-default login') it raises
google.auth.exceptions.DefaultCredentialsError synchronously.

Pre-fix, _default_client_factory only caught ImportError, so DefaultCredentialsError
propagated as a raw exception and would surface from production endpoints
as a bare 500 (the exact failure mode #134 sets out to fix).

Now translates to BqAccessError(kind='auth_failed', details.hint='Run
gcloud auth application-default login...'). Endpoint catch chain returns
HTTP 502 with structured body. Adds unit test
test_raises_auth_failed_on_default_credentials_error.

Third-round spec review flagged this case in passing; the fix didn't land.
CI's auth-less environment surfaced it.

* fix(bq_access): get_bq_access() returns sentinel instead of raising when not configured

Devin BUG_0001 on PR #138 review: 'get_bq_access() as FastAPI Depends
breaks all v2 endpoints for non-BigQuery instances'.

Pre-fix: get_bq_access() raised BqAccessError(not_configured) when
neither BIGQUERY_PROJECT env nor data_source.bigquery.project was set.
Because FastAPI resolves Depends() BEFORE the endpoint body runs, this
exception fires during dep-injection — the endpoint's try/except
BqAccessError clause never gets a chance to catch it. Result: every
v2 request on Keboola-only or CSV-only instances returned bare HTTP
500, even for local-source tables that never touch BigQuery.

Fix: get_bq_access() now returns a sentinel BqAccess with empty
BqProjects and factories that raise BqAccessError(not_configured)
on actual use. Construction succeeds, FastAPI's dep-injection cleanly
yields the sentinel, the endpoint runs. The local-source code path
in build_sample / build_schema / etc. never calls bq.client() or
bq.duckdb_session() (it reads parquet directly), so non-BQ tables
return 200 as before. Only when an endpoint actually tries to query
BQ (source_type == 'bigquery') does the sentinel raise — and the
endpoint's existing except BqAccessError catches it normally,
returning structured 502 with hint.
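
A minimal sketch of the sentinel pattern described above, under
simplifying assumptions: it reads only the BIGQUERY_PROJECT env var
(the real resolver also consults instance.yaml), the constructor
signature is illustrative, and the configured branch returns a
placeholder client.

```python
import functools
import os
from dataclasses import dataclass


class BqAccessError(Exception):
    """Simplified typed error; the real class carries more detail."""

    def __init__(self, kind: str, hint: str = ""):
        super().__init__(kind)
        self.kind = kind
        self.hint = hint


@dataclass(frozen=True)
class BqProjects:
    billing: str = ""
    data: str = ""


def _raise_not_configured(_projects):
    # Factory used by the sentinel: fails on use, not on construction.
    raise BqAccessError("not_configured", hint="set data_source.bigquery.project")


class BqAccess:
    def __init__(self, projects: BqProjects, client_factory):
        self.projects = projects
        self._client_factory = client_factory

    def client(self):
        return self._client_factory(self.projects)


@functools.cache
def get_bq_access() -> BqAccess:
    project = os.environ.get("BIGQUERY_PROJECT", "")
    if not project:
        # Dependency injection succeeds; only a real BQ call raises.
        return BqAccess(BqProjects(), _raise_not_configured)
    # Placeholder client for the sketch; the real factory builds a BQ client.
    return BqAccess(BqProjects(billing=project, data=project), lambda p: object())
```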

Test get_bq_access::test_raises_not_configured_when_neither_set
renamed and rewritten to test_returns_sentinel_when_neither_set:
asserts BqAccess is returned, then asserts client() and
duckdb_session() each raise BqAccessError(not_configured) on call.

Test test_does_not_cache_exceptions removed (no longer applicable)
and replaced with test_sentinel_is_cached_per_process documenting
the operator-restart-on-config-change contract.

* docs(spec+plan): #134 genericize customer-specific tokens (CLAUDE.md OSS rule)

Devin BUG_0001/0002 round 3 on PR #138: spec and plan docs contained
customer-specific deployment hostnames, deployment names, and a GCP
project ID that violated CLAUDE.md's vendor-agnostic OSS rule
('Nothing customer-specific belongs in code, configuration defaults,
comments, docs, commit messages, PR titles, or PR bodies').

Replacements:
  agnes-development.groupondev.com -> <your-agnes-host>
  agnes-development                -> <your-dev-instance>
  prj-grp-dataview-prod-1ff9       -> <your-data-project>
  s1_session_landings              -> <bq_table_id>

E2E verification semantics unchanged — operators still run the same
four curls + config flip + retry, just substituting their own host /
deployment name / project / table.

* fix(bq_access): hook get_bq_access.cache_clear into instance_config.reset_cache

Devin ANALYSIS_0004 on PR #138: get_bq_access is @functools.cache'd at
process level, so it captures BigQuery project IDs at first call and
ignores subsequent instance.yaml changes. Pre-Phase-2 the v2 endpoints
re-read get_value() on every request, so admin /api/admin/server-config
saves (which call instance_config.reset_cache()) hot-reloaded the BQ
project. Without this fix, my refactor silently regresses that contract
— operators editing instance.yaml via the admin UI would see no effect
on v2 endpoints until container restart.

instance_config.reset_cache() now also calls
connectors.bigquery.access.get_bq_access.cache_clear() (lazy import,
swallowed if connectors module isn't loaded — keeps instance_config
usable in isolated unit tests).
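
The hot-reload contract can be illustrated with a single-file sketch.
Assumptions: _config stands in for instance.yaml, and reset_cache here is
a simplified stand-in for instance_config.reset_cache() without the lazy
import.

```python
import functools

# Hypothetical in-memory config standing in for instance.yaml.
_config = {"project": "proj-a"}


@functools.cache
def get_bq_access() -> str:
    """Captures the project ID at first call, as a cached factory would."""
    return _config["project"]


def reset_cache() -> None:
    """Simplified stand-in: also drop the cached BQ access object."""
    get_bq_access.cache_clear()


def demo() -> tuple:
    first = get_bq_access()
    _config["project"] = "proj-b"  # operator edits the config
    stale = get_bq_access()        # still the cached value
    reset_cache()
    fresh = get_bq_access()        # re-read after invalidation
    return first, stale, fresh
```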

Adds test_instance_config_reset_cache_invalidates_get_bq_access as
regression guard. Updates CHANGELOG Internal entry to mention the
hot-reload contract + the not-configured sentinel behavior (round-3
fix from Devin BUG_0001 was previously only in commit message).

* fix(bq_access): surface not_configured before identifier validation + plan path genericize

Devin BUG_0001 + BUG_0002 round 5 on PR #138.

BUG_0001 (plan doc): personal filesystem path violated CLAUDE.md
vendor-agnostic rule. Replaced with '<worktree-root>' placeholder.

BUG_0002 (sentinel error path): when get_bq_access() returns the sentinel
BqAccess (BQ not configured), the empty bq.projects.data was reaching
validate_quoted_identifier first and raising ValueError -> endpoint
mapped to HTTP 400 'unsafe_identifier' instead of structured 500
'not_configured' with hint.

Each fetch helper now checks 'if not bq.projects.data: bq.client()' as
the first step, which triggers the sentinel's BqAccessError(not_configured).
Endpoint catches the typed error and returns HTTP 500 with hint pointing
at data_source.bigquery.project. Best-effort _fetch_bq_table_options
returns {} silently in this case (preserves the swallow-all contract).

* fix(bq_access): classify DuckDB-native exceptions from bigquery_query() via string match

Devin ANALYSIS on PR #138 review (latest round). The DuckDB bigquery
extension is a C++ plugin making its own HTTP calls — when BQ returns
403, it throws duckdb.IOException with the BQ error embedded as text,
not gax.Forbidden. translate_bq_error's isinstance checks would miss
these, falling to case 7 → bare 500 in production for v2_scan, v2_sample,
and v2_schema (the bigquery_query() paths).

Fix: last-resort string-match heuristic before the re-raise. A
case-insensitive match on 'Forbidden' / '403' / 'Bad Request' / '400' in the
message classifies via the same kind hierarchy. The 'serviceusage' substring
still distinguishes cross_project_forbidden from bq_forbidden. The match is
specific enough that random exceptions without HTTP-error keywords still re-raise.
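
A rough sketch of such a string-match fallback. The function name
classify_message is hypothetical, and returning None to mean "re-raise
the original exception" is a convention of this sketch, not necessarily
of translate_bq_error.

```python
from typing import Optional


def classify_message(message: str) -> Optional[str]:
    """Classify an exception message by HTTP-error keywords.

    Returns an error kind, or None when nothing matches and the caller
    should re-raise the original exception.
    """
    text = message.lower()
    if "forbidden" in text or "403" in text:
        if "serviceusage" in text:
            return "cross_project_forbidden"
        return "bq_forbidden"
    if "bad request" in text or "400" in text:
        return "bq_bad_request"
    return None
```

Bare numeric matches such as '400' can also hit unrelated text (a row
count, a port number), which is why the heuristic runs only as a last
resort after the isinstance checks.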

Adds 4 unit tests covering the new heuristic + the 'don't swallow random
exceptions' invariant.

* chore(release): cut 0.22.0

PR #138 contains issue #134 user-visible behavior changes:
- BREAKING: BIGQUERY_PROJECT env var now overrides instance.yaml
  data_source.bigquery.project for v2 endpoints (previously
  RemoteQueryEngine billing only).
- Fixed: structured 502/400 on /api/v2/sample, /scan, /scan/estimate,
  /schema when BigQuery raises Forbidden/BadRequest (was bare 500).
- Internal: BqAccess facade refactor unifying four duplicate BQ-access
  call sites; instance_config.reset_cache() now invalidates BqAccess
  cache too so admin server-config saves hot-reload BQ project IDs.

Bumps to 0.22.0 because PR #137 merged first and took 0.21.0.
2026-04-30 10:11:20 +02:00
Vojtech
38f6b639d2
feat(observability): request_id end-to-end + dev debug toolbar + centralized logging (#136)
Cuts release 0.20.0.

## Highlights
- X-Request-ID header on every response + sanitized to [A-Za-z0-9_-] (CRLF log-forging mitigation)
- Error pages (HTML + JSON 500) surface request_id for support tickets
- Dev debug toolbar gated by DEBUG=1 — fastapi-debug-toolbar with custom DuckDBPanel
- Centralized app.logging_config.setup_logging() replaces 23 scattered basicConfig calls
- Telegram bot drops bot.log file — stdout only (BREAKING)

## Devin findings addressed
- BUG_0001: .env.template no longer claims FastAPI debug=True
- BUG_0002: subprocess extractor logs INFO to stderr again
- ANALYSIS_0003: _wants_html no longer matches Accept: */* (curl gets JSON as before)
- BUG on b1c6ee9: HTML 500 page no longer leaks str(exc) in production
- BUG on b13d2fe: 2 CLAUDE.md compliance flags (transform.py + ws_gateway) accepted as scope-limited logging refactor — follow-up to update CLAUDE.md if needed

See CHANGELOG [0.20.0] for full notes.
2026-04-29 22:54:21 +02:00
ZdenekSrotyr
b7a1795834
feat(scheduler): re-wire sync_schedule + script.schedule; tune via env; OpenMetadata TLS (#135)
Bundles 4 issues:
- #79 — table_registry.sync_schedule honored at runtime (API-side filter + Pydantic validators)
- #78 — script_registry.schedule honored via new POST /api/scripts/run-due (atomic claim, BackgroundTask exec, deploy-time safety validation)
- #77 — sidecar JOBS env-driven (SCHEDULER_DATA_REFRESH_INTERVAL/HEALTH_CHECK_INTERVAL/SCRIPT_RUN_INTERVAL/TICK_SECONDS)
- #89 — OpenMetadataClient verify=True default (BREAKING for self-signed)

Cuts release 0.19.0. See CHANGELOG for full notes incl. Known Limitations.
2026-04-29 22:06:30 +02:00
ZdenekSrotyr
1824b9dd9c
feat(admin): #108 M1 — BigQuery table registration in UI + CLI (#119)
Issue #108 Milestone 1. Adds BigQuery table registration via /admin/tables UI and `da admin register-table` CLI without hand-editing table_registry. POST /api/admin/register-table/precheck for round-trip validation. --dry-run flag on CLI. Audit-log entries on register/update/unregister. PUT /api/admin/registry/{id} now preserves registered_at (closes #130).
2026-04-29 13:18:31 +02:00
ZdenekSrotyr
61f6b8d2d5
feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)
Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
2026-04-29 09:18:55 +02:00
ZdenekSrotyr
2e1dfb7553
feat(v2): claude-driven fetch primitives + 0.14.0 (#102)
Replaces the BigQuery wrap-view pattern with a discovery + scoped-fetch toolkit driven by the analyst's Claude session. Adds /api/v2/{catalog,schema,sample,scan,scan/estimate}, da catalog/schema/describe/fetch/snapshot/disk-info CLI commands, sqlglot-backed WHERE validator, process-local quota tracker, agent rails skill (cli/skills/agnes-data-querying.md). BREAKING: BQ wrap views off by default — set data_source.bigquery.legacy_wrap_views=true for one cycle. Backward-compat field_validator on primary_key. Catalog cache now matches documented 300s TTL with RBAC fresh per request. Cuts release v0.14.0.
2026-04-29 01:07:19 +02:00
ZdenekSrotyr
ef74ec010c
fix(ops): #81 Group B — Keboola partial-failure exit code 2 (squashed) (#99)
Closes M14 from issue #81. Keboola extractor exits 0/1/2
(success/full-fail/partial). sync.py interprets exit 2 as
PARTIAL FAILURE (data-quality alert, distinct from exit 1).
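
The exit-code contract can be sketched as follows. SyncOutcome and
interpret_exit_code are hypothetical names; treating every other code
(including 124, conventionally a timeout) as a full failure is an
assumption of this sketch.

```python
import enum


class SyncOutcome(enum.Enum):
    SUCCESS = "success"
    FULL_FAILURE = "full_failure"
    PARTIAL_FAILURE = "partial_failure"


def interpret_exit_code(rc: int) -> SyncOutcome:
    """Map an extractor exit code to a sync outcome."""
    if rc == 0:
        return SyncOutcome.SUCCESS
    if rc == 2:
        # Some tables extracted, some failed: data-quality alert.
        return SyncOutcome.PARTIAL_FAILURE
    # Assumption: 1 and any unexpected code count as a full failure.
    return SyncOutcome.FULL_FAILURE
```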

Tests: tests/test_keboola_extractor_exit_codes.py — 14 cases including
runtime mock subprocess (rc=0/1/2/124).

Refs #81 Group B.
2026-04-27 21:52:46 +02:00
ZdenekSrotyr
569cd90d75
fix(security): #81 Group D — extractor-side identifier validation (squashed) (#97)
Closes M15 from issue #81 — SQL injection via attacker-controlled
identifiers in connectors/keboola/extractor.py and
connectors/bigquery/extractor.py.

Lifted _validate_identifier from src/orchestrator.py into a new
src/identifier_validation.py shared module (single source of truth for
both layers). Two validator policies:

- validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for
  table_name — matches the orchestrator's rebuild-time check, so dashed
  names fail fast at extraction rather than being silently dropped.
- validate_quoted_identifier (relaxed, accepts dashes/dots) for
  bucket/dataset/source_table — Keboola in.c-foo and BigQuery
  my-dataset are legitimate, just need to be safe inside `"..."`.
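
The two policies can be sketched as below. The strict pattern is the one
quoted above; the relaxed pattern is an assumption, since the commit
message only says it accepts dashes and dots. Raising ValueError and
returning the validated name are choices made for this sketch.

```python
import re

# Strict policy: pattern quoted in the commit message.
_STRICT = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]{0,63}$")
# Relaxed policy: ASSUMED pattern, chosen so that nothing in it can
# terminate a double-quoted SQL identifier.
_RELAXED = re.compile(r"^[a-zA-Z0-9_][a-zA-Z0-9_.\-]{0,127}$")


def validate_identifier(name: str) -> str:
    """Strict check for table_name; raises ValueError when unsafe."""
    # fullmatch avoids the '$'-before-trailing-newline quirk of match().
    if not _STRICT.fullmatch(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def validate_quoted_identifier(name: str) -> str:
    """Relaxed check for bucket/dataset/source_table names."""
    if not _RELAXED.fullmatch(name):
        raise ValueError(f"unsafe quoted identifier: {name!r}")
    return name
```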

Both extractors skip-and-continue on unsafe rows (logged + counted in
failure stats); _extract_via_extension re-validates as defense-in-depth.

71/71 extractor + orchestrator tests pass.
Refs #81 Group D.
2026-04-27 21:46:17 +02:00
ZdenekSrotyr
4e4d2a39e6
chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88, wave 1) (#94)
* chore(oss): isolate customer-specific deploy bits from scripts/grpn/ (#88)

Vendor-neutralization step before public release. The directory mixed
two concerns: (1) generic ops scripts referenced from mainline OSS
infrastructure (TLS rotation, auto-upgrade cron) and (2) one operator's
hackathon manual-deploy helper with hardcoded GCP project IDs, VM names,
and admin emails. Splitting them per concern.

Moved (still in OSS, just under a vendor-neutral name):
- scripts/grpn/agnes-tls-rotate.sh   → scripts/ops/agnes-tls-rotate.sh
- scripts/grpn/agnes-auto-upgrade.sh → scripts/ops/agnes-auto-upgrade.sh

Removed (belongs in private consumer infra repos, not upstream OSS):
- scripts/grpn/Makefile (hardcoded prj-grp-foundryai-dev-7c37, foundryai-development VM name, e_zsrotyr@groupon.com bootstrap email)
- scripts/grpn/README.md (GRPN hackathon deploy walkthrough)
- docs/superpowers/plans/2026-04-22-grpn-deploy-learnings.md (org-specific deploy log)

Cross-refs updated in README.md, CLAUDE.md, docs/DEPLOYMENT.md,
docker-compose.yml. CHANGELOG entry flags BREAKING (ops) for any
consumer infra repo that installs these scripts via path-based systemd
timers.

This is the first wave of #88 — the remaining leaks (test data with
prj-grp-dataview-prod-1ff9, AIAgent.FoundryAI tags in OpenMetadata test
fixtures, docstrings in connectors/openmetadata/enricher.py) will be a
separate, smaller PR.

Refs #88.

* chore(oss): comprehensive vendor-neutralization (#88 wave 2 + review fixes)

PR #94 review found that the original wave-1 grep was scoped wrong and
many leaks survived. This commit closes wave 1 properly AND folds in all
wave-2 anonymization in a single pass — easier to review than two PRs.

Wave-1 review-fix corrections:
- Caddyfile: scripts/grpn/agnes-tls-rotate.sh → scripts/ops/ (the original
  wave-1 grep filter excluded extensionless files like Caddyfile).
- CHANGELOG bullet rewritten — original wording implied an in-repo migration
  for infra/modules/customer-instance/, which is wrong (the TF module embeds
  the script inline via heredoc, never sourced from scripts/grpn/). Now
  flags downstream consumer infra repos only.
- infra/modules/customer-instance/variables.tf: Czech docstring with `grpn`
  example → English description with `acme, example` placeholders.

Wave-2 anonymization:
- Code docstrings (connectors/openmetadata/{client,transformer,enricher}.py,
  src/catalog_export.py, scripts/duckdb_manager.py): prj-grp-… →
  my-bq-project / prj-example-1234, AIAgent.FoundryAI → AIAgent.MyAgent,
  FoundryAIDataModel → AnalyticsDataModel.
- Test fixtures (4 files): same set of replacements — 157 tests still pass.
- .github/workflows/keboola-deploy.yml: "Groupon-side dev VMs" comment →
  generic "per-developer dev VMs".
- docs/auth-groups.md + scripts/debug/probe_google_groups.py:
  kids-ai-data-analysis project name → acme-internal-prod placeholder.
- 5 planning/spec docs under docs/superpowers/{plans,specs}/2026-04-21-*:
  hardcoded IPs (34.77.94.14, 34.77.102.61) → <dev-vm-ip>/<prod-vm-ip>;
  GRPN/Groupon → Acme/another-customer; prj-grp-… → prj-example-….
- scripts/switch-dev-vm.sh deleted — hackathon-era helper hardcoded to a
  specific shared dev VM. Per-developer dev VMs are the supported pattern.

Final grep `groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.(94|102)\.…|kids-ai-data`
returns zero hits (excluding CHANGELOG.md historical entries).

CHANGELOG entry expanded to document both waves under one bullet, with
the BREAKING (ops) clarification about the TF module being unaffected.

Refs review of #94, closes #88.

* fix(oss): close remaining #94 review-2 findings (Czech, padak refs, CHANGELOG)

Reviewer of PR #94 round 2 caught 5 remaining items the wave-2 pass missed:

1. infra/modules/customer-instance/variables.tf had Czech descriptions on
   8 more variables. Previous review only flagged line 19; this round
   audited the rest. Translated lines 2, 28, 42-46 (heredoc), 60, 65, 71,
   78, 84 to English. Same review concern: a Terraform module whose
   customer-facing API surface is documented in Czech is unfit for OSS distribution.

2. infra/modules/customer-instance/outputs.tf had Czech descriptions on
   four outputs. Same fix.

3. docs/padak-security.md referenced a private repo (padak/keboola_agent_cli#206)
   in two places. Replaced with generic 'tracked upstream in the auth-CLI repo'
   per CLAUDE.md vendor-agnostic rule (no cross-refs to private repos).

4. scripts/fetch-env-from-secrets.sh:41 had a Czech comment.
   Translated.

5. CHANGELOG cosmetic: bullet said 'AIAgent.FoundryAI -> AIAgent.MyAgent'
   but the actual code uses both MyAgent (in docstrings) and Example
   (in test fixtures). Reworded to mention both targets.

Final grep across all shipping file types (.md, .py, .yml, .yaml, .sh,
Makefile, .json, .tf, .tpl, Caddyfile, .toml) for groupon|grpn|foundryai|
prj-grp|groupondev|34.77.94.14|34.77.102.61|kids-ai-data|padak/keboola_agent_cli
returns ZERO hits (excluding CHANGELOG.md). Czech-diacritic grep across
.tf/.toml/Caddyfile/Makefile/.yml returns ZERO hits.

157/157 OpenMetadata + DuckDB tests still pass.

* fix(oss): close #94 round-3 leaks (env.template, instance.yaml.example, padak typo)

Round-3 reviewer caught two MUST-FIX leaks the round-2 grep missed
(grep was scoped to extensions that did not include .template / .example
suffixes — the audit was right, the previous grep was not paranoid enough):

1. config/instance.yaml.example:114 — '(optional - Groupon-specific)' brand
   leak in a shipping config example. Replaced with '(optional)'.

2. config/.env.template:68 — stale path 'scripts/grpn/agnes-tls-rotate.sh'
   in operator-facing env-template comment. The script lives at
   scripts/ops/ now (commit 16a85cc); this comment had been pointing
   operators at a non-existent path.

3. docs/padak-security.md:188 — phrase duplication 'tracked in tracked
   upstream' from a sloppy substitution in round-2. Trivial wording fix.

Final paranoid grep across .md/.py/.yml/.yaml/.sh/Makefile/.json/.tf/.tpl/
Caddyfile/.toml/.template/.example/.env* with the full token set
(groupon|grpn|foundryai|prj-grp|groupondev|34\.77\.94\.14|34\.77\.102\.61|
kids-ai-data|padak/keboola_agent_cli) returns ZERO hits, excluding
CHANGELOG.md historical entries.

* fix(oss): #94 round-4 — QUICKSTART.md + rename padak-security.md

Devin Review caught two findings on the latest round-3 commit:

1. docs/QUICKSTART.md:67 still pointed users at the deleted
   scripts/switch-dev-vm.sh. A Quickstart user following step-by-step
   would hit a missing-file error at the final step. Replaced with the
   inline gcloud-ssh equivalent that the Removed bullet documents.

2. docs/padak-security.md filename retains the personal identifier
   'padak'. The PR fixed the body content (replaced
   padak/keboola_agent_cli#206 references with generic wording) but
   missed the filename. Renamed to docs/security-audit-2026-04.md
   (date-anchored, vendor-neutral). Updated the historical CHANGELOG
   link to point at the new path with an inline note about the rename.

* fix(oss): redact remaining hardcoded IPs from planning docs + remove default email

Devin Review caught two more leaks:
1. scripts/fetch-env-from-secrets.sh line 16 had a hardcoded
   personal-email default (zdenek.srotyr@keboola.com). Replaced with
   ':?' bash error so SEED_ADMIN_EMAIL must be explicitly set —
   safer than carrying any specific identity.
2. Planning docs still had 35.195.96.98 and 34.62.223.189 (legacy
   prod/dev IPs) that the round-1 IP-replace pattern missed (it only
   targeted 34.77.x.x). Generic regex redaction across all five
   planning docs replaces every public IP with <redacted-ip>,
   preserving private/loopback/IAP ranges.
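
A sketch of such a redaction pass, assuming IPv4 only. The function name
redact_public_ips is hypothetical, and 35.235.240.0/20 is used as the IAP
TCP-forwarding range; the regex actually applied to the planning docs is
not shown in this commit message.

```python
import ipaddress
import re

# Candidate dotted quads; ASCII-only digits.
_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", re.ASCII)
# Assumption: the IAP range to preserve is Google's TCP-forwarding block.
_IAP = ipaddress.ip_network("35.235.240.0/20")


def redact_public_ips(text: str, placeholder: str = "<redacted-ip>") -> str:
    """Replace public IPv4 addresses, keeping private/loopback/IAP ones."""

    def _sub(match: re.Match) -> str:
        raw = match.group(0)
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            return raw  # not a valid address (e.g. 999.1.1.1): leave as-is
        if ip.is_private or ip.is_loopback or ip in _IAP:
            return raw
        return placeholder

    return _IPV4.sub(_sub, text)
```

Four-part version strings that happen to parse as public addresses would
also be redacted, so the output of a pass like this still needs a read.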
2026-04-27 20:24:34 +02:00
ZdenekSrotyr
2f783c5c0a
fix(security): close Jira webhook fail-open + path traversal (#83) (#93)
* fix(security): close Jira webhook fail-open + path traversal (#83)

Two related vulnerabilities:

1. Fail-open signature check: when JIRA_WEBHOOK_SECRET was unset,
   _verify_signature returned True and any unauthenticated POST to
   /webhooks/jira would run the full ingest pipeline. Now fail-closed —
   the handler short-circuits with 503 (operator-misconfiguration signal,
   distinct from 401 wrong-signature) when the secret is missing.

2. Path traversal via attacker-controlled issue_key: webhook payloads
   carry issue.key, which flowed unsanitized into save_issue (issues_dir /
   "{issue_key}.json"), download_attachment (attachments_dir / issue_key),
   and incremental_transform (raw_dir / "issues" / "{issue_key}.json"). A
   crafted webhook with issue.key="../../etc/passwd" could write outside
   the Jira data dir.
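
The fail-closed check from item 1 can be sketched as below. The
HMAC-SHA256 hex signature scheme and the name webhook_status are
assumptions for illustration; only the 503 / 401 split comes from the
description above.

```python
import hashlib
import hmac
from typing import Optional


def webhook_status(secret: Optional[str], body: bytes, signature: str) -> int:
    """Return the HTTP status a fail-closed webhook check would produce."""
    if not secret:
        # Missing secret is an operator misconfiguration: refuse, don't accept.
        return 503
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return 401
    return 200
```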

Defense-in-depth: new connectors/jira/validation.py exposes
is_valid_issue_key (whitelist regex ^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$) and
safe_join_under (Path.resolve() containment check). Both are enforced at
the webhook entry point AND at every filesystem boundary in the connector.
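
A minimal sketch of the two helpers, using the whitelist regex quoted
above. The safe_join_under signature is an assumption; only the
Path.resolve() containment idea comes from this commit message.

```python
import re
from pathlib import Path

# re.ASCII keeps \d from matching non-ASCII digits.
_ISSUE_KEY = re.compile(r"^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$", re.ASCII)


def is_valid_issue_key(key: object) -> bool:
    """True only for keys shaped like PROJ-123."""
    return isinstance(key, str) and _ISSUE_KEY.fullmatch(key) is not None


def safe_join_under(base: Path, *parts: str) -> Path:
    """Join parts under base and refuse any result that escapes base."""
    base_resolved = base.resolve()
    candidate = base_resolved.joinpath(*parts).resolve()
    if not candidate.is_relative_to(base_resolved):  # Python 3.9+
        raise ValueError(f"path escapes {base_resolved}: {candidate}")
    return candidate
```

Resolving before the containment check is what catches symlinks and
absolute-path components, which a plain string prefix test would miss.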

Tests:
- New tests/test_jira_validation.py — unit tests for both helpers
  (parametrized invalid keys, traversal/symlink/absolute-path cases).
- Webhook tests: test_unconfigured_secret_returns_503,
  test_path_traversal_in_issue_key_rejected (parametrized over 10 bad keys),
  test_valid_issue_key_accepted.

CHANGELOG: two CRITICAL Fixed bullets under Unreleased.

Closes #83.

* fix(security): close remaining #83 review findings — webhookEvent traversal, _handle_deletion guard, regex tightening

Reviewer of PR #93 flagged four MUST-FIXes:

1. _log_webhook_event used the attacker-controlled `webhookEvent` field
   as a filename component without sanitization. Payload with
   `webhookEvent: "../../tmp/pwn"` could escape WEBHOOK_LOG_DIR. Now:
   - non-`[A-Za-z0-9_-]` runs are replaced with `_` (dot excluded so
     `..` cannot survive sanitization as a directory component)
   - length capped at 64 chars
   - final path routed through safe_join_under
   New regression test `test_webhook_event_path_traversal_sanitized`.

2. _handle_deletion (connectors/jira/service.py:530) and
   process_webhook_event (line 487) still used raw issue_key in path
   builds. Even though the webhook handler validates upstream, the
   "defense-in-depth at every filesystem boundary" claim required these
   too. Both now run is_valid_issue_key and safe_join_under guards.

3. Regex `^[A-Z][A-Z0-9_]{0,31}-\d{1,12}$` permitted underscores in
   project keys. Atlassian's project-key validator does not — `A_B-1`
   is rejected by Jira itself. Tightened to `[A-Z0-9]` and updated
   tests: `ABC_DEF-1` is now invalid, added Cyrillic А-1 (lookalike),
   CRLF, and oversize cases to the bad-key parametrization.

4. Existing test test_deletion_of_nonexistent_issue_returns_true used
   `PROJ-NOEXIST` which is not a real Jira key shape. Updated to
   `PROJ-99999`. The test still exercises the same intent (deletion of
   issue with no local file is idempotent).
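
The webhookEvent sanitization from item 1 could look roughly like this. The function and constant names are hypothetical; only the character class, the 64-char cap, and the "runs become `_`" rule come from the text above:

```python
import re

# Anything outside [A-Za-z0-9_-] is unsafe; dot is deliberately not allowed.
_UNSAFE_RUN = re.compile(r"[^A-Za-z0-9_-]+")
MAX_EVENT_NAME_LEN = 64


def sanitize_event_name(raw: str) -> str:
    """Collapse each run of disallowed characters into one underscore, then cap the length."""
    return _UNSAFE_RUN.sub("_", raw)[:MAX_EVENT_NAME_LEN]


print(sanitize_event_name("../../tmp/pwn"))       # _tmp_pwn
print(sanitize_event_name("jira:issue_updated"))  # jira_issue_updated
```

The sanitized name would still go through safe_join_under before being used as a filename, as the commit describes.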

73/73 jira tests pass locally (test_jira_webhooks + test_jira_validation
+ test_jira_service + test_jira_service_full + test_jira_incremental).

CHANGELOG updated to document the regex tightening and the new
webhookEvent sanitization.

Refs review of #93.

* fix(tests): test_journey_jira tests assumed fail-open before #83 fix

CI failure on PR #93 caught two journey tests that pinned the OLD
fail-open contract:

- test_webhook_with_no_secret_configured_accepted asserted 200 when
  JIRA_WEBHOOK_SECRET was unset. After the #83 fix that's a 503
  (operator misconfig). Renamed to _refused and flipped the assertion.

- test_webhook_empty_payload_rejected didn't set the secret, so the
  503 short-circuit fired before the empty-payload 400 could. Set
  JIRA_WEBHOOK_SECRET in the patched Config so the test exercises the
  intended path.

56/56 jira journey + webhook + validation tests now pass.

* fix(security): #93 round-3 — webhook fallback format + save_issue early validation

Devin Review caught two real findings:

1. Webhook handler regression: the round-2 fix extracted issue_key only
   from event_data['issue']['key'], but process_webhook_event has long
   supported a fallback 'issue_key' top-level field for certain Jira
   event formats (e.g. delete events historically). The round-2 handler
   rejected those events with 400 before they reached the service layer.
   Fix: mirror process_webhook_event's fallback in the handler — try
   issue.key first, fall through to event_data.get('issue_key') when
   empty. is_valid_issue_key still validates whichever source provided
   the key.

2. save_issue defense-in-depth was incomplete: is_valid_issue_key ran
   AFTER fetch_remote_links and fetch_sla_fields had already used the
   unvalidated issue_key in HTTP URL construction
   ({base_url}/issue/{issue_key}/remotelink etc.). A future internal
   caller invoking save_issue directly with attacker-controlled input
   could trigger outbound requests with a malicious path component
   (limited SSRF / URL-path manipulation against the Jira API server).
   Fix: move the is_valid_issue_key check to immediately after the
   null guard, before any HTTP request or filesystem op. The webhook layer
   still validates upstream; this check is the second layer.
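
A sketch of the key extraction described in item 1, under stated assumptions: extract_issue_key is a hypothetical helper name, and the validator is injected so the example does not depend on the exact regex in use:

```python
from typing import Any, Callable, Mapping, Optional


def extract_issue_key(
    event_data: Mapping[str, Any],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Return a validated issue key, or None when the caller should answer 400."""
    issue = event_data.get("issue")
    # Preferred source: issue.key. Anything that is not a dict is ignored.
    key = issue.get("key") if isinstance(issue, dict) else None
    if not key:
        # Fallback: top-level issue_key, used by some event formats.
        key = event_data.get("issue_key")
    # Whichever source supplied the key, it must pass the same validator.
    if not isinstance(key, str) or not is_valid(key):
        return None
    return key
```

For item 2, the same ordering applies inside save_issue: the validity check runs before the key is interpolated into any URL or path.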

66 jira tests pass.

Refs Devin Review of #93.

* fix(changelog): #93 round-4 — add BREAKING marker to fail-closed bullet

Devin Review caught: the JIRA_WEBHOOK_SECRET fail-closed change is a
behavior change for operators (response code 503 vs old 200) that
existing alerting may treat differently. Per CLAUDE.md changelog
discipline rule, operators grep for **BREAKING** before bumping the
pin. Added the marker + a short note on what action operators need
to take (set the env var if they haven't).

Refs Devin Review of #93.

* fix: #93 round-5 — null-issue crash + comment drift

Devin Review caught two findings on the round-4 commit:

1. Pre-existing crash on null issue field: a webhook payload with
   {"issue": null} (rather than omitting the key) caused
   event_data.get("issue", {}) to return None, then issue.get("key")
   raised AttributeError → unhandled 500. Pre-existing but reachable.
   Fix: 'event_data.get("issue") or {}' normalises None to {}, then
   the existing fallback / validation path returns 400 cleanly.
   New regression test test_null_issue_field_does_not_crash.

2. Inline comment drift: the comment at line 77 documented the allowed
   character class as [A-Za-z0-9._-] (with dot) but the regex at line 27
   excludes dot deliberately (so '..' cannot survive sanitization).
   Fixed the comment to match.
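
The difference behind item 1 can be reproduced in isolation with a payload that carries an explicit null:

```python
import json

event_data = json.loads('{"issue": null}')

# The default argument only applies when the key is absent; an explicit
# null is present in the dict, so it comes back as None.
with_default = event_data.get("issue", {})

# `or {}` also covers the explicit-null case.
normalised = event_data.get("issue") or {}

print(with_default)           # None
print(normalised.get("key"))  # None, and no AttributeError
```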

52 jira tests pass.

Refs Devin Review of #93 round 5.

* fix: #93 round-6 — process_webhook_event also normalises null issue field

Devin Review caught: the webhook handler at app/api/jira_webhooks.py
correctly handles {"issue": null} via 'event_data.get("issue") or {}',
but process_webhook_event at connectors/jira/service.py:509 still
used the bare 'event_data.get("issue", {})' which returns None on
explicit null. Internal callers (anything that invokes
process_webhook_event without going through the HTTP handler) would
hit the same AttributeError the round-5 fix closed at the handler
layer. Same one-line fix.

32 jira tests pass.

Refs Devin Review of #93 round 5.

* fix: #93 round-7 — issue-key regex uses [0-9] not \d

Devin Review caught: Python 3's \d matches any Unicode decimal digit
(Arabic-Indic ٣, Bengali ৩, Devanagari ३, …). A key like TEST-٣ would
pass the regex even though it's not a valid Jira input. Tightened to
[0-9] (ASCII only).

Added three Unicode-digit cases to the bad-key parametrization in
test_jira_validation.py to lock in the contract.
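
The digit-class difference can be checked directly. This sketch uses re.fullmatch so the anchors play no part; the pattern names are illustrative only:

```python
import re

# \d on a str pattern matches any Unicode decimal digit; [0-9] is ASCII only.
UNICODE_DIGITS = re.compile(r"[A-Z][A-Z0-9]{0,31}-\d{1,12}")
ASCII_DIGITS = re.compile(r"[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}")

lookalikes = [
    "TEST-\u0663",  # ARABIC-INDIC DIGIT THREE
    "TEST-\u09e9",  # BENGALI DIGIT THREE
    "TEST-\u0969",  # DEVANAGARI DIGIT THREE
]

for key in lookalikes:
    print(bool(UNICODE_DIGITS.fullmatch(key)), bool(ASCII_DIGITS.fullmatch(key)))
    # True False
```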

Refs Devin Review of #93 round 6.

* fix: #93 round-8 — use \Z anchor not $ in issue-key regex

Devin Review caught: Python's $ anchor matches before a trailing \n,
so re.match('…$', 'TEST-1\n') returns a match. is_valid_issue_key
returned True for newline-terminated keys. \Z is hard end-of-string and
closes that bypass.

Manual verification:
  is_valid_issue_key('TEST-1\n') → False (was True before fix)
  is_valid_issue_key('TEST-1\r\n') → False
  is_valid_issue_key('TEST-1') → True
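
The anchor behaviour can be reproduced without the connector. The two pattern names below are illustrative; they differ only in the end anchor:

```python
import re

DOLLAR_END = re.compile(r"^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}$")
HARD_END = re.compile(r"^[A-Z][A-Z0-9]{0,31}-[0-9]{1,12}\Z")

# $ also matches just before a single trailing newline; \Z only matches
# at the very end of the string.
print(bool(DOLLAR_END.match("TEST-1\n")))  # True
print(bool(HARD_END.match("TEST-1\n")))    # False
print(bool(HARD_END.match("TEST-1")))      # True
```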

Refs Devin Review of #93 round 7.

* docs: #93 round-9 — CHANGELOG regex matches implementation
2026-04-27 19:53:55 +02:00
ZdenekSrotyr
49cb940729 fix: strip HTML from table and column descriptions in OpenMetadata enricher
Imports strip_html from transformer and applies it to both table-level
and column-level descriptions parsed from the OpenMetadata API response.
2026-04-09 18:42:37 +02:00
ZdenekSrotyr
c20da6d744 Remove dead Flask Blueprint from Jira connector
The Flask-based webhook endpoint at connectors/jira/webhook.py is no longer used.
FastAPI handles webhooks via app/api/jira_webhooks.py which is already integrated
into the application. This removes the redundant Flask code.
2026-04-09 07:13:20 +02:00
ZdenekSrotyr
1488e01bf9 feat: add temp-file swap to BigQuery extractor
Write to extract.duckdb.tmp, then atomically swap into place with WAL cleanup.
Prevents lock conflicts with orchestrator holding read lock on existing database.
2026-04-09 07:00:19 +02:00
ZdenekSrotyr
f25393871d fix: escape single quotes in ATTACH TOKEN parameters
- In src/orchestrator.py _attach_remote_extensions: double any single quote in the token (' → '') before passing it to ATTACH
- In connectors/keboola/extractor.py _try_attach_extension: double any single quote in the token (' → '') before passing it to ATTACH

Prevents SQL injection if token contains single quotes.
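
A minimal sketch of the quote-doubling step. sql_string_literal is a hypothetical helper, and the ATTACH statement shape is illustrative only; the real options depend on the extension being attached:

```python
def sql_string_literal(value: str) -> str:
    """Render value as a single-quoted SQL literal, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


token = "abc'); DROP TABLE t; --"
# Illustrative statement shape only, not the connector's actual SQL.
statement = f"ATTACH 'example-url' AS kbc (TOKEN {sql_string_literal(token)})"
print(statement)
# ATTACH 'example-url' AS kbc (TOKEN 'abc''); DROP TABLE t; --')
```

With the quote doubled, the whole payload stays inside one string literal instead of terminating it early.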
2026-04-09 07:00:13 +02:00
ZdenekSrotyr
1b219cabe9 fix: remove dead PRAGMA enable_wal code
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
2026-04-09 06:59:57 +02:00
ZdenekSrotyr
e425d4baa5 fix: handle WAL files in atomic swap to prevent DB corruption
Add _atomic_swap_db helper that removes stale WAL files before and after
moving the temp DuckDB into place. Apply CHECKPOINT before close in both
orchestrator and Keboola extractor so DuckDB flushes WAL before the swap.
2026-04-09 06:57:29 +02:00
ZdenekSrotyr
89154d043b chore: clean repo for public release — fix references, remove drafts
- Replace padak/tmp_oss → keboola/agnes-the-ai-analyst in all docs, infra, CLI
- Replace your-org/ai-data-analyst → keboola/agnes-the-ai-analyst in README, Jira docs
- Remove real GCP project ID from terraform.tfvars.example
- Delete internal draft documents (dev_docs/draft/)
- Update infra/main.tf to clone from main branch
2026-04-08 19:27:25 +02:00
ZdenekSrotyr
79443e0df4 fix: CSV all_varchar in legacy extractor, rewrite DEPLOYMENT.md from real deploy
- Legacy extractor now uses read_csv(all_varchar=true) to avoid type
  inference errors (e.g. seniority column typed as DOUBLE with string values)
- DEPLOYMENT.md rewritten based on actual dev VM deployment experience:
  deploy key setup, DuckDB write locking, env reload gotchas, bootstrap flow
2026-04-08 19:09:55 +02:00
ZdenekSrotyr
3ba207a7f8 feat: add _remote_attach to BigQuery extractor, support token-less ATTACH in orchestrator
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
2026-04-08 18:13:31 +02:00
ZdenekSrotyr
06e1cf0a8d feat: generic _remote_attach contract for remote DuckDB extension views
Extractors with remote tables now write a _remote_attach table into
extract.duckdb so the orchestrator can re-ATTACH external extensions
at query time. The mechanism is source-agnostic — any connector can use it.

- Keboola extractor writes _remote_attach + creates views on kbc.*
- Orchestrator reads _remote_attach, installs extension, reads token from env
- Graceful degradation: missing token → warning, local tables still work
2026-04-08 18:10:12 +02:00
ZdenekSrotyr
92fbb88c15 chore: Docker prod config (Python 3.13, no reload), fix utcnow deprecation, update docs
2026-04-08 12:10:47 +02:00
ZdenekSrotyr
67a1e0bb45 feat: Jira webhook FastAPI adapter — replaces Flask Blueprint
2026-04-08 07:04:50 +02:00
ZdenekSrotyr
4d1acd014a refactor: remove legacy webapp + add missing tests + housekeeping
Phase A: Close fixed issues (#7, #8, #9), add server/ user/ to
.gitignore, increase extractor timeout to 30 min.

Phase B: Add 10 new tests — access request lifecycle (4), CLI admin
commands (5), sync subprocess trigger (1). 578 tests passing.

Phase C: Delete entire webapp/ directory (24,800 lines) — legacy Flask
app fully replaced by FastAPI app/. Fix auth providers to use
app.instance_config instead of webapp.config. Update CLAUDE.md.

Delete 6 webapp-only test files. Fix Jira service config imports.
2026-03-31 13:44:06 +02:00
ZdenekSrotyr
2d6a94fb6f fix: DuckDB concurrency — WAL mode, subprocess sync, temp+rename
Three-pronged fix for DuckDB lock conflicts:

1. WAL mode on system.duckdb — enables concurrent readers + writer
2. Sync trigger runs extractor as subprocess (not background task) —
   separate process = separate DuckDB connections, no lock conflict
3. Both extractor and orchestrator write to .tmp then atomic rename —
   avoids lock conflict with API reads on extract.duckdb/analytics.duckdb

Fixes #9 permanently.
2026-03-31 13:19:57 +02:00
ZdenekSrotyr
10d9280ab5 fix: extractor writes to temp file to avoid lock with orchestrator
Writes extract.duckdb.tmp then renames atomically, avoiding DuckDB lock
conflict when orchestrator holds a read connection on extract.duckdb.
2026-03-31 13:09:51 +02:00
ZdenekSrotyr
bd0b6d19c6 fix: legacy extractor constructs full Keboola table ID from bucket+source_table
Was using tc['id'] which is the registry ID (e.g. 'circle'), not the
full Keboola ID (e.g. 'in.c-finance.circle') needed by the API.
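
As an illustration of the ID construction, assuming the table config is a dict with `bucket` and `source_table` keys (the key names are taken from the commit title; the helper name is hypothetical):

```python
def full_table_id(table_config: dict) -> str:
    """Build the full Keboola table ID from bucket and source table name."""
    return f"{table_config['bucket']}.{table_config['source_table']}"


tc = {"id": "circle", "bucket": "in.c-finance", "source_table": "circle"}
print(full_table_id(tc))  # in.c-finance.circle
```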
2026-03-31 12:06:38 +02:00
ZdenekSrotyr
0084f80ff6 fix: legacy extractor passes Path to export_table, not str
Fixes 'str' object has no attribute 'parent' when Keboola DuckDB
extension falls back to legacy client.
2026-03-31 12:03:16 +02:00
ZdenekSrotyr
865d6d657e fix: keboola client metadata_cache_path uses DATA_DIR instead of deleted config
Fixes #7 — NameError: name 'config' is not defined
2026-03-31 11:57:57 +02:00
ZdenekSrotyr
b502bd8bdd refactor: delete old sync pipeline — 9,500 lines removed
Phase 5 cleanup: remove all code replaced by extract.duckdb architecture.

Deleted modules:
- src/config.py (653) — replaced by DuckDB table_registry
- src/parquet_manager.py (755) — replaced by DuckDB COPY TO
- src/data_sync.py (734) — replaced by SyncOrchestrator
- src/remote_query.py (636) — replaced by DuckDB BigQuery ATTACH
- src/table_registry.py (464) — replaced by DuckDB repository
- connectors/keboola/adapter.py (820) — replaced by extractor.py
- connectors/bigquery/adapter.py (665) — replaced by extractor.py
- connectors/bigquery/client.py (644) — replaced by DuckDB BQ extension

Updated all imports in webapp, catalog_export, enricher, router,
sync_settings_service, generate_sample_data. Kept keboola/client.py
as fallback (removed src.config dependency).

704 tests passing.
2026-03-31 07:50:37 +02:00
ZdenekSrotyr
9f20529f10 fix: resolve 7 preexisting test failures
- Remove iCloud duplicate files (test_db 2.py, src/db 2.py)
- Fix metrics expression fallback to top-level field in transformer + webapp
- Fix sync_data.sh rsync exception pattern for $SSH_HOST variable
- Fix deploy_guard cp regex to skip shell variable expansions
- Update sudoers-deploy with missing root:data-ops rules
- Update CRITICAL_DIRS ownership expectations to match deploy.sh reality

913 tests passing, 0 failures.
2026-03-30 20:36:00 +02:00
ZdenekSrotyr
e2a7ee21a2 fix: Jira extract_init handles empty parquet dirs gracefully
DuckDB read_parquet glob fails when no files match. Skip view creation
for tables without parquet files, create views only after first write.
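
One way to express the "skip view creation" guard, as a sketch only: parquet_view_sql is a hypothetical helper that returns the statement to run, or None when the directory has no parquet files yet:

```python
from pathlib import Path
from typing import Optional


def parquet_view_sql(table: str, parquet_dir: Path) -> Optional[str]:
    """Return a CREATE VIEW statement, or None when there is nothing to read."""
    # A read_parquet glob with zero matches fails, so check before building SQL.
    if not parquet_dir.is_dir() or not any(parquet_dir.glob("*.parquet")):
        return None
    pattern = str(parquet_dir / "*.parquet").replace("'", "''")
    ident = table.replace('"', '""')
    return (
        f'CREATE OR REPLACE VIEW "{ident}" AS '
        f"SELECT * FROM read_parquet('{pattern}')"
    )
```

The caller would run the returned statement only when it is not None, and call the helper again after the first parquet file is written.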
2026-03-30 20:28:29 +02:00