* fix: cutover regressions + parallel Keboola legacy fallback
Bundled fixes from a fresh-deploy run on a Keboola Storage backend with
the block-shared-snowflake-access feature flag — DuckDB Keboola
extension's per-table scan can't access bucket schemas, so the legacy
kbcstorage Storage-API client is the only working path.
CUTOVER REGRESSIONS
- agnes pull hash mismatch on every Keboola local-mode table —
src/orchestrator.py:_update_sync_state stored md5(mtime+size)[:12]
while the CLI compares against full 32-char content MD5. Now stores
the same content MD5 the materialized SQL path already used.
- Trailing-slash sanitization in connectors/keboola/access.py and
extractor.py — DuckDB Keboola extension's ATTACH fails when the URL
ends in / (canonical form).
- src/profiler.py:TableInfo.description becomes optional — two call
sites instantiated without it, crashing the profiler pass.
- scripts/ops/agnes-auto-upgrade.sh: chown on UID change — older images
ran as root, current runs as agnes (uid 999). Reads target uid:gid
from /etc/passwd inside the new image and chowns ${STATE_DIR},
/data/extracts, /data/analytics when the digest moves.
- POST /api/sync/trigger is now singleton per process — two
near-simultaneous trigger calls each forked an extractor subprocess,
fought for extract.duckdb's file lock, starved uvicorn, flipped the
container to unhealthy. Trigger now returns 409
(sync_already_in_progress) when held; _run_sync acquires non-blocking.
PARALLEL LEGACY FALLBACK
- Process pool fan-out for the _extract_via_legacy queue (default 8
workers, override via AGNES_KEBOOLA_PARALLELISM). Process pool, not
thread pool, because connectors/keboola/client.py:export_table does
os.chdir(temp_dir) — process-global, so threads raced and slice files
landed in the wrong directory ("[Errno 2] No such file or directory:
'<job_id>.csv_X_Y_Z.csv'").
- Extractor subprocess timeout 1800s -> 3600s (configurable via
AGNES_EXTRACTOR_TIMEOUT_SEC). 28+ tables × multi-minute Keboola export
jobs need the headroom on telemetry-class projects.
- Process group cleanup on timeout — Popen(start_new_session=True) puts
the extractor in its own group. On timeout the parent SIGTERMs the
group (10s grace) then SIGKILLs stragglers. Without this, the pool
workers were reparented to PID 1 and continued holding open Keboola
Storage export jobs. Inline extractor script also installs a SIGTERM
-> sys.exit(143) handler so the with ProcessPoolExecutor(...) block
__exit__ runs cleanly.
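The process-group cleanup described in the last bullet might look roughly like the sketch below. It is POSIX-only and illustrative: `run_with_group_timeout` and its parameters are hypothetical names, and the real extractor launch has more moving parts (logging, the inline script, env handling).

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_sec, grace_sec=10.0):
    """Run cmd in its own session; on timeout, stop the whole process group."""
    # start_new_session=True makes the child a session and group leader,
    # so its pid doubles as the process-group id.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGTERM)  # polite stop for every group member
        try:
            proc.wait(timeout=grace_sec)
        except subprocess.TimeoutExpired:
            pass  # leader ignored SIGTERM; fall through to SIGKILL
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # anything still alive in the group
        except ProcessLookupError:
            pass  # group already empty
        return proc.wait()
```

Signalling the group rather than the single pid is what keeps pool workers from outliving the extractor and being reparented to PID 1.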
Tests: existing tests that patched subprocess.run updated to patch
subprocess.Popen with a _FakePopen stand-in (same exit-code-injection
contract). Two tests that exercised the parallel path forced
AGNES_KEBOOLA_PARALLELISM=1 to keep mocks alive (mocks don't ride into
ProcessPoolExecutor subprocesses).
Squashed onto current main (was 7 commits + multi-commit CHANGELOG +
agnes-auto-upgrade.sh conflicts; squash avoids per-commit conflict
resolution against main's flat-mount STATE_DIR refactor and 0.38.0
release cut).
* feat(keboola): Storage API direct extract path; drop extension data path
The DuckDB Keboola extension's COPY routes through Keboola QueryService,
which is unreliable on linked-bucket projects (extension v0.1.6 fixes
that case but isn't yet in the community CDN, and pre-fix any project
with the block-shared-snowflake-access feature flag couldn't see bucket
schemas at all). Move the extract path off the extension entirely and
talk to the Storage API directly via signed-URL download — works on any
project, regardless of extension state.
connectors/keboola/storage_api.py (NEW)
Lightweight client built on requests.Session. Three Storage API endpoints, plus the signed-URL download:
- POST /v2/storage/tables/{id}/export-async (kicks off job)
- GET /v2/storage/jobs/{id} (poll until done)
- GET /v2/storage/files/{id}?federationToken=1 (signed URL detail)
- GET <signed_url> (download bytes)
Supports sliced exports (manifest + per-slice signed URLs) and gzipped
payloads. ExportFilter dataclass mirrors the Keboola filter spec
(whereFilters / columns / changedSince / limit) and handles JSON
round-trip with the registry's source_query column. Token redaction
in error messages. Bounded exponential backoff on job polling.
No cloud-SDK dependency on the data path; thread-safe.
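As an illustration of the bounded exponential backoff on job polling, here is a minimal sketch. The function name, the delay parameters, and the set of terminal status strings are assumptions made for the example; the actual client may use different values and a different job payload shape.

```python
import time


def poll_until_done(fetch_status, *, base=0.5, factor=2.0, cap=8.0,
                    max_wait=600.0, sleep=time.sleep):
    """Call fetch_status() until the job reaches a terminal state.

    Delays grow geometrically from `base` and are clamped at `cap`; the loop
    gives up once the next delay would push total sleep time past `max_wait`.
    """
    delay, waited = base, 0.0
    while True:
        job = fetch_status()
        # Terminal status names are assumed for this sketch.
        if job["status"] in ("success", "error"):
            return job
        if waited + delay > max_wait:
            raise TimeoutError(f"job still {job['status']!r} after {waited:.1f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * factor, cap)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting.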
connectors/keboola/extractor.py
- materialize_query() rewritten: takes bucket/source_table/source_query
(JSON filter spec), exports via KeboolaStorageClient, converts CSV
to parquet via DuckDB, atomic os.replace. Same return shape so
sync.py downstream code stays uniform with the BQ branch.
- _extract_via_legacy() also moved to Storage API direct (kept the
name for caller compatibility with _legacy_worker / the parallel
batch extractor). Per-call temp directories — no os.chdir, threads
don't race.
app/api/sync.py
_run_materialized_pass for source_type='keboola' rows now constructs a
KeboolaStorageClient (replaces KeboolaAccess) and passes
bucket/source_table/source_query to materialize_query. Reuses one
client across rows for HTTP keep-alive. Falls back to the
KEBOOLA_STACK_URL env var for the Keboola URL when instance.yaml has
no stack_url configured.
cli/commands/admin.py
discover-and-register defaults Keboola rows to query_mode='materialized'
(NULL source_query = full table), matching the v26 migration's
unification of the local/materialized split for Keboola. BigQuery and
Jira keep their per-source defaults.
src/db.py
Schema bump 25 → 26. Migration: UPDATE table_registry SET
query_mode='materialized' WHERE source_type='keboola' AND
query_mode='local'. NULL source_query on those rows means "full table
export" — same effective behavior the local mode provided, but now
via Storage API instead of the extension.
pyproject.toml
kbcstorage dep stays (admin-side bucket/table list still uses the
SDK in app/api/admin.py / connectors/keboola/client.py); only the
data path is migrated off the SDK. Comment updated to reflect the
new boundary.
tests
- test_keboola_storage_api.py (NEW, 19 tests): ExportFilter parsing,
HTTP client (token redaction, retry logic, polling), download_file
(single, gzipped, sliced), end-to-end export_table_to_csv.
- test_keboola_materialize.py rewritten: mocks KeboolaStorageClient
instead of FakeAccess; same atomic-write + zero-rows + unsafe-id
contracts.
- test_sync_trigger_keboola_materialized.py: registry rows now carry
bucket+source_table+JSON-shape source_query.
114+ Keboola-impacted tests green locally.
* test: schema version assertion bumped to 26 alongside the keboola query_mode migration
* fix(keboola): cutover hot-patches surfaced on agnes-dev
Five small fixes that were applied as in-container hot-patches during
agnes-dev cutover and need to be on the source-of-truth image so a fresh
upgrade does not undo them.
- app/api/sync.py: auto-discover gate considers the WHOLE registry (any
source, any mode), not just rows where source matches and query_mode
is local. After the v25→v26 keboola materialized migration an
instance can have 30 materialized rows and zero local rows; the
previous gate kept re-firing _discover_and_register_tables every
scheduler tick, creating duplicate auto-discovered rows with the
wrong bucket prefix every time.
- app/api/admin.py: _discover_and_register_tables reassembles the
bucket as <stage>.<bucket-id> (e.g. in.c-finance) instead of
dropping the stage prefix; default query_mode for keboola is now
materialized (the v26 contract); validator allows NULL source_query
for keboola materialized rows (full-table export via Storage API
export-async, no SQL needed).
- cli/commands/admin.py: register-table mirrors the server validator
(NULL source_query allowed for source_type=keboola); --bucket help
text generalized to cover both BQ dataset and Keboola bucket id.
- connectors/keboola/extractor.py: max_line_size=64 MiB on
read_csv_auto so embedded JSON / SQL cells (kbc_component_configuration
in particular) do not trip the default 2 MiB ceiling.
- connectors/keboola/storage_api.py: GCP backend support — when the
Storage API returns a manifest whose slice URLs are gs://
references with a gcsCredentials block, rewrite to the JSON REST
download endpoint and authenticate with the issued OAuth bearer
token; redact tokens in any surfaced error string.
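The gs:// rewrite in the last bullet can be sketched as follows. The helper name is hypothetical, and how the bearer token is pulled out of the gcsCredentials block is not shown because that payload shape is not described here. The target URL form is the Cloud Storage JSON API media download (`alt=media`).

```python
from urllib.parse import quote


def gcs_download_request(gs_url, access_token):
    """Map gs://<bucket>/<object> to a JSON API download URL plus auth headers."""
    prefix = "gs://"
    if not gs_url.startswith(prefix):
        raise ValueError("expected a gs:// URL")
    bucket, _, obj = gs_url[len(prefix):].partition("/")
    if not bucket or not obj:
        raise ValueError("expected gs://<bucket>/<object>")
    # The object name is a single path segment in the JSON API,
    # so every "/" inside it must be percent-encoded.
    encoded = quote(obj, safe="")
    url = f"https://storage.googleapis.com/storage/v1/b/{bucket}/o/{encoded}?alt=media"
    return url, {"Authorization": f"Bearer {access_token}"}
```

The returned headers carry the token, so any error message built from this request should be redacted before it is surfaced.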
* test: align with new keboola materialized + auto-discover-gate contracts
- test_admin_keboola_materialized: rename
test_register_keboola_materialized_rejects_missing_source_query →
test_register_keboola_materialized_accepts_missing_source_query.
v25→v26 introduced 'keboola materialized with NULL source_query
means full-table export via Storage API export-async' as the
default registration shape; the rejection case is no longer the
contract.
- test_sync_filter: add list_all() to _StubRegistry. The auto-discover
gate in _run_sync now keys off the WHOLE registry (not just local
rows) so materialized-only Keboola instances do not re-trigger
discovery on every tick.
* feat(keboola): native parquet export — skip CSV roundtrip
Storage API export-async accepts fileType={csv,parquet}. Switching the
materialized sync to parquet eliminates the CSV → DuckDB COPY → parquet
roundtrip that pinned a single uvicorn worker over 4 GiB on multi-GB
tables (read_csv with all_varchar + max_line_size=64MB has to
materialize the whole CSV in memory before COPY can stream out a
parquet). Snowflake UNLOAD on Keboola's side already produces typed,
self-contained parquet files; the extractor downloads them and renames
into place.
Two cases:
- **Single-file** export (small table): file_info.url points at one
signed URL; download_file streams chunks straight to .parquet.tmp
and we're done. No DuckDB.
- **Sliced** export (Snowflake UNLOAD respects MAX_FILE_SIZE — 16 MiB
default — so anything larger arrives as N parquet slices): each
slice is a complete parquet file with its own footer; naive concat
would corrupt them. download_file_slices keeps the slices as
separate files in a tempdir, then DuckDB COPY (SELECT * FROM
read_parquet([slice0, slice1, ...])) merges them into one
consolidated parquet. DuckDB streams row groups during this — peak
memory bounded to one row group (~1 MiB) regardless of source size.
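For the single-file case above, the stream-to-tmp-then-rename step could look like this minimal sketch. `write_atomically` is a hypothetical helper; the chunk source stands in for whatever iterator the HTTP download yields.

```python
import os
from pathlib import Path


def write_atomically(chunks, target: Path) -> int:
    """Stream byte chunks into <target>.tmp, then rename over target."""
    tmp = target.with_name(target.name + ".tmp")
    written = 0
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
                written += len(chunk)
        # os.replace is atomic when tmp and target share a filesystem, so
        # readers see either the old file or the complete new one.
        os.replace(tmp, target)
    except BaseException:
        tmp.unlink(missing_ok=True)  # do not leave a partial file behind
        raise
    return written
```

A failed download leaves the previous parquet untouched, which is the property the materialized path relies on.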
The legacy CSV path stays as the explicit opt-in via source_query=
'{"file_type":"csv"}' for projects whose backend can't UNLOAD
parquet (none known today; cheap escape hatch). Backward-compat alias
KeboolaStorageClient.export_table_to_csv kept.
Also fixes a latent bug in download_file's gzip detection: previous
heuristic flagged any unencrypted file as gzipped, which would have
corrupted parquet downloads at gunzip time. Name-suffix-only now.
* fix: tempdir leak cleanup, every 0m schedule, /sync/trigger body shapes
Three small self-contained fixes uncovered during agnes-dev cutover.
- connectors/keboola/extractor.py: tempfile.TemporaryDirectory now uses
ignore_cleanup_errors=True so a worker death mid-write doesn't leave
multi-GiB stale slice trees on the boot disk. (12 GiB seen after a
disk-full crash where TemporaryDirectory's own cleanup also raised
and got swallowed.)
- src/scheduler.py: is_valid_schedule accepts 'every 0m' (interval=0
= always due). Force-resync of an errored row no longer requires
waiting out the default 'every 1h' interval — admin can flip the
schedule, trigger, then flip back.
- app/api/sync.py: POST /api/sync/trigger accepts both ['table_id']
(legacy bare-array body) and {'tables': ['table_id']} (matches the
response payload shape, more discoverable for clients building
requests by hand). Malformed bodies return 422 with a structured
detail; null/missing means 'sync everything' as before.
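The two accepted body shapes for the trigger endpoint can be normalised along these lines. This is a sketch only: `parse_trigger_body` is a hypothetical name, the exact rejection rules (for example, refusing empty strings) are assumptions, and the real endpoint maps the error to a 422 response rather than raising ValueError.

```python
from typing import Any, List, Optional


def parse_trigger_body(body: Any) -> Optional[List[str]]:
    """Return the requested table ids, or None for 'sync everything'."""
    if body is None:
        return None
    if isinstance(body, dict):
        # Object form: {"tables": [...]}; a missing or null key means everything.
        body = body.get("tables")
        if body is None:
            return None
    # Legacy form: a bare JSON array of table ids.
    if isinstance(body, list) and all(isinstance(t, str) and t for t in body):
        return body
    raise ValueError("body must be a list of table ids or {'tables': [...]}")
```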
Tests cover: tempdir cleanup on raise (sliced parquet path),
is_valid_schedule + is_table_due 'every 0m' acceptance, and trigger
body parametrized matrix (8 valid shapes + 6 rejection cases).
* fix: targeted-trigger filter in materialized pass + auto-upgrade defer
Two operational gaps observed during agnes-dev cutover, in the same
sync-routing area.
- _run_materialized_pass now takes a 'tables' arg and skips rows not in
the target set with reason='not_in_target'. POST /api/sync/trigger
with a body of tables previously only scoped the legacy extractor
subprocess — the materialized pass kept iterating every due
materialized row, so an admin asking to re-sync kbc_job re-ran
every other due materialized row alongside it. Match on registry id
OR name (admins commonly pass either form). tables=None preserves
the no-filter behavior.
- New GET /api/sync/status (public, no auth) returns {locked: bool}
off _sync_lock.locked(). agnes-auto-upgrade.sh probes this before
docker compose up -d and exits 0 with a 'deferred recreate' log
line if a sync is in flight — the next 5-min cron tick retries.
Pre-fix, an auto-upgrade triggered mid-sync would recreate the
uvicorn worker and kill the in-flight extractor / Snowflake-UNLOAD
download (observed when kbc_job's first 7-day retry got SIGKILLed).
Connection failures in the probe fall through to the upgrade —
being stuck on a wedged image is worse than interrupting a
hypothetical sync.
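The target filter from the first bullet, matching on registry id OR name, might be expressed like this. The function name and the dict-shaped rows are assumptions for the example; only the reason string and the tables=None behaviour come from the notes.

```python
def split_by_target(rows, tables=None):
    """Partition due rows into (to_run, skipped) for a targeted trigger."""
    if tables is None:
        return list(rows), []  # no filter: every due row runs
    wanted = set(tables)
    to_run, skipped = [], []
    for row in rows:
        # Admins pass either the registry id or the name, so accept both.
        if row["id"] in wanted or row["name"] in wanted:
            to_run.append(row)
        else:
            skipped.append({"id": row["id"], "reason": "not_in_target"})
    return to_run, skipped
```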
* fix: auto-discover protects admin overrides + surfaces drift
Two real-world incidents on agnes-dev drove this:
1. kbc_job was registered manually with the correct
(in.c-kbc_telemetry, kbc_job) coordinates. A naive auto-discover
re-run would have inserted a SECOND kbc_job row at the slugified
id 'in_c-keboola-storage_kbc_job' (where Keboola's discovery
places it) — and that row's Storage API export-async 404s.
2. An earlier auto-discover bug stripped the stage prefix from
bucket ids ('c-finance' instead of 'in.c-finance'), inserting
137 rows whose syncs all failed.
Fix:
- _discover_and_register_tables now builds a plan first
(_build_keboola_discovery_plan) classifying each discovered table
into one of new / existing_match / existing_drift / invalid, then
executes only the 'new' entries. Drift rows are reported with both
sides of the disagreement plus drift_kind:
- same_id_diff_coords: registry has the same id but different
bucket / source_table (admin migrated coords inline).
- name_collision: discovery's slugified id differs from any
registry id, but the discovered .name matches an existing row's
.name (case-insensitive). Catches the kbc_job case.
- Bucket detection now prefers the API's authoritative bucket_id
field (separate field on the Keboola tables.list response,
normalised by KeboolaClient.discover_all_tables). Falls back to
id-string parsing only when bucket_id is missing (older fallback
path inside discover_all_tables).
- Endpoint POST /api/admin/discover-and-register?dry_run=true
returns the plan without writing — would_register, drift,
invalid lists. Lets an operator audit before merging discovery
with a registry that has admin overrides.
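The classification step can be pictured with the sketch below. It is an assumption-laden illustration, not `_build_keboola_discovery_plan` itself: the function name, the dict row shape, and the validity check are invented for the example, while the four classes and the two drift kinds are taken from the description above.

```python
def build_discovery_plan(discovered, registry):
    """Classify discovered tables against existing registry rows."""
    plan = {"new": [], "existing_match": [], "existing_drift": [], "invalid": []}
    by_id = {r["id"]: r for r in registry}
    by_name = {r["name"].lower(): r for r in registry}
    for t in discovered:
        # Assumed validity rule: all four coordinates must be non-empty.
        if not all(t.get(k) for k in ("id", "name", "bucket", "source_table")):
            plan["invalid"].append(t)
            continue
        existing = by_id.get(t["id"])
        if existing is not None:
            same = (existing["bucket"], existing["source_table"]) == (
                t["bucket"], t["source_table"])
            if same:
                plan["existing_match"].append(t)
            else:
                plan["existing_drift"].append({
                    "drift_kind": "same_id_diff_coords",
                    "discovered": t, "registry": existing})
            continue
        clash = by_name.get(t["name"].lower())  # case-insensitive name match
        if clash is not None:
            plan["existing_drift"].append({
                "drift_kind": "name_collision",
                "discovered": t, "registry": clash})
        else:
            plan["new"].append(t)
    return plan
```

Only `plan["new"]` would be written; a dry run returns the whole plan unchanged.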
Removed 'every 0m' from test_register_request_rejects_malformed_sync_schedule
— the runtime started accepting it in the previous commit (force-resync
override) and the validator follows suit.
* feat(keboola): AGNES_TEMP_DIR routes tempfiles off overlayfs /tmp
The container's /tmp lives on the boot disk's overlayfs (29 GiB on
agnes-dev, shared with /var). Snowflake UNLOAD of a wide table writes
slices into per-call /tmp tempdirs, so multi-GiB / many-slice exports
fill the boot disk long before the dedicated data disk fills. agnes-dev
hit 100% boot-disk usage while the 20 GiB data disk had 15 GiB free.
connectors.keboola.storage_api.get_temp_root() reads AGNES_TEMP_DIR;
mkdirs the target on first use; unset / empty / unwritable falls
back to None (system tempdir, OSS-pre-fix behaviour).
materialize_query (parquet path), _extract_via_legacy (CSV fallback),
and the sliced-CSV concat path in storage_api all use the helper now.
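A helper with the fallback behaviour described above could be sketched like this. The body is an assumption about how such a helper might be written, not the shipped `get_temp_root`; the `env` parameter and `new_workdir` are hypothetical additions that make the example testable.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def get_temp_root(env=os.environ) -> Optional[str]:
    """Return AGNES_TEMP_DIR if usable, else None (meaning: system tempdir)."""
    raw = (env.get("AGNES_TEMP_DIR") or "").strip()
    if not raw:
        return None  # unset or empty
    try:
        Path(raw).mkdir(parents=True, exist_ok=True)  # create on first use
    except OSError:
        return None  # cannot be created
    if not os.access(raw, os.W_OK | os.X_OK):
        return None  # exists but is not writable
    return raw


def new_workdir(prefix="agnes-") -> str:
    """Create a per-call working directory under the configured root."""
    # dir=None makes tempfile fall back to its normal default location.
    return tempfile.mkdtemp(prefix=prefix, dir=get_temp_root())
```

Returning None instead of raising keeps single-disk setups working with no configuration at all.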
docker-compose.yml defaults AGNES_TEMP_DIR=/data/tmp on app, scheduler,
and extract services. The data volume is the dedicated disk in
production layouts and a plain docker volume in single-disk
dev/laptop setups — same blast radius as the previous /tmp default
on the latter, no regression.
701 lines
32 KiB
Python
"""Sync orchestrator — ATTACHes extract.duckdb files into master analytics.duckdb.

Remote table support
--------------------
Extractors that create views referencing external DuckDB extensions (e.g. Keboola,
BigQuery) must include a ``_remote_attach`` table in their extract.duckdb:

    CREATE TABLE _remote_attach (
        alias VARCHAR,      -- DuckDB alias used in views, e.g. 'kbc'
        extension VARCHAR,  -- Extension name, e.g. 'keboola'
        url VARCHAR,        -- Connection URL
        token_env VARCHAR   -- Env-var name holding the auth token (NOT the token itself).
                            -- Empty string for BigQuery — orchestrator detects
                            -- extension='bigquery' and refreshes the token from the
                            -- GCE metadata server on its own.
    );

At rebuild time the orchestrator reads ``_remote_attach``, installs/loads the
extension, then either: (a) for BigQuery, fetches a fresh access token from the
GCE metadata server and creates a session-scoped DuckDB SECRET before ATTACH;
(b) for sources with a non-empty ``token_env``, reads that env var and passes
the token inline; (c) ATTACHes without auth. Views referencing
``bq."dataset"."table"`` or ``kbc."bucket"."table"`` then resolve correctly.

Note: BQ secrets are session-scoped, so ``src.db._reattach_remote_extensions``
re-fetches the metadata token and re-creates the secret each time a read-only
analytics connection is opened.
"""

import hashlib
import logging
import os
import threading
from pathlib import Path
from typing import Dict, List, Optional

import duckdb

from connectors.bigquery.auth import get_metadata_token, BQMetadataAuthError
from src.orchestrator_security import (
    escape_sql_string_literal,
    is_builtin_extension,
    is_extension_allowed,
    is_token_env_allowed,
)

logger = logging.getLogger(__name__)

_rebuild_lock = threading.Lock()

# Identifier validation lives in src/identifier_validation.py so the
# orchestrator and the extractors share the same regex (#81 Group D).
# The local names are kept as aliases so existing call sites need no
# rename — they import from a single source of truth now.
from src.identifier_validation import (  # noqa: E402
    _SAFE_IDENTIFIER,  # noqa: F401  (re-exported for any historical caller)
    validate_identifier as _validate_identifier,
)


def _atomic_swap_db(tmp_path: str, target_path: str) -> None:
    """Atomically replace target DuckDB file, cleaning up WAL files."""
    import shutil
    target = Path(target_path)
    tmp = Path(tmp_path)

    # Remove old WAL file if it exists
    old_wal = Path(str(target) + ".wal")
    if old_wal.exists():
        old_wal.unlink()

    # Move temp DB into place
    if tmp.exists():
        shutil.move(str(tmp), str(target))

    # Clean up temp WAL
    tmp_wal = Path(str(tmp) + ".wal")
    if tmp_wal.exists():
        tmp_wal.unlink()


def _get_extracts_dir() -> Path:
    data_dir = Path(os.environ.get("DATA_DIR", "./data"))
    return data_dir / "extracts"


class SyncOrchestrator:
    """Scans /data/extracts/*, ATTACHes each extract.duckdb, creates master views."""

    def __init__(self, analytics_db_path: str | None = None):
        # analytics_db_path allows override for testing
        if analytics_db_path:
            self._db_path = analytics_db_path
        else:
            data_dir = Path(os.environ.get("DATA_DIR", "./data"))
            self._db_path = str(data_dir / "analytics" / "server.duckdb")
        Path(self._db_path).parent.mkdir(parents=True, exist_ok=True)

    def rebuild(self) -> Dict[str, List[str]]:
        """Scan all extract directories, ATTACH each, create master views.

        Returns: {source_name: [table_names]} for logging.
        """
        with _rebuild_lock:
            return self._do_rebuild()

    def rebuild_source(self, source_name: str) -> List[str]:
        """Rebuild views from a single source (e.g. after Jira webhook)."""
        with _rebuild_lock:
            return self._do_rebuild_source(source_name)

    def _scan_meta_pairs(self, extracts_dir: Path) -> tuple:
        """Read every connector's `_meta` and return (pairs, clean) where:

        - ``pairs`` — list of (source_name, table_name) tuples successfully
          gathered from `_meta`.
        - ``clean`` — True iff every source's pre-scan succeeded. False if
          any source's `_meta` couldn't be read (transient I/O, mid-write,
          missing/corrupt extract.duckdb).

        Used by view_ownership.reconcile to release stale claims before
        the main rebuild loop tries to claim new names. The ``clean`` flag
        guards against a correctness bug: if source B's pre-scan fails
        and we naively reconcile against an incomplete `pairs` list, B's
        prior ownership is dropped, and another source could claim B's
        name in the same rebuild — a silent overwrite, exactly what
        Group C is meant to prevent. Callers MUST skip reconcile when
        ``clean`` is False; per-row claim-time collision detection still
        catches actual collisions.
        """
        pairs: List[tuple] = []
        clean = True
        for ext_dir in sorted(extracts_dir.iterdir()):
            if not ext_dir.is_dir():
                continue
            db_file = ext_dir / "extract.duckdb"
            if not db_file.exists():
                continue
            if not _validate_identifier(ext_dir.name, "source_name"):
                continue
            try:
                ro_conn = duckdb.connect(str(db_file), read_only=True)
                try:
                    rows = ro_conn.execute(
                        "SELECT table_name FROM _meta"
                    ).fetchall()
                    for (table_name,) in rows:
                        if _validate_identifier(table_name, "table_name"):
                            pairs.append((ext_dir.name, table_name))
                finally:
                    ro_conn.close()
            except Exception as e:
                logger.warning(
                    "scan_meta_pairs: failed to read %s (%s) — "
                    "skipping reconcile this rebuild to avoid releasing "
                    "ownerships prematurely",
                    ext_dir.name, e,
                )
                clean = False
        return pairs, clean

    def _do_rebuild(self) -> Dict[str, List[str]]:
        extracts_dir = _get_extracts_dir()
        if not extracts_dir.exists():
            logger.warning("Extracts directory %s does not exist", extracts_dir)
            return {}

        # Issue #81 Group C — load view ownership map from system DB so we
        # can detect cross-connector view-name collisions during this
        # rebuild and refuse to silently overwrite a previously-claimed
        # name. The map is kept in system.duckdb (analytics.duckdb is
        # rebuilt fresh each time and would not survive).
        from src.db import get_system_db
        from src.repositories.view_ownership import ViewOwnershipRepository
        sys_conn_for_views = get_system_db()
        view_repo = None
        try:
            view_repo = ViewOwnershipRepository(sys_conn_for_views)
            # Pre-scan every connector's _meta so we can run the reconcile
            # pass BEFORE claims are evaluated. This makes "owner stopped
            # publishing → name freed → another source can claim" work in
            # the SAME rebuild rather than requiring two consecutive runs.
            #
            # Correctness: only reconcile when EVERY source's pre-scan
            # succeeded. Otherwise a transient I/O failure on source B
            # would drop B's prior ownership and let another source steal
            # B's name — silent overwrite, exactly the bug Group C
            # prevents. Per-row claim-time collision detection still
            # catches actual collisions even without reconcile this run.
            current_pairs, pre_scan_clean = self._scan_meta_pairs(extracts_dir)
            if pre_scan_clean:
                view_repo.reconcile(current_pairs)
            else:
                logger.warning(
                    "view_ownership: skipping reconcile this rebuild — "
                    "pre-scan was incomplete; renamed tables will release "
                    "their names on the next clean rebuild instead"
                )
            existing_owners = view_repo.get_all()
        except Exception as e:
            logger.warning(
                "view_ownership pre-scan failed: %s — proceeding without "
                "collision detection", e,
            )
            existing_owners = {}
            view_repo = None
            try:
                sys_conn_for_views.close()
            except Exception:
                pass
            sys_conn_for_views = None

        # Track every (source, view) pair this rebuild successfully claims.
        claimed_pairs: List[tuple] = []

        result = {}
        # Write to temp file then rename — avoids lock conflict with query endpoint
        tmp_path = self._db_path + ".tmp"
        if Path(tmp_path).exists():
            Path(tmp_path).unlink()
        conn = duckdb.connect(tmp_path)
        try:
            # Detach any previously attached databases (except main and temp)
            attached = [
                row[0]
                for row in conn.execute(
                    "SELECT database_name FROM duckdb_databases() "
                    "WHERE database_name NOT IN ('memory', 'system', 'temp')"
                ).fetchall()
            ]
            for db_name in attached:
                if db_name != Path(self._db_path).stem:
                    try:
                        conn.execute(f"DETACH {db_name}")
                    except Exception:
                        pass

            for ext_dir in sorted(extracts_dir.iterdir()):
                if not ext_dir.is_dir():
                    continue
                db_file = ext_dir / "extract.duckdb"
                if not db_file.exists():
                    logger.debug("Skipping %s — no extract.duckdb", ext_dir.name)
                    continue

                if not _validate_identifier(ext_dir.name, "source_name"):
                    continue

                tables = self._attach_and_create_views(
                    conn, ext_dir.name, str(db_file),
                    existing_owners=existing_owners,
                    claimed_pairs=claimed_pairs,
                    view_repo=view_repo if sys_conn_for_views else None,
                )
                if tables:
                    result[ext_dir.name] = tables
                    logger.info("Attached %s: %d tables", ext_dir.name, len(tables))

            # No end-of-rebuild reconcile: the pre-scan reconcile above
            # already released stale ownerships using a complete view of
            # every source's `_meta`. Reconciling again here against
            # `claimed_pairs` (which excludes refused collisions and any
            # source that failed to attach) would incorrectly drop the
            # legitimate prior owner of a name when its DB happens to be
            # transiently unreadable. See test
            # `test_pre_scan_failure_does_not_release_ownership` for the
            # contract.
        finally:
            conn.execute("CHECKPOINT")
            conn.close()
            if sys_conn_for_views is not None:
                try:
                    sys_conn_for_views.close()
                except Exception:
                    pass

        # Atomic swap: replace analytics.duckdb with new version
        _atomic_swap_db(tmp_path, self._db_path)

        return result

    def _do_rebuild_source(self, source_name: str) -> List[str]:
        """Rebuild views for a single source by doing a full rebuild.

        A full rebuild is necessary because the analytics DB is created fresh
        each time (temp file + atomic swap). Rebuilding only one source would
        destroy views from all other sources.
        """
        extracts_dir = _get_extracts_dir()
        db_file = extracts_dir / source_name / "extract.duckdb"
        if not db_file.exists():
            logger.warning("No extract.duckdb for source %s", source_name)
            return []

        result = self._do_rebuild()
        return result.get(source_name, [])

def _attach_and_create_views(
|
|
self,
|
|
conn: duckdb.DuckDBPyConnection,
|
|
source_name: str,
|
|
db_path: str,
|
|
existing_owners: Optional[Dict[str, str]] = None,
|
|
claimed_pairs: Optional[List[tuple]] = None,
|
|
view_repo=None,
|
|
) -> List[str]:
|
|
"""ATTACH extract.duckdb, read _meta, create views in master.
|
|
|
|
Issue #81 Group C — when ``existing_owners`` and ``view_repo`` are
|
|
provided, the orchestrator checks for cross-connector view-name
|
|
collisions and refuses to overwrite a name owned by another source.
|
|
``claimed_pairs`` accumulates the (source, view) tuples this
|
|
rebuild successfully claims; the caller uses it for end-of-rebuild
|
|
reconcile.
|
|
"""
|
|
if existing_owners is None:
|
|
existing_owners = {}
|
|
tables = []
|
|
try:
|
|
conn.execute(f"ATTACH '{db_path}' AS {source_name} (READ_ONLY)")
|
|
|
|
# Re-ATTACH external extensions needed by remote views
|
|
self._attach_remote_extensions(conn, source_name)
|
|
|
|
# Read _meta to know what's available
|
|
meta_rows = conn.execute(
|
|
f"SELECT table_name, rows, size_bytes, query_mode "
|
|
f"FROM {source_name}._meta"
|
|
).fetchall()
|
|
|
|
# Pre-fetch the set of names that actually exist as views/tables in
|
|
# the attached extract.duckdb. Most connectors emit a `_meta` row
|
|
# alongside an inner view per registered name; the keboola
|
|
# extractor with `use_extension=False` (and other connectors)
|
|
# may insert `_meta` rows whose inner view doesn't exist yet —
|
|
# skip those to avoid creating a master view that would resolve
|
|
# to nothing.
|
|
inner_objects = {
|
|
row[0]
|
|
for row in conn.execute(
|
|
"SELECT table_name FROM information_schema.tables "
|
|
f"WHERE table_catalog='{source_name}'"
|
|
).fetchall()
|
|
}
|
|
|
|
for table_name, rows, size_bytes, query_mode in meta_rows:
|
|
if not _validate_identifier(table_name, "table_name"):
|
|
continue
|
|
if table_name not in inner_objects:
|
|
# `_meta` row without an inner object. Post-#160 the
|
|
# BigQuery extractor no longer emits these for unsupported
|
|
# entity types (it skips both the view AND the _meta row),
|
|
# so this branch fires for the keboola use_extension=False
|
|
# path and any future connector that splits writes across
|
|
# commits. Skip master-view creation; subsequent rows
|
|
# continue normally.
|
|
logger.info(
|
|
"Skipping master view for %s.%s — no inner object",
|
|
source_name, table_name,
|
|
)
|
|
continue
|
|
|
|
# Issue #81 Group C — refuse cross-connector collisions.
|
|
# First-come-first-served: the source already in
|
|
# view_ownership keeps the name; any other source that
|
|
# tries to claim it gets logged + skipped until the
|
|
# operator renames one side. Re-claim by the same source
|
|
# is fine (idempotent rebuild).
|
|
if view_repo is not None:
|
|
if not view_repo.claim(table_name, source_name):
|
|
prior_owner = (
|
|
view_repo.get_owner(table_name)
|
|
or existing_owners.get(table_name, "<unknown>")
|
|
)
|
|
logger.error(
|
|
"view_ownership collision: %s already owns view %r; "
|
|
"%s.%s will NOT be exposed. Rename `name` in the "
|
|
"table_registry on one side to resolve.",
|
|
prior_owner, table_name, source_name, table_name,
|
|
)
|
|
continue
|
|
if claimed_pairs is not None:
|
|
claimed_pairs.append((source_name, table_name))
|
|
|
|
try:
|
|
conn.execute(
|
|
f"CREATE OR REPLACE VIEW \"{table_name}\" AS "
|
|
f"SELECT * FROM {source_name}.\"{table_name}\""
|
|
)
|
|
tables.append(table_name)
|
|
except Exception as e:
|
|
# Per-row catch so one bad row doesn't drop the rest of
|
|
# the source's master views from the rebuild.
|
|
logger.error(
|
|
"Failed to create master view for %s.%s: %s",
|
|
source_name, table_name, e,
|
|
)
|
|
|
|
# Filesystem-fallback master views (0.41.0). The 0.40.0 fix in
|
|
# `materialize_query` tries to register the parquet in
|
|
# `extract.duckdb`'s `_meta` + inner view, but the open-as-
|
|
# second-write-handle from the same uvicorn process collides
|
|
# with the existing read-only ATTACH that `rebuild()` itself
|
|
# holds (`Unique file handle conflict: Cannot attach "extract"
|
|
# — already attached by database "<source>"`). The 0.40.0
|
|
# helper logs a WARNING and falls through, parquet is
|
|
# canonical, but the master view never appears via the meta
|
|
# path. This second pass scans `<extract_dir>/data/*.parquet`
|
|
# directly and creates a master view via `read_parquet()` for
|
|
# any parquet that didn't already get one through the meta
|
|
# path. Decoupled from materialize_query's open-handle race;
|
|
# robust against any registration drift between materialize
|
|
# and rebuild.
|
|
try:
|
|
extracts_dir = _get_extracts_dir()
|
|
except Exception:
|
|
extracts_dir = None
|
|
            if extracts_dir is not None:
                data_dir = extracts_dir / source_name / "data"
                if data_dir.exists():
                    # Resolve the set of registry-known table_ids for this
                    # source. The fallback is a master-view recovery path
                    # for parquets that materialize_query wrote but
                    # couldn't register in `_meta`; an **orphan** parquet
                    # (registry row deleted by `DELETE /api/admin/registry`
                    # but parquet not yet cleaned up) must NOT get a
                    # master view — that would resurrect a deleted table.
                    # Pre-existing test `test_orchestrator_skips_orphan_
                    # parquet_in_extracts` pins this contract.
                    registered_ids: Optional[set] = None
                    try:
                        from src.db import get_system_db
                        from src.repositories.table_registry import (
                            TableRegistryRepository,
                        )
                        sys_conn = get_system_db()
                        try:
                            rows = TableRegistryRepository(sys_conn).list_all()
                            # Match parquet stems against registry rows for
                            # THIS source where query_mode='materialized'.
                            # The parquet filename is keyed by registry
                            # `name` (per `_run_materialized_pass` /
                            # `materialize_query` convention).
                            registered_ids = {
                                str(r.get("name"))
                                for r in rows
                                if (r.get("source_type") or "") == source_name
                                and (r.get("query_mode") or "") == "materialized"
                                and r.get("name")
                            }
                        finally:
                            try:
                                sys_conn.close()
                            except Exception:
                                pass
                    except Exception as e:
                        # No registry access (test fixture, transient DB
                        # error) — skip the fallback rather than risk
                        # exposing orphan parquets.
                        logger.warning(
                            "filesystem-fallback: registry read failed (%s); "
                            "skipping fallback scan for %s — orphan parquets "
                            "from a prior DELETE could otherwise be exposed.",
                            e, source_name,
                        )
                        registered_ids = None

                    if registered_ids is not None:
                        already_created = set(tables)
                        for parquet_path in sorted(data_dir.glob("*.parquet")):
                            table_id = parquet_path.stem
                            if not _validate_identifier(table_id, "fs_fallback table_id"):
                                continue
                            if table_id in already_created:
                                continue
                            # Only register parquets that have a live
                            # materialized registry row. Orphans skip.
                            if table_id not in registered_ids:
                                logger.debug(
                                    "filesystem-fallback: skipping orphan "
                                    "parquet %s/%s (no registry row)",
                                    source_name, table_id,
                                )
                                continue
                            # view_repo claim — same first-come-first-served
                            # rule as the meta-path branch above.
                            if view_repo is not None:
                                if not view_repo.claim(table_id, source_name):
                                    prior_owner = (
                                        view_repo.get_owner(table_id)
                                        or existing_owners.get(table_id, "<unknown>")
                                    )
                                    logger.error(
                                        "view_ownership collision: %s already owns view %r; "
                                        "%s.%s (filesystem-fallback) will NOT be exposed.",
                                        prior_owner, table_id, source_name, table_id,
                                    )
                                    continue
                                if claimed_pairs is not None:
                                    claimed_pairs.append((source_name, table_id))
                            try:
                                safe_path = str(parquet_path).replace("'", "''")
                                conn.execute(
                                    f"CREATE OR REPLACE VIEW \"{table_id}\" AS "
                                    f"SELECT * FROM read_parquet('{safe_path}')"
                                )
                                tables.append(table_id)
                                logger.info(
                                    "filesystem-fallback master view created: "
                                    "%s/%s (parquet at %s) — meta row was missing",
                                    source_name, table_id, parquet_path,
                                )
                            except Exception as e:
                                logger.error(
                                    "filesystem-fallback master view failed for %s/%s: %s",
                                    source_name, table_id, e,
                                )

            # Update sync_state in system DB
            self._update_sync_state(meta_rows, source_name)

        except Exception as e:
            logger.error("Failed to attach %s: %s", source_name, e)

        return tables

    def _attach_remote_extensions(
        self, conn: duckdb.DuckDBPyConnection, source_name: str
    ) -> None:
        """Read _remote_attach from extract.duckdb and ATTACH external sources."""
        try:
            # DuckDB attached-DB layout: ATTACH 'extract.duckdb' AS <alias>
            # exposes information_schema.tables with table_catalog=<alias>
            # and table_schema='main'. The earlier draft used
            # table_schema=<alias> here, which never matched and made
            # _attach_remote_extensions a silent no-op for every
            # connector — defeating the entire Group A hardening in
            # production. db.py:_reattach_remote_extensions already used
            # the correct column; this aligns the rebuild path.
            tables = conn.execute(
                f"SELECT table_name FROM information_schema.tables "
                f"WHERE table_catalog='{source_name}' AND table_name='_remote_attach'"
            ).fetchall()
            if not tables:
                return
        except Exception:
            return

        rows = conn.execute(
            f"SELECT alias, extension, url, token_env FROM {source_name}._remote_attach"
        ).fetchall()

        for alias, extension, url, token_env in rows:
            # Identifier sanity (defense against weird input). The hard
            # security boundary is the allowlist a few lines down.
            if not _validate_identifier(alias, "remote_attach alias"):
                continue
            if not _validate_identifier(extension, "remote_attach extension"):
                continue

            # #81 Group A.1 — extension allowlist. The connector does NOT
            # get to pick what extensions the orchestrator loads.
            if not is_extension_allowed(extension):
                logger.error(
                    "Remote attach %s: extension %r is not in the allowlist; refusing. "
                    "Override via AGNES_REMOTE_ATTACH_EXTENSIONS if intended.",
                    alias, extension,
                )
                continue

            # #81 Group A.2 — token-env hard allowlist. Refuses well-known
            # runtime secrets (JWT_SECRET_KEY, OPENAI_API_KEY, …) that a
            # malicious connector might ask us to send to its server.
            if token_env and not is_token_env_allowed(token_env):
                logger.error(
                    "Remote attach %s: token_env %r is not in the allowlist; refusing. "
                    "Override via AGNES_REMOTE_ATTACH_TOKEN_ENVS if intended.",
                    alias, token_env,
                )
                continue

            token = os.environ.get(token_env, "") if token_env else ""
            if token_env and not token:
                logger.warning(
                    "Remote attach %s: env var %s not set, skipping", alias, token_env
                )
                continue

            try:
                # Skip if already attached (e.g. multiple sources share same extension)
                attached = {
                    r[0] for r in conn.execute(
                        "SELECT database_name FROM duckdb_databases()"
                    ).fetchall()
                }
                if alias in attached:
                    logger.debug("Remote source %s already attached", alias)
                    continue

                # #81 Group A.1 — built-ins LOAD only; community needs INSTALL+LOAD.
                if is_builtin_extension(extension):
                    conn.execute(f"LOAD {extension};")
                else:
                    conn.execute(f"INSTALL {extension} FROM community; LOAD {extension};")
                # #81 Group A.3 — escape URL single-quotes (mirrors src/db.py).
                safe_url = escape_sql_string_literal(url)

                # BQ-specific: refresh token from GCE metadata, create session-scoped
                # secret before ATTACH. Empty token_env (set by the BQ extractor) is
                # the contract that signals "use built-in metadata path".
                if extension == "bigquery":
                    try:
                        bq_token = get_metadata_token()
                    except BQMetadataAuthError as e:
                        logger.error(
                            "Failed to fetch BQ metadata token for %s: %s — skipping ATTACH",
                            alias, e,
                        )
                        continue
                    escaped = escape_sql_string_literal(bq_token)
                    secret_name = f"bq_secret_{alias}"
                    conn.execute(
                        f"CREATE OR REPLACE SECRET {secret_name} "
                        f"(TYPE bigquery, ACCESS_TOKEN '{escaped}')"
                    )
                    from connectors.bigquery.access import apply_bq_session_settings
                    apply_bq_session_settings(conn)
                    conn.execute(
                        f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)"
                    )
                elif token:
                    escaped_token = escape_sql_string_literal(token)
                    conn.execute(
                        f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, TOKEN '{escaped_token}')"
                    )
                    # Apply BQ session settings on every BQ-extension attach,
                    # not only the metadata-token branch above. The token-based
                    # branch previously fell through without calling
                    # apply_bq_session_settings, leaving the 90 s extension
                    # default for bq_query_timeout_ms in place.
                    if extension == "bigquery":
                        from connectors.bigquery.access import apply_bq_session_settings
                        apply_bq_session_settings(conn)
                else:
                    # No auth required (or extension handles it via env automatically)
                    conn.execute(
                        f"ATTACH '{safe_url}' AS {alias} (TYPE {extension}, READ_ONLY)"
                    )
                    if extension == "bigquery":
                        from connectors.bigquery.access import apply_bq_session_settings
                        apply_bq_session_settings(conn)

                logger.info("Attached remote source %s via %s extension", alias, extension)
            except Exception as e:
                logger.error("Failed to attach remote source %s: %s", alias, e)

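The ATTACH statements above interpolate values straight into SQL text, so every string literal is escaped before it is spliced in. A rough sketch of that pattern follows, assuming the escape step does nothing more than double single quotes (which is what the inline `safe_path` line in the fallback pass does). `quote_literal`, `build_attach_sql` and the identifier regex are hypothetical; the real `escape_sql_string_literal` and `_validate_identifier` are not shown in this diff and may behave differently.

```python
import re
from typing import Optional

# Assumption: a conservative identifier pattern chosen for this sketch.
_IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def quote_literal(value: str) -> str:
    """Wrap value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def build_attach_sql(
    url: str, alias: str, extension: str, token: Optional[str] = None
) -> str:
    """Build an ATTACH statement from inputs that already passed the allowlists."""
    for ident in (alias, extension):
        # Identifiers cannot be quoted as literals, so reject anything unusual.
        if not _IDENT_RE.fullmatch(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    if token:
        options = f"TYPE {extension}, TOKEN {quote_literal(token)}"
    else:
        options = f"TYPE {extension}, READ_ONLY"
    return f"ATTACH {quote_literal(url)} AS {alias} ({options})"


if __name__ == "__main__":
    print(build_attach_sql("https://example.invalid/v2", "kbc", "keboola", "t'k"))
```

The split mirrors the loop above: identifiers are checked and refused, while URLs and tokens are escaped and quoted.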
    def _update_sync_state(self, meta_rows: list, source_name: str) -> None:
        """Update sync_state table in system.duckdb from _meta entries.

        The hash stored here MUST match what `agnes pull` computes
        client-side via `cli/commands/sync.py:_md5_file` and what the
        materialized SQL path stores via `app/api/sync.py:_file_hash` —
        otherwise the CLI's post-download integrity check fails for every
        local-mode table with `hash mismatch: expected … got …`. That's
        a full content MD5 (`hashlib.md5(bytes).hexdigest()`), no
        truncation.

        Pre-fix this method computed `md5(f"{mtime_ns}:{size}")[:12]` —
        a fingerprint, not a content hash, and 12-char truncated to boot
        — which the CLI's full-32-char content MD5 could never match.
        Symptom: `agnes pull` failed with hash mismatch on every Keboola
        local-mode table because their sync_state hashes came from this
        path while their on-disk content was unrelated.
        """
        try:
            from src.db import get_system_db
            from src.repositories.sync_state import SyncStateRepository

            extracts_dir = _get_extracts_dir()
            sys_conn = get_system_db()
            try:
                repo = SyncStateRepository(sys_conn)
                for table_name, rows, size_bytes, query_mode in meta_rows:
                    pq_path = extracts_dir / source_name / "data" / f"{table_name}.parquet"
                    file_hash = ""
                    if pq_path.exists():
                        h = hashlib.md5()
                        with open(pq_path, "rb") as f:
                            for chunk in iter(lambda: f.read(8192), b""):
                                h.update(chunk)
                        file_hash = h.hexdigest()

                    repo.update_sync(
                        table_id=table_name,
                        rows=rows or 0,
                        file_size_bytes=size_bytes or 0,
                        hash=file_hash,
                    )
            finally:
                sys_conn.close()
        except Exception as e:
            logger.warning("Could not update sync_state: %s", e)