Closes M15 from issue #81 — SQL injection via attacker-controlled
identifiers in connectors/keboola/extractor.py and
connectors/bigquery/extractor.py.
Lifted _validate_identifier from src/orchestrator.py into a new
src/identifier_validation.py shared module (single source of truth for
both layers). Two validator policies:
- validate_identifier (strict, ^[a-zA-Z_][a-zA-Z0-9_]{0,63}$) for
table_name — matches the orchestrator's rebuild-time check, so dashed
names fail fast at extraction rather than being silently dropped.
- validate_quoted_identifier (relaxed, accepts dashes/dots) for
bucket/dataset/source_table — Keboola in.c-foo and BigQuery
my-dataset are legitimate, just need to be safe inside `"..."`.
Both extractors skip-and-continue on unsafe rows (logged + counted in
failure stats); _extract_via_extension re-validates as defense-in-depth.
71/71 extractor + orchestrator tests pass.
Refs #81 Group D.
Write to extract.duckdb.tmp, then atomically swap into place with WAL cleanup.
Prevents lock conflicts with orchestrator holding read lock on existing database.
DuckDB has used WAL by default since v0.8, so this pragma is not
valid DuckDB syntax. Removed obsolete try-except block that attempted
to enable WAL on system database initialization.
BigQuery extension handles auth via GOOGLE_APPLICATION_CREDENTIALS env var,
so _remote_attach uses empty token_env. Orchestrator now supports both
token-based (Keboola) and env-based (BigQuery) authentication modes.
When BQ returns empty results (e.g., data not yet refreshed), the
scheduler was marking sync as complete for the day. This meant the
next 15-min tick would skip it ("none are due") and data would stay
stale until the next day's scheduled run.
Now: if partitioned sync processes partitions but gets 0 new rows,
last_sync is NOT updated. The scheduler will retry on the next tick
(15 min later) when data may be available.
Without explicit bqstorage_client parameter, to_arrow_iterable() silently
falls back to REST API pagination (~5K rows/sec). With explicit client,
it uses parallel gRPC streams via BQ Storage API (~300K rows/sec).
No temp table materialization - BQ already writes query results to an
internal temp table automatically. We just tell the reader to use the
fast gRPC path instead of slow HTTP pagination.
QueryJob only has to_arrow(), not to_arrow_iterable().
Must call query_job.result() first to get RowIterator,
which has the streaming to_arrow_iterable() method.
Replace to_arrow() (loads entire result into RAM) with
to_arrow_iterable() (streams RecordBatches). Each batch is written
directly to disk via ParquetWriter - constant memory regardless
of table size. Prevents OOM on 8GB server for multi-million row tables.
Propagate column selection and row filtering from data_description.md
through the BigQuery adapter to the BQ client. This enables exporting
only needed columns and applying date range filters at the SQL level,
critical for large DataView tables (e.g., 412-col unit_economics).
BigQuery connector that syncs BQ tables to local Parquet files via PyArrow
(no CSV intermediate step). Supports full refresh, timestamp-based
incremental (via incremental_column), and partition-based sync strategies.
- connectors/bigquery/client.py: BQ API wrapper with ADC auth, parameterized
queries, metadata cache, cross-project support (job project != data project)
- connectors/bigquery/adapter.py: DataSource implementation with merge/dedup
- src/config.py: Add incremental_column field to TableConfig
- 72 unit tests (mocked, no GCP SDK required)