Data Sources
Overview
AI Data Analyst uses a connector system where each connector produces an extract.duckdb following a standard contract. The SyncOrchestrator auto-discovers and ATTACHes these into the master analytics.duckdb.
Configure the data source type in config/instance.yaml:
data_source:
  type: "keboola" # Options: keboola, bigquery
Table definitions are stored in the DuckDB table_registry table (not in config files). Register tables via the admin API, CLI, or web UI.
Query Modes
Each table has a query_mode that determines how data is accessed:
- local: Data is downloaded to parquet files on the Agnes server. Suitable for tables that fit in local storage.
- remote: Data stays in the external source; DuckDB extension ATTACHes at query time. Suitable for large tables where only query results are transferred.
Keboola Connector
Syncs tables from Keboola Storage API using the DuckDB Keboola extension.
Requirements
- Keboola Storage API token with read access
- DuckDB Keboola extension (auto-installed)
Configuration
In .env:
KEBOOLA_STORAGE_TOKEN=your-token-here
KEBOOLA_STACK_URL=https://connection.your-region.keboola.com
KEBOOLA_PROJECT_ID=12345
Or configure via the admin UI (/admin/tables) or CLI:
da admin register-table --source-type keboola --bucket "in.c-crm" --table "company" --query-mode local
How it works
- The extractor (connectors/keboola/extractor.py) uses the DuckDB Keboola extension to download data
- Produces extract.duckdb with _meta table + parquet files in /data/extracts/keboola/data/
- The SyncOrchestrator ATTACHes extract.duckdb into analytics.duckdb and creates views
Identifier validation
All Keboola table names, bucket names, and source table identifiers are validated against _SAFE_QUOTED_IDENTIFIER regex before use. Invalid identifiers are skipped with error logging.
BigQuery Connector
Queries BigQuery tables on-demand using the DuckDB BigQuery extension (remote attach).
Requirements
- Google Cloud project with BigQuery access
- Application Default Credentials (ADC) configured
Configuration
In config/instance.yaml:
bigquery:
  project_id: "your-gcp-project"
Or via the admin UI or CLI:
da admin register-table --source-type bigquery --bucket "dataset" --table "table" --query-mode remote
Authentication
Uses Application Default Credentials (ADC) — the standard Google auth fallback chain:
- GOOGLE_APPLICATION_CREDENTIALS env var (service account key JSON)
- gcloud user credentials (gcloud auth application-default login)
- GCE metadata server (automatic on Compute Engine)
No explicit key file configuration needed — ADC handles it.
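When a BigQuery query fails with an authentication error, it can help to see which step of the chain is likely to apply on the current machine. The following is a rough diagnostic sketch, not part of the connector: the function name likely_adc_source is hypothetical, it checks only the default Linux/macOS location of the gcloud credentials file, and it does not contact the metadata server.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def likely_adc_source(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> str:
    """Guess which ADC step would be used (hypothetical helper, for diagnostics only)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # Step 1: an explicit service account key file.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        return f"service account key file: {key_path}"

    # Step 2: file written by `gcloud auth application-default login`
    # (default location on Linux/macOS; other platforms are not handled here).
    gcloud_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if gcloud_file.is_file():
        return f"gcloud user credentials: {gcloud_file}"

    # Step 3: nothing local was found, so only the metadata server remains.
    return "metadata server (only available on Google Cloud compute)"


if __name__ == "__main__":
    print(likely_adc_source())
```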
How it works
- The extractor (connectors/bigquery/extractor.py) creates extract.duckdb with remote views
- _remote_attach table tells the orchestrator how to ATTACH the BigQuery extension at query time
- Queries go directly to BigQuery; no data is downloaded to local storage
- Identifier validation (validate_identifier, validate_quoted_identifier) protects against injection
Hybrid Queries
For queries that JOIN local data with BigQuery results:
da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
--register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web GROUP BY 1"
Jira Connector
Real-time webhook-based connector that updates parquet files incrementally.
How it works
- Jira webhooks hit the /api/jira/webhook endpoint
- The connector (connectors/jira/) processes webhook events and updates parquet files
- Produces extract.duckdb with _meta table + incremental parquet data
Writing a Custom Connector
Create a new connector in connectors/<name>/extractor.py that produces the extract.duckdb contract:
/data/extracts/{source_name}/
├── extract.duckdb ← _meta table + views
└── data/ ← parquet files (local sources only)
Required: _meta table
CREATE TABLE _meta (
    table_name VARCHAR,
    description VARCHAR,
    rows INTEGER,
    size_bytes INTEGER,
    extracted_at TIMESTAMP,
    query_mode VARCHAR -- 'local' or 'remote'
);
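As an illustration of how a custom extractor might fill this table, here is a minimal Python sketch. The names MetaRow, build_meta_row and write_meta are hypothetical and are not taken from the existing connectors. The sketch also assumes that remote tables report size_bytes as 0, and that con is any connection object with an execute(sql, params) method, such as one returned by duckdb.connect().

```python
from dataclasses import astuple, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Optional

# Same columns as the contract above; IF NOT EXISTS is added so reruns are safe.
META_DDL = """
CREATE TABLE IF NOT EXISTS _meta (
    table_name VARCHAR,
    description VARCHAR,
    rows INTEGER,
    size_bytes INTEGER,
    extracted_at TIMESTAMP,
    query_mode VARCHAR
)
"""


@dataclass
class MetaRow:
    table_name: str
    description: str
    rows: int
    size_bytes: int
    extracted_at: datetime
    query_mode: str


def build_meta_row(
    table_name: str,
    description: str,
    rows: int,
    parquet_path: Optional[Path] = None,
) -> MetaRow:
    """Describe one table; a parquet path marks it as 'local', otherwise 'remote'."""
    is_local = parquet_path is not None
    return MetaRow(
        table_name=table_name,
        description=description,
        rows=rows,
        # Assumption: remote tables have no local file, so their size is recorded as 0.
        size_bytes=Path(parquet_path).stat().st_size if is_local else 0,
        # Naive UTC timestamp, matching the plain TIMESTAMP column type.
        extracted_at=datetime.now(timezone.utc).replace(tzinfo=None),
        query_mode="local" if is_local else "remote",
    )


def write_meta(con, meta_rows: Iterable[MetaRow]) -> None:
    """Create _meta if needed and insert one row per table."""
    con.execute(META_DDL)
    for row in meta_rows:
        con.execute("INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)", list(astuple(row)))
```

Passing the connection in, rather than opening it inside the helper, keeps the sketch independent of where the extractor places extract.duckdb.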
Optional: _remote_attach table (for remote sources)
CREATE TABLE _remote_attach (
    alias VARCHAR,     -- DuckDB alias used in views
    extension VARCHAR, -- Extension name
    url VARCHAR,       -- Connection URL
    token_env VARCHAR  -- Env-var name holding the auth token (NOT the token itself)
);
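Because token_env stores only the name of an environment variable, whatever reads this table has to look the secret up at ATTACH time. The sketch below shows one way that lookup could work. RemoteAttach and resolve_token are hypothetical names, and treating an empty token_env as "no token needed" (for example when ADC is used) is an assumption, not documented orchestrator behavior.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RemoteAttach(NamedTuple):
    """One row of the _remote_attach table."""
    alias: str
    extension: str
    url: str
    token_env: str


def resolve_token(
    row: RemoteAttach,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the secret named by row.token_env, or None if no token is configured."""
    env = os.environ if env is None else env
    if not row.token_env:
        # Assumption: an empty name means the extension authenticates some other way.
        return None
    token = env.get(row.token_env)
    if not token:
        # Fail loudly, but never echo a secret value into the error message.
        raise RuntimeError(
            f"Environment variable {row.token_env!r} is not set for alias {row.alias!r}"
        )
    return token
```

Keeping the secret out of extract.duckdb means the file can be copied or inspected without exposing credentials.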
Identifier validation
Import shared validators from src/identifier_validation.py:
from src.identifier_validation import validate_identifier, validate_quoted_identifier
Use validate_identifier() for strict names (alphanumeric + underscore) and validate_quoted_identifier() for names that may contain dots/hyphens (e.g., Keboola-style in.c-crm.orders).
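To make the difference between the two rule sets concrete, here is a stand-in sketch. It does not reproduce the code in src/identifier_validation.py: the function names is_strict_identifier and is_quoted_identifier, the regular expressions, and the boolean return values are all assumptions chosen to match the description above.

```python
import re

# Assumed patterns, based only on the prose description of the two validators.
_STRICT = re.compile(r"[A-Za-z0-9_]+")      # alphanumeric + underscore
_QUOTED = re.compile(r"[A-Za-z0-9_.\-]+")   # additionally allows dots and hyphens


def is_strict_identifier(name: str) -> bool:
    """True if the whole name consists of letters, digits and underscores."""
    return bool(_STRICT.fullmatch(name))


def is_quoted_identifier(name: str) -> bool:
    """True if the whole name is safe to place inside a quoted identifier."""
    return bool(_QUOTED.fullmatch(name))


if __name__ == "__main__":
    for candidate in ["orders", "in.c-crm.orders", 'x"; DROP TABLE y; --']:
        print(candidate, is_strict_identifier(candidate), is_quoted_identifier(candidate))
```

Using fullmatch rather than search means a name is rejected if any character, including a trailing newline or quote, falls outside the allowed set.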
The SyncOrchestrator auto-discovers connectors by scanning /data/extracts/*/extract.duckdb — no registration step needed beyond producing the correct output format.
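The discovery scan can be pictured with a short sketch. This is not the SyncOrchestrator code; discover_extracts is a hypothetical helper that assumes the source name is simply the directory that contains extract.duckdb.

```python
from pathlib import Path
from typing import Dict, Union


def discover_extracts(root: Union[str, Path] = "/data/extracts") -> Dict[str, Path]:
    """Map each source name to its extract.duckdb, looking one level below root."""
    found: Dict[str, Path] = {}
    for db_path in sorted(Path(root).glob("*/extract.duckdb")):
        if db_path.is_file():
            # Assumption: the parent directory name is the source name.
            found[db_path.parent.name] = db_path
    return found


if __name__ == "__main__":
    for source, path in discover_extracts().items():
        print(f"{source}: {path}")
```

A directory without an extract.duckdb file is ignored, which is why a new connector only needs to write its output to the expected location.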
See connectors/keboola/ for a complete batch-pull reference implementation, or connectors/bigquery/ for a remote-attach example.