feat(ci+tests): deploy safety audit — linting, rollback, smoke tests, 50+ new tests (#120)

Comprehensive deploy safety audit implementing 19 improvements across CI/CD pipeline, test coverage, and source code.

### CI/CD Pipeline
- ruff + mypy added to both release.yml and keboola-deploy.yml (continue-on-error)
- Smoke test added to keboola-deploy.yml (was missing)
- Automatic rollback on smoke test failure in release.yml
- Expanded smoke-test.sh with catalog, admin/tables, marketplace.zip, metrics
- Required status checks via .github/settings.yml
- Dependabot + CODEOWNERS + pre-commit hooks + ruff config

### Source Code
- DB schema version check in /api/health (db_schema: ok/mismatch/unhealthy)
- Config versioning (config_version: 1 in instance.yaml, non-blocking validation)
- BigQuery extractor ATTACH error handling (try/except around INSTALL+ATTACH)
- Post-deploy smoke test script for prod VM validation

### Test Coverage (~50 new tests)
- v13->v14 migration, Email magic link TTL, PAT, Marketplace ZIP/Git,
  Jira webhooks, Hybrid Query BQ, Keboola/BQ extractor failure modes,
  Orchestrator failure modes

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>

2026-04-29 09:18:55 +02:00

Data Sources

Overview

AI Data Analyst uses a connector system in which each connector produces an extract.duckdb file that follows a standard contract. The SyncOrchestrator auto-discovers these files and ATTACHes them to the master analytics.duckdb.

Configure the data source type in config/instance.yaml:

data_source:
  type: "keboola"  # Options: keboola, bigquery

Table definitions are stored in the DuckDB table_registry table (not in config files). Register tables via the admin API, CLI, or web UI.

Query Modes

Each table has a query_mode that determines how data is accessed:

local: Data is downloaded to parquet files on the Agnes server. Suitable for tables that fit in local storage.
remote: Data stays in the external source; a DuckDB extension ATTACHes the source at query time. Suitable for large tables where only query results are transferred.

Keboola Connector

Syncs tables from Keboola Storage API using the DuckDB Keboola extension.

Requirements

Keboola Storage API token with read access
DuckDB Keboola extension (auto-installed)

Configuration

In .env:

KEBOOLA_STORAGE_TOKEN=your-token-here
KEBOOLA_STACK_URL=https://connection.your-region.keboola.com
KEBOOLA_PROJECT_ID=12345

Or configure via the admin UI (/admin/tables) or CLI:

da admin register-table --source-type keboola --bucket "in.c-crm" --table "company" --query-mode local
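
A startup check for the .env settings above could look like the following minimal sketch. The variable names come from the example; the helper name `missing_keboola_vars` and the assumption that all three variables are mandatory are illustrative, since this document does not say which ones are optional.

```python
import os
from typing import Mapping

# Variable names taken from the .env example above.
REQUIRED_KEBOOLA_VARS = (
    "KEBOOLA_STORAGE_TOKEN",
    "KEBOOLA_STACK_URL",
    "KEBOOLA_PROJECT_ID",
)


def missing_keboola_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank.

    Hypothetical helper: treating all three variables as required is an assumption.
    """
    return [name for name in REQUIRED_KEBOOLA_VARS if not env.get(name, "").strip()]
```

Calling `missing_keboola_vars()` before a sync and logging the returned names gives an early, readable failure without ever printing the token value.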

How it works

The extractor (connectors/keboola/extractor.py) uses the DuckDB Keboola extension to download data.
It produces extract.duckdb with a _meta table, plus parquet files in /data/extracts/keboola/data/.
The SyncOrchestrator ATTACHes extract.duckdb into analytics.duckdb and creates views.

Identifier validation

All Keboola table names, bucket names, and source table identifiers are validated against the _SAFE_QUOTED_IDENTIFIER regex before use. Invalid identifiers are skipped and an error is logged.

BigQuery Connector

Queries BigQuery tables on demand using the DuckDB BigQuery extension (remote attach).

Requirements

Google Cloud project with BigQuery access
Application Default Credentials (ADC) configured

Configuration

In config/instance.yaml:

bigquery:
  project_id: "your-gcp-project"

Or via the admin UI or CLI:

da admin register-table --source-type bigquery --bucket "dataset" --table "table" --query-mode remote

Authentication

Uses Application Default Credentials (ADC) — the standard Google auth fallback chain:

GOOGLE_APPLICATION_CREDENTIALS env var (service account key JSON)
gcloud user credentials (gcloud auth application-default login)
GCE metadata server (automatic on Compute Engine)

No explicit key file configuration needed — ADC handles it.
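
When debugging authentication, it can help to see which step of the chain above is likely to supply credentials. The sketch below is a diagnostic aid only, not how the connector authenticates: the function name `likely_adc_source` is hypothetical, and it only checks the environment variable and the file that `gcloud auth application-default login` writes on Linux and macOS.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def likely_adc_source(
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> str:
    """Guess which ADC step would supply credentials, in lookup order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # 1. Explicit service account key file.
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GOOGLE_APPLICATION_CREDENTIALS"

    # 2. File written by `gcloud auth application-default login`
    #    (Linux/macOS location; Windows keeps it under %APPDATA%/gcloud).
    gcloud_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if gcloud_file.is_file():
        return "gcloud user credentials"

    # 3. Nothing found locally: on Compute Engine the metadata server answers.
    return "GCE metadata server (if running on Google Cloud)"
```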

How it works

The extractor (connectors/bigquery/extractor.py) creates extract.duckdb with remote views.
The _remote_attach table tells the orchestrator how to load the BigQuery extension and ATTACH the source at query time.
Queries go directly to BigQuery; no data is downloaded to local storage.
Identifier validation (validate_identifier, validate_quoted_identifier) protects against SQL injection.

Hybrid Queries

For queries that JOIN local data with BigQuery results:

da query --sql "SELECT o.*, t.views FROM orders o JOIN traffic t ON o.date = t.date" \
         --register-bq "traffic=SELECT date, SUM(views) as views FROM dataset.web GROUP BY 1"
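
The `--register-bq` value in the command above has the shape `alias=SQL`. How the `da` CLI parses it is not shown in this document; the sketch below only illustrates one reasonable way to split such a value, with `parse_register_bq` as a hypothetical name. Splitting on the first `=` matters because the SQL itself may contain `=`.

```python
def parse_register_bq(spec: str) -> tuple[str, str]:
    """Split 'alias=SQL' on the first '=' only; the SQL may contain '=' itself.

    Hypothetical helper, not the da CLI's actual parser.
    """
    alias, sep, sql = spec.partition("=")
    alias, sql = alias.strip(), sql.strip()
    if not sep or not alias.isidentifier() or not sql:
        raise ValueError(f"expected 'alias=SELECT ...', got: {spec!r}")
    return alias, sql
```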

Jira Connector

Real-time webhook-based connector that updates parquet files incrementally.

How it works

Jira webhooks hit the /api/jira/webhook endpoint.
The connector (connectors/jira/) processes webhook events and updates parquet files.
It produces extract.duckdb with a _meta table plus incremental parquet data.
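
The incremental update in the steps above can be pictured as an upsert keyed by issue key. The sketch below is an assumption about the general shape, not the code in connectors/jira/: it keeps issues in an in-memory dict instead of parquet, and `apply_webhook_event` is a hypothetical name. The event names and the `webhookEvent` / `issue.key` / `issue.fields` payload fields follow Jira's webhook format.

```python
from typing import Any

# Event names follow Jira's webhook naming; how the connector handles them is assumed.
UPSERT_EVENTS = {"jira:issue_created", "jira:issue_updated"}
DELETE_EVENTS = {"jira:issue_deleted"}


def apply_webhook_event(issues: dict[str, dict[str, Any]], payload: dict[str, Any]) -> bool:
    """Apply one webhook payload to an issue store keyed by issue key.

    Returns True when the store changed, i.e. the parquet data would need rewriting.
    """
    event = payload.get("webhookEvent")
    issue = payload.get("issue") or {}
    key = issue.get("key")
    if not key:
        return False  # not an issue event, nothing to do
    if event in UPSERT_EVENTS:
        issues[key] = issue.get("fields", {})
        return True
    if event in DELETE_EVENTS:
        return issues.pop(key, None) is not None
    return False
```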

Writing a Custom Connector

Create a new connector in connectors/<name>/extractor.py whose output follows the extract.duckdb contract:

/data/extracts/{source_name}/
├── extract.duckdb          ← _meta table + views
└── data/                   ← parquet files (local sources only)

Required: `_meta` table

CREATE TABLE _meta (
    table_name   VARCHAR,
    description  VARCHAR,
    rows         INTEGER,
    size_bytes   INTEGER,
    extracted_at TIMESTAMP,
    query_mode   VARCHAR   -- 'local' or 'remote'
);
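
A connector might build each `_meta` row as in the minimal sketch below. The column order follows the CREATE TABLE statement above; the names `META_INSERT_SQL` and `meta_row`, and recording a size of 0 for remote tables, are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

# Six placeholders, one per _meta column, in declaration order.
META_INSERT_SQL = "INSERT INTO _meta VALUES (?, ?, ?, ?, ?, ?)"


def meta_row(
    table_name: str,
    description: str,
    rows: int,
    parquet_path: Optional[Path] = None,
    query_mode: str = "local",
) -> tuple:
    """Build the parameter tuple for one _meta row (hypothetical helper)."""
    if query_mode not in ("local", "remote"):
        raise ValueError(f"query_mode must be 'local' or 'remote', got {query_mode!r}")
    # Assumption: tables without a local parquet file are recorded with size 0.
    size_bytes = parquet_path.stat().st_size if parquet_path is not None else 0
    # Naive UTC datetime, suitable for a TIMESTAMP column.
    extracted_at = datetime.now(timezone.utc).replace(tzinfo=None)
    return (table_name, description, rows, size_bytes, extracted_at, query_mode)
```

With the duckdb Python package, the tuple would typically be passed as `con.execute(META_INSERT_SQL, row)` on a connection opened against extract.duckdb.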

Optional: `_remote_attach` table (for remote sources)

CREATE TABLE _remote_attach (
    alias     VARCHAR,  -- DuckDB alias used in views
    extension VARCHAR,  -- Extension name
    url       VARCHAR,  -- Connection URL
    token_env VARCHAR   -- Env-var name holding the auth token (NOT the token itself)
);
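
Because `token_env` stores only the variable name, whoever consumes a `_remote_attach` row has to resolve the secret at attach time. The sketch below shows one way to do that; `resolve_token` is a hypothetical helper, and treating an empty `token_env` as "no token needed" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_token(
    token_env: Optional[str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Look up the secret named by a _remote_attach.token_env value.

    Returns None when the row names no variable (assumed to mean auth is handled elsewhere).
    """
    env = os.environ if env is None else env
    if not token_env:
        return None
    token = env.get(token_env)
    if not token:
        # Report the variable name only; never echo secret values.
        raise RuntimeError(f"environment variable {token_env} is not set")
    return token
```

Keeping the lookup this late means the token never lands in extract.duckdb, which is the point of storing the name instead of the value.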

Identifier validation

Import shared validators from src/identifier_validation.py:

from src.identifier_validation import validate_identifier, validate_quoted_identifier

Use validate_identifier() for strict names (alphanumeric + underscore) and validate_quoted_identifier() for names that may contain dots/hyphens (e.g., Keboola-style in.c-crm.orders).
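
The difference between the two validators can be illustrated with the stand-ins below. The real patterns and signatures live in src/identifier_validation.py and are not reproduced in this document, so the regexes, the function names `is_strict_identifier` and `is_quoted_identifier`, and the boolean return type are all assumptions.

```python
import re

# Assumed patterns, for illustration only.
_STRICT = re.compile(r"[A-Za-z0-9_]+")       # alphanumeric + underscore
_QUOTED = re.compile(r"[A-Za-z0-9_.\-]+")    # additionally allows dots and hyphens


def is_strict_identifier(name: str) -> bool:
    """True if the whole name consists of letters, digits, and underscores."""
    return _STRICT.fullmatch(name) is not None


def is_quoted_identifier(name: str) -> bool:
    """True if the whole name is safe to embed inside a quoted identifier."""
    return _QUOTED.fullmatch(name) is not None
```

Under these assumed patterns, `in.c-crm.orders` passes only the quoted check, while anything containing quotes, spaces, or semicolons fails both.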

The SyncOrchestrator auto-discovers connectors by scanning /data/extracts/*/extract.duckdb — no registration step needed beyond producing the correct output format.
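
The discovery scan described above can be sketched with a glob over the extracts directory. This is not the SyncOrchestrator's code: the function names are hypothetical, and attaching read-only under the directory name as alias is an assumption.

```python
import re
from pathlib import Path

_ALIAS = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def discover_extracts(root: Path) -> dict[str, Path]:
    """Map each source name to its extract.duckdb, mirroring <root>/*/extract.duckdb."""
    return {
        db.parent.name: db
        for db in sorted(root.glob("*/extract.duckdb"))
        if db.is_file()
    }


def attach_statements(extracts: dict[str, Path]) -> list[str]:
    """Build one ATTACH statement per extract (READ_ONLY is an assumption)."""
    statements = []
    for name, path in extracts.items():
        if not _ALIAS.fullmatch(name):
            raise ValueError(f"unsafe source name: {name!r}")
        quoted_path = path.as_posix().replace("'", "''")  # escape single quotes for SQL
        statements.append(f"ATTACH '{quoted_path}' AS {name} (READ_ONLY);")
    return statements
```

Running `attach_statements(discover_extracts(Path("/data/extracts")))` would yield one statement per connector directory that contains an extract.duckdb file; directories without one are ignored.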

See connectors/keboola/ for a complete batch-pull reference implementation, or connectors/bigquery/ for a remote-attach example.
