- New sync_schedule and profile_after_sync fields in TableConfig (formats: "every 15m", "every 1h", "daily 05:00")
- New src/scheduler.py with schedule evaluation logic (is_table_due)
- New --scheduled mode in data_sync.py: only syncs tables that are due, respects profile_after_sync flag, auto-restarts webapp after profiling
- Systemd timer+service for data-refresh (every 15 min)
- Systemd timer+service for catalog-refresh (every 15 min)
- deploy.sh enables new timers automatically
- Complete table config reference in data_description.md.example
- 58 new scheduler tests

# Data Description

This file defines the tables available for synchronization and analysis.
Copy this file to `data_description.md` and customize for your data sources.

## Tables

```yaml
# Folder mapping: data source bucket -> local folder name
folder_mapping:
  "in.c-example": "example"

tables:
  # Small reference table - full refresh, no automatic sync
  - id: "in.c-example.customers"
    name: "customers"
    description: "Customer master data"
    primary_key: "id"
    sync_strategy: "full_refresh"

  # Large transactional table - daily automatic sync with profiling
  - id: "in.c-example.orders"
    name: "orders"
    description: "Order transactions with line items"
    primary_key: "id"
    sync_strategy: "partitioned"
    partition_by: "created_at"
    partition_column_type: "DATE"
    partition_granularity: "day"
    incremental_window_days: 3
    max_history_days: 450
    query_mode: "local"
    sync_schedule: "daily 05:00"
    profile_after_sync: true

  # Frequently updated table - sync every hour, skip profiling
  - id: "in.c-example.events"
    name: "events"
    description: "Real-time event stream"
    primary_key: "event_id"
    sync_strategy: "partitioned"
    partition_by: "event_date"
    partition_column_type: "DATE"
    partition_granularity: "day"
    incremental_window_days: 1
    query_mode: "local"
    sync_schedule: "every 1h"
    profile_after_sync: false
```

## Table Configuration Reference

### Required Fields

| Field | Description | Example |
|-------|-------------|---------|
| `id` | Full table identifier in data source | `"in.c-crm.company"` |
| `name` | Short name (used for Parquet filenames) | `"company"` |
| `description` | Human-readable description | `"Company master data"` |
| `primary_key` | Primary key column(s), comma-separated | `"id"` or `"order_id, line_id"` |
| `sync_strategy` | How data is downloaded (see below) | `"full_refresh"` |

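
As an illustration of the required fields, the following minimal sketch checks a table entry for missing fields and splits a comma-separated `primary_key` into column names. The helper names `missing_required_fields` and `primary_key_columns` are hypothetical and not part of the project; the entry is assumed to be a plain dict already loaded from the YAML block.

```python
# Hypothetical helpers for illustration; not the project's actual validation code.
REQUIRED_FIELDS = ("id", "name", "description", "primary_key", "sync_strategy")


def missing_required_fields(table: dict) -> list:
    """Return the required fields that are absent or empty in a table entry."""
    return [field for field in REQUIRED_FIELDS if not table.get(field)]


def primary_key_columns(table: dict) -> list:
    """Split a comma-separated primary_key value into trimmed column names."""
    return [col.strip() for col in table["primary_key"].split(",") if col.strip()]


# Example entry with a composite primary key (made-up table for illustration).
table = {
    "id": "in.c-example.order_lines",
    "name": "order_lines",
    "description": "Order line items",
    "primary_key": "order_id, line_id",
    "sync_strategy": "partitioned",
}

print(missing_required_fields(table))
print(primary_key_columns(table))
```

An entry that only sets `id` would report the other four fields as missing.
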
### Sync Strategy

| Strategy | Description | Use for |
|----------|-------------|---------|
| `full_refresh` | Downloads entire table each sync | Small reference tables (< 100K rows) |
| `incremental` | Downloads changed rows via `changedSince` | Medium tables with update tracking |
| `partitioned` | Downloads by time partitions, overwrites only recent ones | Large tables with date column |

### Partitioning

| Field | Default | Description |
|-------|---------|-------------|
| `partition_by` | *(none)* | Column to partition by (e.g., `"created_at"`, `"event_date"`) |
| `partition_granularity` | `"month"` | `"day"`, `"month"`, or `"year"` |
| `partition_column_type` | `"TIMESTAMP"` | SQL type: `"DATE"`, `"TIMESTAMP"`, or `"DATETIME"` |
| `incremental_window_days` | `7` | How many recent days to re-download on each sync |
| `max_history_days` | *(all)* | Maximum history to keep (e.g., `450` for ~15 months) |
| `initial_load_chunk_days` | `30` | Chunk size for first-time download |

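
To make `incremental_window_days` and `max_history_days` concrete, here is a rough sketch for daily partitions. The function names are hypothetical, and the boundary rules (the window ends today and includes it; the cutoff is exactly `max_history_days` before today) are assumptions, not confirmed behavior of the sync code.

```python
from datetime import date, timedelta
from typing import List, Optional


def partitions_to_refresh(today: date, incremental_window_days: int = 7) -> List[date]:
    """Daily partitions inside the re-download window, oldest first.

    Assumption: the window ends today and spans incremental_window_days days.
    """
    start = today - timedelta(days=incremental_window_days - 1)
    return [start + timedelta(days=i) for i in range(incremental_window_days)]


def history_cutoff(today: date, max_history_days: Optional[int]) -> Optional[date]:
    """Oldest date to keep, or None when history is unlimited (field omitted)."""
    if max_history_days is None:
        return None
    return today - timedelta(days=max_history_days)


today = date(2024, 6, 15)
# With incremental_window_days: 3, only the three most recent days are re-downloaded.
print(partitions_to_refresh(today, incremental_window_days=3))
# With max_history_days: 450, older partitions fall outside the retained history.
print(history_cutoff(today, max_history_days=450))
```

Under these assumptions, a table configured like `orders` in the example above would re-download three daily partitions per sync and retain roughly 15 months of history.
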
### Query Mode

| Field | Default | Description |
|-------|---------|-------------|
| `query_mode` | `"local"` | How the AI agent queries this table |

| Mode | Description | Best for |
|------|-------------|----------|
| `local` | Synced to Parquet, queried via DuckDB | Tables < 2 GB, fast queries |
| `remote` | Not synced, queried via BigQuery | Huge tables (100+ GB), live data |
| `hybrid` | Subset synced for profiling, queries go to BigQuery | Medium tables needing live data |

### Automatic Sync Schedule

| Field | Default | Description |
|-------|---------|-------------|
| `sync_schedule` | *(none)* | When to automatically sync this table |
| `profile_after_sync` | `true` | Run data profiler after sync completes |

The `sync_schedule` field controls automatic synchronization via the `data-refresh`
systemd timer (runs every 15 minutes). If omitted, the table is only synced manually.

**Schedule formats:**

| Format | Example | Description |
|--------|---------|-------------|
| `every {N}m` | `"every 15m"`, `"every 30m"` | Sync every N minutes |
| `every {N}h` | `"every 1h"`, `"every 6h"` | Sync every N hours |
| `daily HH:MM` | `"daily 05:00"`, `"daily 17:30"` | Sync once per day at HH:MM UTC |
| *(omitted)* | - | Manual sync only (`python -m src.data_sync`) |

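
The three schedule formats can be parsed with two small regular expressions. This is only a sketch: `parse_schedule` is a hypothetical name, and the strictness shown here (single spaces, two-digit `HH:MM`, positive `N`) is an assumption rather than the documented behavior of `src/scheduler.py`.

```python
import re
from datetime import time, timedelta
from typing import Union

# Assumed grammar: "every <N>m", "every <N>h", or "daily HH:MM".
_EVERY = re.compile(r"every (\d+)([mh])")
_DAILY = re.compile(r"daily (\d{2}):(\d{2})")


def parse_schedule(text: str) -> Union[timedelta, time]:
    """Return a timedelta for "every" schedules or a UTC time of day for "daily" ones."""
    match = _EVERY.fullmatch(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        if amount <= 0:
            raise ValueError(f"Interval must be positive: {text!r}")
        return timedelta(minutes=amount) if unit == "m" else timedelta(hours=amount)
    match = _DAILY.fullmatch(text)
    if match:
        # time() itself raises ValueError for out-of-range values such as 25:00.
        return time(int(match.group(1)), int(match.group(2)))
    raise ValueError(f"Unrecognized sync_schedule: {text!r}")


print(parse_schedule("every 15m"))
print(parse_schedule("every 6h"))
print(parse_schedule("daily 17:30"))
```

Returning a `timedelta` or a `time` keeps the two schedule kinds distinguishable by type, so a later due-check can branch on `isinstance` without a separate tag.
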
**How scheduling works:**
- A systemd timer runs `python -m src.data_sync --scheduled` every 15 minutes
- For each table with `sync_schedule`, it checks the last sync time from `sync_state.json`
- `every` schedules: syncs if enough time has elapsed since last sync
- `daily` schedules: syncs once after the target time passes (skips if already synced today)
- Tables without `sync_schedule` are never synced automatically

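
The rules above can be sketched as a single due-check. This is not the actual `is_table_due` from `src/scheduler.py`; `is_due_sketch` is a hypothetical stand-in that takes an already-parsed schedule (a `timedelta` for `every`, a `time` for `daily`) and assumes all timestamps are timezone-aware UTC. Treating a never-synced table as due is also an assumption.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Optional, Union


def is_due_sketch(
    schedule: Union[timedelta, time],
    last_sync: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether a table should sync on this timer run (UTC datetimes)."""
    if isinstance(schedule, timedelta):
        # "every" schedules: due once the interval has elapsed since the last sync.
        # Assumption: a table with no recorded sync is due immediately.
        return last_sync is None or now - last_sync >= schedule

    # "daily" schedules: due after today's target time, unless already synced today.
    target = datetime.combine(now.date(), schedule, tzinfo=timezone.utc)
    if now < target:
        return False
    return last_sync is None or last_sync.date() < now.date()


now = datetime(2024, 6, 15, 5, 15, tzinfo=timezone.utc)
print(is_due_sketch(timedelta(hours=1), now - timedelta(minutes=45), now))
print(is_due_sketch(time(5, 0), datetime(2024, 6, 14, 5, 0, tzinfo=timezone.utc), now))
print(is_due_sketch(time(5, 0), datetime(2024, 6, 15, 5, 0, tzinfo=timezone.utc), now))
```

Because the timer fires every 15 minutes, a check like this is evaluated at that granularity: a `daily 05:00` table is picked up by the first run at or after 05:00 UTC.
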
**Profiling control:**
- `profile_after_sync: true` (default) - runs profiler after sync to update column statistics
- `profile_after_sync: false` - skips profiler (use for frequently synced tables where
  profiling overhead is not worth it; the AI agent uses slightly older statistics)
- When profiling runs, the webapp is automatically restarted to load new statistics

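
A minimal sketch of the profiling decision after a sync run, assuming each synced table is available as a dict of its config fields. `tables_to_profile` and the sample entries are hypothetical; how the real `--scheduled` mode invokes the profiler and restarts the webapp is not shown here.

```python
def tables_to_profile(synced_tables: list) -> list:
    """Names of synced tables whose profile_after_sync is true (the default)."""
    return [t["name"] for t in synced_tables if t.get("profile_after_sync", True)]


# Hypothetical result of one sync run.
synced = [
    {"name": "orders", "profile_after_sync": True},
    {"name": "events", "profile_after_sync": False},
    {"name": "products"},  # flag omitted, so the default (true) applies
]

to_profile = tables_to_profile(synced)
# The webapp restart is only needed when the profiler actually ran.
restart_webapp = bool(to_profile)
print(to_profile, restart_webapp)
```
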
### Optional Fields

| Field | Default | Description |
|-------|---------|-------------|
| `folder` | *(from folder_mapping)* | Override output folder name |
| `row_filter` | *(none)* | SQL WHERE clause (e.g., `"date >= DATE_SUB(CURRENT_DATE(), INTERVAL 15 MONTH)"`) |
| `columns` | *(all)* | List of columns to sync (subset) |
| `incremental_column` | *(none)* | Column for timestamp-based incremental sync (BigQuery) |
| `dataset` | *(none)* | Dataset group name for on-demand tables |
| `catalog_fqn` | *(auto)* | OpenMetadata FQN override (auto-derived from table ID if not set) |
| `foreign_keys` | `[]` | List of foreign key relationships |
| `where_filters` | `[]` | List of filters for Keboola Storage API |

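
The interplay between `folder` and `folder_mapping` might look like the sketch below. `resolve_folder` is a hypothetical helper, and deriving the bucket by dropping the last dot-separated segment of the table `id` is an assumption based on the example IDs in this file.

```python
def resolve_folder(table: dict, folder_mapping: dict) -> str:
    """Pick the output folder: an explicit `folder` wins, else map the bucket."""
    if table.get("folder"):
        return table["folder"]
    # Assumption: "in.c-example.orders" belongs to bucket "in.c-example".
    bucket = table["id"].rsplit(".", 1)[0]
    return folder_mapping[bucket]  # raises KeyError if the bucket is not mapped


folder_mapping = {"in.c-example": "example"}
print(resolve_folder({"id": "in.c-example.orders"}, folder_mapping))
print(resolve_folder({"id": "in.c-example.orders", "folder": "archive"}, folder_mapping))
```
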