Open-source AI data analyst platform extracted from internal repo. Includes data sync engine, Keboola adapter, Flask web portal, server deployment scripts, and configuration templates.
# Data Description

This file defines the tables available for synchronization and analysis.
Copy this file to `data_description.md` and customize for your data sources.

## Tables

```yaml
# Folder mapping: data source bucket -> local folder name
folder_mapping:
  "in.c-example": "example"

tables:
  - id: "in.c-example.customers"
    name: "customers"
    description: "Customer master data"
    primary_key: "id"
    sync_strategy: "full_refresh"

  - id: "in.c-example.orders"
    name: "orders"
    description: "Order transactions with line items"
    primary_key: "id"
    sync_strategy: "incremental"
    incremental_window_days: 7
    partition_by: "created_at"
    partition_granularity: "month"
```
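
Before running a sync, it can help to sanity-check a customized copy of this file. The following is a minimal sketch, not the sync engine's own validation: `validate_tables`, `REQUIRED_KEYS`, and `KNOWN_STRATEGIES` are hypothetical names, the list of required keys is inferred from the example entries above, and the config is assumed to be already parsed into a plain dictionary.

```python
from typing import Any

# Hypothetical required keys, inferred from the example entries (not confirmed by the project).
REQUIRED_KEYS = ("id", "name", "description", "primary_key", "sync_strategy")
KNOWN_STRATEGIES = ("full_refresh", "incremental")


def validate_tables(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    buckets = config.get("folder_mapping", {})
    for entry in config.get("tables", []):
        table_id = entry.get("id", "<missing id>")
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"{table_id}: missing '{key}'")
        strategy = entry.get("sync_strategy")
        if strategy is not None and strategy not in KNOWN_STRATEGIES:
            problems.append(f"{table_id}: unknown sync_strategy '{strategy}'")
        # Assumption: the bucket is everything before the last dot of the table id.
        bucket = table_id.rpartition(".")[0]
        if bucket not in buckets:
            problems.append(f"{table_id}: bucket '{bucket}' not in folder_mapping")
    return problems


example = {
    "folder_mapping": {"in.c-example": "example"},
    "tables": [
        {
            "id": "in.c-example.customers",
            "name": "customers",
            "description": "Customer master data",
            "primary_key": "id",
            "sync_strategy": "full_refresh",
        },
        # Deliberately broken entry to show the kind of problems reported.
        {"id": "in.c-other.events", "name": "events", "sync_strategy": "weekly"},
    ],
}

for problem in validate_tables(example):
    print(problem)
```

The first entry passes cleanly; the second is reported for its missing keys, its unknown strategy, and its unmapped bucket.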

## Sync Strategies

- **full_refresh**: Downloads entire table on each sync. Best for small reference tables.
- **incremental**: Downloads only new/changed rows based on a date column. Best for large transactional tables.
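
One plausible reading of the incremental strategy is sketched below. It assumes that `incremental_window_days` means "re-fetch rows whose date falls within the last N days"; the source does not spell this out, so treat it as an illustration only. The helpers `incremental_cutoff` and `rows_to_sync`, and the `date_column` parameter, are hypothetical.

```python
from datetime import date, timedelta


def incremental_cutoff(today: date, window_days: int) -> date:
    """Earliest date still inside the window (assumed meaning of incremental_window_days)."""
    return today - timedelta(days=window_days)


def rows_to_sync(rows: list[dict], date_column: str, cutoff: date) -> list[dict]:
    """Keep only rows whose date column falls on or after the cutoff."""
    return [row for row in rows if row[date_column] >= cutoff]


orders = [
    {"id": 1, "created_at": date(2024, 1, 2)},   # outside a 7-day window
    {"id": 2, "created_at": date(2024, 1, 14)},  # inside a 7-day window
]
cutoff = incremental_cutoff(date(2024, 1, 15), window_days=7)
recent = rows_to_sync(orders, "created_at", cutoff)
print(cutoff, [row["id"] for row in recent])
```

A full refresh, by contrast, would skip the filter and fetch every row on each run.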

## Partition Granularity

When using `partition_by`, data is split into separate Parquet files by time period:
- **month**: One file per month (e.g., `orders/2024-01.parquet`)
- **day**: One file per day (e.g., `events/2024-01-15.parquet`)
- **none**: Single file (default)
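
The naming scheme above can be sketched as a small helper. This is an illustration rather than the sync engine's code: `partition_path` is a hypothetical function, and the file name used for the `none` case (`data.parquet`) is an assumption, since the source only says a single file is written.

```python
from datetime import date


def partition_path(table: str, day: date, granularity: str = "none") -> str:
    """Map a row's partition date to the Parquet file it would land in."""
    if granularity == "month":
        return f"{table}/{day:%Y-%m}.parquet"
    if granularity == "day":
        return f"{table}/{day:%Y-%m-%d}.parquet"
    if granularity == "none":
        # Hypothetical file name: the source does not say what the single file is called.
        return f"{table}/data.parquet"
    raise ValueError(f"unknown partition_granularity: {granularity!r}")


print(partition_path("orders", date(2024, 1, 15), "month"))
print(partition_path("events", date(2024, 1, 15), "day"))
```

With the example dates, the two calls produce the same paths shown in the list above.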